Reimplement `has_key` for SQLite

giovannipizzi commented 1 month ago

In #6256, a bug was identified in the previous implementation of the has_key for SQLite. The bug is described there. To avoid returning wrong results, the feature has been disabled. We should re-enable it.

giovannipizzi commented 1 month ago

Here I explain the issue - I describe a bit more as I go through my thought process, that might be relevant for the implementation of the contains operator ~(issue to be opened yet)~ - see #6494.

Old problem

Let's consider this DB (I'm using SQLite Browser)

Let's first understand how json_each works in SQLite:

SELECT test_1.id, test_1.extras, anon_1.key, anon_1.value
FROM test AS test_1, json_each(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) AS anon_1
WHERE json_type(test_1.extras, '$') = 'object'

(here I'm just filtering on the first three rows that contain an object (i.e. a dict).

json_each has more columns, but for now I just focus on the key and value. Essentially, this is implicitly joining the two tables, and returning now not a row per entry in the test table, but a row per key in the top-level dictionary (in the extras column).

Therefore, the old query was simply filtering on these if there was a key with the correct value, something like

SELECT test_1.id, test_1.extras, anon_1.key, anon_1.value
FROM test AS test_1, json_each(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) AS anon_1
WHERE anon_1.key = 'b'

The problem was the negation: now, the negation is applied to the whole filter! So this would be

SELECT test_1.id, test_1.extras, anon_1.key, anon_1.value
FROM test AS test_1, json_each(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) AS anon_1
WHERE anon_1.key != 'b'

Indeed, this would return a result for every key-value pair where the key is not b, explaining #6256. This is clearly not what we want.

giovannipizzi commented 1 month ago

First attempt to a solution

The first attempt to a solution I did is with an aggregation sum and a group_by on the ID, essentially counting all rows per entry (corresponding to a DbNode in our case). And filtering only on those that have exactly 1 match:

SELECT 
SUM(anon_1.key = 'b') as count_keys, test_1.id, test_1.extras
FROM test AS test_1, json_each(test_1.extras, '$') AS anon_1
GROUP BY test_1.id
HAVING  count_keys = 1

This is correct, and now also the negation works!

SELECT 
SUM(anon_1.key = 'b') as count_keys, test_1.id, test_1.extras
FROM test AS test_1, json_each(test_1.extras, '$') AS anon_1
GROUP BY test_1.id
HAVING  count_keys != 1

Note that it correctly shows lists as the has_key should only be applied to dictionaries. This is not yet the correct query as it does not show integers, strings, etc.. However, there is a more technical subtle detail: the way QueryBuilder is modularized and implemented in SQLAlchemy, we just want to add a filter, and this is passed up through 3-4 function calls, and the actual query is constructed in a very different function. Replacing the whole query with SUM, GROUP_BY, ... is very difficult (and I'm not even sure how this works, even in pure SQL, if we have many filters on various nodes).

giovannipizzi commented 1 month ago

The solution

The final solution I found is to use simply this query, using the exists statement of SQL - this does not require a GROUP BY and not even explicit joining!

select test_1.* from test as test_1
WHERE EXISTS (
  SELECT * from json_each(test_1.extras, '$') AS anon_1 
  WHERE anon_1.key = 'b'
)

The negation becomes not exists, and it actually works pretty well, also on strings, ints, ... without special CASE statements!

select test_1.* from test as test_1
WHERE NOT EXISTS (
  SELECT * from json_each(test_1.extras, '$') AS anon_1 
  WHERE anon_1.key = 'b'
)

giovannipizzi commented 1 month ago

Implementing in AiiDA (SQLAlchemy)

The implementation is now pretty straightforward - this seems to be doing the job:

        if operator == 'has_key':
            from sqlalchemy import select
            return select(database_entity).where(func.json_each(database_entity).table_valued('key', joins_implicitly=True).c.key == value).exists()

How to test it

Here is an extended test from the one of @mbercx in #6256:

from aiida import orm, load_profile
from aiida.storage.sqlite_temp import SqliteTempBackend
temp_profile = SqliteTempBackend.create_profile('temp-profile')
load_profile(temp_profile, allow_switch=True)

node = orm.Int(1).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2, 'c': 3}})
node = orm.Int(2).store()
node.base.extras.set_many({'sub': {'b': 2, 'c': 3}})
node = orm.Int(3).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2}})
node = orm.Int(4).store()
node.base.extras.set_many({'sub': "a"})
node = orm.Int(5).store()
node.base.extras.set_many({'sub': 1})
node = orm.Int(6).store()
node.base.extras.set_many({'sub': ["a", "b"]})
node = orm.Int(7).store()
node.base.extras.set_many({'sub': ["c", "d"]})

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"!has_key": "a"}}

qbuild.append(
    orm.Int,
    project=['id', 'attributes.value'],
    filters=filters,
)
print(qbuild.all())
print(qbuild.as_sql(True))

That prints

[[2, 2], [4, 4], [5, 5], [6, 6], [7, 7]]
SELECT db_dbnode_1.id, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.attributes, '$."value"')) AS anon_1 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'data.core.int.%' ESCAPE '\' AND NOT (EXISTS (SELECT JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"')) AS anon_2 
FROM json_each(JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"'))) AS anon_3 
WHERE anon_3."key" = 'a'))

while replacing without negation: filters = {"extras.sub": {"has_key": "a"}} returns

[[1, 1], [3, 3]]
SELECT db_dbnode_1.id, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.attributes, '$."value"')) AS anon_1 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'data.core.int.%' ESCAPE '\' AND (EXISTS (SELECT JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"')) AS anon_2 
FROM json_each(JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"'))) AS anon_3 
WHERE anon_3."key" = 'a'))

This seems the correct result

giovannipizzi commented 1 month ago

Conclusion

This provides the implementation that needs to be put in AiiDA. I didn't test it much more, and unfortunately I won't have time until the second half of August - so I'm putting this here in case someone who has time wants to implement it and add some tests (pinging @sphuber and @danielhollas as they were involved earlier, in addition to @mbercx)

Also pinging @khsrali @eimrek @GeigerJ2 and @agoscinski as this is relevant for the support of Materials Cloud REST APIs directly via SQLite backends, and also for later benchmarking of performance.

Note on tests

I would add many tests, querying directly extras, some sub key of extras, in various forms, both dicts and lists, ints, floats, strings, bools, none - to make sure the queries work, and also that the result is identical between SQLite and PSQL

aiidateam / aiida-core