aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
431 stars 187 forks source link

QueryBuilder: 'contains' operator not implemented for SQLite storage backends #6494

Open danielhollas opened 3 months ago

danielhollas commented 3 months ago

https://github.com/aiidateam/aiida-core/blob/04926fe20da15065f8f086f1ff3cb14cc163aa08/src/aiida/storage/sqlite_zip/orm.py#L287

Some discussion about implementation: https://github.com/sqlalchemy/sqlalchemy/discussions/7836

giovannipizzi commented 2 months ago

Here I put my investigations on how to possibly implement this. This follows the reasoning #6552: so, before reading below, see there first for a general reasoning.

First attempt (with SUM/GROUP BY)

Like in #6552, I work on a test DB that looks like this:

Screenshot 2024-07-25 at 12 20 12

Here is a query that works (NOTE: this is only for arrays - see next comment to extend also to dictionaries), explanation below:

SELECT 
json_type(test_1.extras, '$') as type,
test_1.id, test_1.extras,
sum(filter_cond.value is not NULL) as num_found
FROM test AS test_1, json_each(test_1.extras, '$') AS anon_1
LEFT JOIN (SELECT value from json_each(json_array('a', 'c'))) AS filter_cond
ON anon_1.value = filter_cond.value 
GROUP BY test_1.id
HAVING  num_found = 2 /* this should be equal to the length of the json_array used above */

Here is the result:

Screenshot 2024-07-25 at 12 23 37

Explanation

Here, I am filtering for an array at the top level, containing the two elements 'a' and 'c' (via json_each(json_array('a', 'c'))). The query groups by all items in the list with those elements, and counts those found, thus finding - for each node or entry in the test table - the list of items that match the values you are querying for. NOTE: The last value num_found=2 must be adapted to be the length of the elements in the contains query, this should be done when generating the query. Only if the number is equal to 2 (in this example), it means that both are included, so the query returns the expected results.

Also negation should work correctly (replacing = 2 with != 2 at the end), the result:

Screenshot 2024-07-25 at 12 24 27

Note on extending to different types

One probably needs to add additional filters on the type, and use the CASE statement. An example (incorrect, but to remember the syntax) would be:

[...]
LEFT JOIN (SELECT value from json_each(json_array('a', 'c'))) AS filter_cond
ON 
CASE 
 WHEN json_type(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) = 'object' THEN anon_1.key
 WHEN json_type(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) = 'array' THEN anon_1.value
END = filter_cond.value 
GROUP BY test_1.id
[...]

Note next comment on why this is NOT the correct query.

How to implement in AiiDA (SQLAlchemy and QueryBuilder)

The same notes/discussion apply as those in #6552. In particular, using an aggregation is not ideal.

~I honestly don't know yet if this is possible with a simple filter, rather than with joins~ - I report my findings here because maybe someone has some idea, while I continue thinking about it. Pinging @danielhollas @sphuber [EDIT] Actually, this for instance seems to be working, with a correlated subquery:

select test_1.*, (SELECT COUNT(*) FROM (
      SELECT * from json_each(test_1.extras, '$') AS anon_1 
      WHERE anon_1.key IN ('c', 'b')
    )) as count from test as test_1
where count = 2

Screenshot 2024-07-25 at 13 07 34

or, even better, this query where the subquery only appears in the where statement:

select test_1.* from test as test_1
where (SELECT COUNT(*) FROM (
      SELECT * from json_each(test_1.extras, '$') AS anon_1 
      WHERE anon_1.key IN ('c', 'b')
    )) = 2

I didnt' think yet if this is easy to implement currently (maybe one just puts the subquery in the filter, and SQLA uses the right syntax - to test)

giovannipizzi commented 2 months ago

Important note on feature parity with PostgeSQL

I am not sure neither the docs of SQLAlchemy on the PSQL dialect, nor our QueryBuilder docs, really explain what contains exactly does for a dictionary (or for strings, ...).

For a list, it should return "all lists where all elements are included". For elements, I don't know, needs to be checked. For a dictionary, from the SQLAlchemy docs I was thinking that it just checks if the elements of the list are fully included in the keys of the dict. However, I think this is not the case. I think it requires a dict, and checks that the key-value pairs all exist.

Here some quick examples (not to forget), but I suggest to do proper testing before implementing the more complex query with CASE etc, to make sure we implement the same logic as in PSQL, and then we should clarify very well with examples also in our QueryBuilder docs.

Data generation

I first generate some simple data (in a PSQL backend) - note this should be done only once, or you will get multiple copies of the results!

node = orm.Int(1).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2, 'c': 3}})

node = orm.Int(2).store()
node.base.extras.set_many({'sub': {'b': 2, 'c': 3}})

node = orm.Int(3).store()
node.base.extras.set_many({'sub': ['a', 'b', 'c']})

node = orm.Int(4).store()
node.base.extras.set_many({'sub': ['b', 'c']})

node = orm.Int(5).store()
node.base.extras.set_many({'sub': 1})

Queries

"Contains" with a list

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"contains": ["a"]}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
print(qbuild.all())

returns

[[3, ['a', 'b', 'c']]]

"Does not contain" with a list

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"!contains": ["a"]}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
qbuild.all()

returns

[[1, {'a': 1, 'b': 2, 'c': 3}],
 [2, {'b': 2, 'c': 3}],
 [4, ['b', 'c']],
 [5, 1]]

"Contains" with a dictionary, key exists, values does not match

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"contains": {"a": 'whatever'}}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
qbuild.all()

returns

[]

"Contains" with a dictionary, key-value pair exists

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"contains": {"a": 1}}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
qbuild.all()

returns

[[1, {'a': 1, 'b': 2, 'c': 3}]]
giovannipizzi commented 2 months ago

Possible solution for AiiDA

After some testing (and the correlate_except is the key part of this query), this seems to be working!

        if operator == 'contains':
            from sqlalchemy import select

            if isinstance(value, (list, set)):
                subq = select(database_entity).where(func.json_each(database_entity).table_valued('value', joins_implicitly=True).c.value.in_(value)).correlate_except()
                subsubq = select(func.count()).select_from(subq).scalar_subquery()
                return subsubq == len(value)
            elif isinstance(value, dict):
                raise NotImplementedError
            else:
                raise TypeError("contains filters can only have as a parameter a list (when matching against lists) or dictionaries (when matching against dictionaries)")

If useful, here is an example script I was using for testing:

from aiida import orm, load_profile
from aiida.storage.sqlite_temp import SqliteTempBackend
temp_profile = SqliteTempBackend.create_profile('temp-profile')
load_profile(temp_profile, allow_switch=True)

node = orm.Int(1).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2, 'c': 3}})
node = orm.Int(2).store()
node.base.extras.set_many({'sub': {'b': 2, 'c': 3}})
node = orm.Int(3).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2}})
node = orm.Int(4).store()
node.base.extras.set_many({'sub': "a"})
node = orm.Int(5).store()
node.base.extras.set_many({'sub': 1})
node = orm.Int(6).store()
node.base.extras.set_many({'sub': ["a", "b"]})
node = orm.Int(7).store()
node.base.extras.set_many({'sub': ["c", "d"]})
node = orm.Int(8).store()
node.base.extras.set_many({'sub': ["a", "c"]})
node = orm.Int(9).store()
node.base.extras.set_many({'sub': ["b", "d", "a"]})

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"~contains": ["a", "b"]}}
#filters = {"extras.sub": {"contains": ["c"]}}

qbuild.append(
    orm.Int,
    project=['attributes.value'],
    filters=filters,
)
print(qbuild.all())
print(qbuild.as_sql(True))

Still TODOs (for whoever wants to pick this up):