aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
436 stars 191 forks source link

Counterintuitive Behaviors of `contains` Filter Operator over Arrays in PostgreSQL Backend #6618

Open rabbull opened 6 days ago

rabbull commented 6 days ago

Describe the bug

The contains operator on the PostgreSQL backend lacks comprehensive documentation. To better understand its exact behavior, I conducted additional tests in #6617. However, some of the observ /ed behaviors are counterintuitive, and these are detailed below:

1. Non-Existent attr_key

import pytest
from aiida.orm import Dict, QueryBuilder

@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.requires_psql
def test():
    Dict({
        'arr': [114, 514]
    }).store()

    qb = QueryBuilder().append(Dict, filters={
        'attributes.oops': {'contains': []},
    })
    print(len(qb.all()))  # prints 0

    qb = QueryBuilder().append(Dict, filters={
        'attributes.oops': {'!contains': []},
    })
    print(len(qb.all()))  # also prints 0

In a test where the attributes column does not contain a key named oops, the query executes successfully. However, the results are confusing: neither the affirmation nor the negation of the contains operation matches the entry. This behavior is unexpected and counterintuitive. What best fits my expectation is to fail loudly, e.g. to raise an Exception.

2. Nested Array

This is a known issue discussed in a previous thread on discourse.

@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.requires_psql
def test():
    Dict({
        'arr': [[1, 2], [3]]
    }).store()

    qb = QueryBuilder().append(Dict, filters={
        'attributes.arr': {'contains': [[4]]},
    })
    assert len(qb.all()) == 0  # OK

    qb = QueryBuilder().append(Dict, filters={
        'attributes.arr': {'contains': [[2]]},
    })
    assert len(qb.all()) == 0  # AssertionError: assert 1 == 0

When testing with nested arrays, the contains operation unexpectedly matches entries even when the contained elements do not strictly align with the expected structure. For example, a query attempting to match [2] against an array [[1, 2], [3]] may return a match, even though [2] is not directly an element of the array.

Note that this actually complies with PostgreSQL's native JSONB containment semantics:

postgres=# select '[[1, 2], [3]]'::jsonb @> '[[2]]'::jsonb;
 ?column?
----------
 t
(1 row)

postgres=# select '[[1, 2], [3]]'::jsonb @> '[[4]]'::jsonb;
 ?column?
----------
 f
(1 row)

However, this behavior contracts with the intution quite a lot, and consequently makes the abstraction hard to be understood well.

In addition, it should also be nice to mention in the documentation that contains doesn't care the order of arrays. See example below:

@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.requires_psql
def test():
    Dict({
        'arr': [[1, 2], [3]]
    }).store()

    qb = QueryBuilder().append(Dict, filters={
        'attributes.arr': {'contains': [[2, 1]]},
    })
    assert len(qb.all()) == 1  # OK

Steps to reproduce

See above.

Expected behavior

See above.

Your environment

Other relevant software versions, e.g. Postres & RabbitMQ

Additional context

This might be issues of SQLAlchemy and need further investigation.

rabbull commented 4 days ago

When implementing this operator on the SQLite backend, I gained a clearer understanding of its semantics, especially in the context of nested arrays. It's overly simplistic to treat this operation as a purely mathematical "is-subset" or "contains" comparator. Instead, if we conceptualize JSON objects as trees, this contains operator effectively checks whether the operand is a connected subgraph of the tree.

For instance, consider the JSON arrays [[1, 2], 3] and [[2]]. If we visualize them as trees (as shown below):

lhs

rhs

It becomes evident that [[2]] is a sub-tree of [[1, 2], 3]. Consequently, the operation [[1, 2], 3] @> [[2]] returns true.

In contrast, if you examine [2]:

image

You’ll notice it lacks an intermediate "ARRAY" node, so [[1, 2], 3] do not contains [2].

This becomes even more apparent in cases like [[3]] and other examples, so I would prefer not to elaborate too far.

This tree-based perspective also applies to JSON objects (dictionaries in Python). For instance:

Here, {"data": {"gender": "female"}} is a sub-graph of the larger JSON object. Accordingly, PostgreSQL evaluates {"data": {"name": "Alice", "gender": "female"}, "time_modified": "1732231289"} @> {"data": {"gender": "female"}} as true.

Additionally, this tree-based interpretation clarifies why the order of elements within an array doesn’t matter: the structure is what’s being compared, not the sequence.

Another noteworthy property of this containment operator is that duplicate elements are ignored. By combining these two characteristics, the behavior of the operator becomes more intuitive and easier to understand.

cc @GeigerJ2

GeigerJ2 commented 13 hours ago

Thanks a lot for this deep-dive and taking the time to write down the explanation here, @rabbull! That actually makes a lot of sense. Feel free to add a pointer to this issue to the discourse thread you linked.