aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

QueryBuilder: with_ancestors does not work as some might expect (+ speed issues) #3755

Open ltalirz opened 4 years ago

ltalirz commented 4 years ago

Consider the following simple graph: [provenance graph image]

With the following query

from aiida.orm import QueryBuilder, Int, Node

qb = QueryBuilder()
qb.append(Int, filters={'uuid': '8dba489c-6c37-49b8-9bdd-b3bbb0c19044'}, tag='top')
qb.append(Node, filters={'uuid': '77ffe2ac-82ae-4e6f-b972-b6604cc94671'}, with_ancestors='top')

I would expect exactly one result (the node at the bottom).

However, I get this:

In [6]: qb.all()
Out[6]:
[[<Float: uuid: 77ffe2ac-82ae-4e6f-b972-b6604cc94671 (pk: 27723) value: 256.0>],
 [<Float: uuid: 77ffe2ac-82ae-4e6f-b972-b6604cc94671 (pk: 27723) value: 256.0>]]

The reason is that there are two paths to arrive at the final node (one through the left, and one through the right). Discovered with @danieleongari

Mentioning @giovannipizzi and @lekah for comment.

P.S. @giovannipizzi We did test the with_ancestors approach for the CURATED COFs database and with the current implementation it is very slow. As you can imagine, you can easily get back the same node hundreds of times.

P.P.S. Python script to create the graph:

from aiida.engine import calcfunction
from aiida.orm import Int

@calcfunction
def divide(inp):
    return {'half_1': inp/2, 'half_2': inp/2}

@calcfunction
def sum(inp1, inp2):
    return inp1 + inp2

a = Int(256)
print(a.uuid)
b, c = divide(a).values()
d = sum(b,c)
CasperWA commented 4 years ago

I guess the simple solution would be to use sets somewhere in the method/function? But this would solve neither the speed issue nor the fact that all "similar" edges in the graph are probed.
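For instance, a minimal sketch of the "sets" idea, deduplicating the results on the Python side by UUID (hypothetical snippet; it does not address the speed issue, since the backend still enumerates all paths):

# Deduplicate the query results client-side, keyed by UUID.
unique_nodes = list({node.uuid: node for (node,) in qb.all()}.values())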

lekah commented 4 years ago

I'd say your example works exactly as expected. The same happens if you have 2 links connecting two nodes (1 and 2): you'd get node 2 twice when traversing from node 1. This is because the Cartesian product (the join) has 2 solutions, as in the example above. If you want unique solutions, use qb.distinct().all(), which asks the backend to return unique rows.
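Applied to the example above (same UUIDs), a minimal sketch:

qb = QueryBuilder()
qb.append(Int, filters={'uuid': '8dba489c-6c37-49b8-9bdd-b3bbb0c19044'}, tag='top')
qb.append(Node, filters={'uuid': '77ffe2ac-82ae-4e6f-b972-b6604cc94671'}, with_ancestors='top')
qb.distinct().all()  # the Float node is now returned only once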

ltalirz commented 4 years ago

I'd say your example works exactly as expected.

Hehe, it may do what you expect, but let me assure you that this is not what a new AiiDA user expects when asking for a node X that has ancestor Y.

I also believe that finding unique nodes with a given set of ancestors is the most common use case of with_ancestors - at least it is the one that @giovannipizzi suggested as a recommendation to tag material properties in high-throughput calculations.

I believe we should support this use case.

If you want unique solutions, use qb.distinct().all(), that is asking the backend to return unique solutions.

Thanks, I wasn't aware of this. We gave it a try, but unfortunately this does nothing to improve performance (I guess the internal postgres query is still trying to find all possible paths?).

In our example, we are querying a DB of 500 materials for a single property each:

qb = QueryBuilder()
qb.append(Group, filters = {"label":{"like": "curated-cof_%"}}, tag = "group")
qb.append(CifData, filters={'extras.curated-cof_tag': 'opt_cif_ddec'}, with_group="group", tag="cif_ddec")
qb.append(Dict, filters={'extras.curated-cof_tag': 'pe_out'}, with_ancestors='cif_ddec')
qb.distinct().all()

This query takes 30s, which is about 400 times slower than getting the same information via an alternative method, where we simply put the corresponding Dict nodes in a Group and look them up via the group (73.5ms).
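For reference, a sketch of what that group-based lookup might look like (the group label here is hypothetical; it depends on how the Dict nodes were grouped):

from aiida.orm import QueryBuilder, Group, Dict

# Look up the property Dict nodes directly via a dedicated group, with no graph traversal.
qb = QueryBuilder()
qb.append(Group, filters={'label': {'like': 'curated-cof-results_%'}}, tag='group')
qb.append(Dict, with_group='group', project=['attributes'])
qb.all()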

@giovannipizzi: Before we suggest this to users, I guess we'll need another implementation that stops after having found the first path.

danieleongari commented 4 years ago

After further testing we found a significant discrepancy in the timing of these two equivalent queries:

qb = QueryBuilder()
qb.append(Node, project=['label'], filters={'id': 530855}, tag='cif') # CifData ANUGIA
qb.append(Node, project=['attributes.is_porous'], with_ancestors='cif', filters={"extras.curated-cof_tag": "pe_out"})
qb.distinct().all()

and

qb = QueryBuilder()
qb.append(Group, filters={'id': '2382'}, tag='group') # Containing only CifData ANUGIA
qb.append(Node, project=['label'], with_group='group', tag='cif')
qb.append(Node, project=['attributes.is_porous'], with_ancestors='cif', filters={"extras.curated-cof_tag": "pe_out"})
qb.distinct().all()

they both return the material's label and the property of interest, e.g.,

[['ANUGIA', True]]

but while the first takes 27ms, the second takes as much as 144s (ca. 4 orders of magnitude more!). This is an inefficiency (or a bug) that needs to be solved. To circumvent it for the moment, we can do 2 queries: the first returns a list of PKs from a group, and the second takes this list as an 'in' filter on the id and continues the query to return the property of interest (see the sketch below). We need to test this option more, but it is only 2x slower than our current method of keeping groups of the important Dict nodes from which to parse the properties (instead of using with_ancestors).
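A sketch of this two-query workaround, reusing the group id and extras key from the examples above (untested):

from aiida.orm import QueryBuilder, Group, Node

# Query 1: collect the PKs of the CifData nodes in the group.
qb1 = QueryBuilder()
qb1.append(Group, filters={'id': '2382'}, tag='group')
qb1.append(Node, with_group='group', project=['id'])
pks = [row[0] for row in qb1.all()]

# Query 2: start the traversal from those PKs directly, so the filter on the
# origin nodes ends up inside the recursive part of the generated SQL.
qb2 = QueryBuilder()
qb2.append(Node, filters={'id': {'in': pks}}, project=['label'], tag='cif')
qb2.append(Node, project=['attributes.is_porous'], with_ancestors='cif',
           filters={'extras.curated-cof_tag': 'pe_out'})
results = qb2.distinct().all()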

Mentioning @giovannipizzi for discussion.

danieleongari commented 4 years ago

With more realistic testing on 500 materials (ca. 1000 nodes per material), parsing a single property takes 10x longer when using the with_ancestors query than when picking the property nodes directly from a group (the method we are currently using for https://www.materialscloud.org/discover/curated-cofs#mcloudHeader): 5s vs 0.5s. This means that, for real use, one should consider some caching before relying on the with_ancestors query.

giovannipizzi commented 4 years ago

In terms of functionality, I agree with @lekah that the current behaviour is the intended one (find all possible paths that connect the two nodes). If this wasn't clear to @ltalirz, then maybe we should clarify the documentation (phrasing and/or PR welcome).

On the other hand:

For both of these: @danieleongari, could you print the QB object? This should print the query itself. Then, it would also be great if you could open a PSQL shell and run the query prepended with EXPLAIN ANALYZE (i.e., EXPLAIN ANALYZE SELECT [...]), and report the output in the various cases. This reports not only the time, but also which choices the DB made (e.g. whether it used the indexes) and should clarify why this happens.
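A minimal sketch of that procedure (the printed SQL will of course differ per query and database):

# Print the SQL that the QueryBuilder generates:
print(qb)

# Then, in a psql session on the AiiDA database, paste the printed query
# prefixed with EXPLAIN ANALYZE, e.g.:
#   EXPLAIN ANALYZE WITH RECURSIVE anon_2(ancestor_id, descendant_id, depth) AS (...)
# and inspect the reported plan (index usage, row estimates, actual timings).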

Great if you could report here, and possibly then we can meet in person if we need to do more debugging.

Finally, I feel that we are going in the right direction in terms of speed improvement, and we can still gain something with some improvements, after which the speed should be more than good for the website. And even if this is not the case, as we discussed in person, for that specific use case one can think to custom caching of the query results (happy to discuss more).

lekah commented 4 years ago

For the slowness: as @danieleongari reports, it is very weird; the slowness does not come from the ancestors part, but from its combination with a group. This is a bit surprising to me.

The explanation is the following: the with_ancestors/with_descendant query works with a common table expression, essentially it needs to construct - in memory - a table with the necessary paths on the fly when you ask for it, since storing all these paths is prohibitive (see #258).

The question now is: how big is this in-memory table storing all the paths? The worst-case scenario is all possible paths in the AiiDA graph, which is probably what happens in the second, slow example. (The PSQL engine does not figure out by itself that this is much more expensive than first determining which nodes are necessary.)

As to why the first example is so fast: if the user provides specific filters on the origin node(s) of the paths, these are incorporated into the common table expression (explicitly by us, see QueryBuilder._join_descendants_recursive). The engine is therefore told to construct this table only for the subset of origin nodes given by the query. This difference in speed is hence (for me) expected behaviour, although maybe not well documented. This is what I remember from the tests we did 2 years ago and from the implementation; the EXPLAIN ANALYZE would indeed help to confirm or refine this explanation.

As for fixing these speed issues: we were aware of them and couldn't figure out a way to tell the engine to perform the query operations in a specific order. If someone manages to force the engine to create the paths "table" last, and only for the nodes that are necessary, that would solve the issue (but maybe create other issues). It's not easy though...

giovannipizzi commented 4 years ago

Indeed, I had forgotten what @lekah writes:

If the user provides specific filters on the origin node(s) of the paths, these are incorporated in the common-table-expression (explicitly by us, see QueryBuilder._join_descendants_recursive).

(link to the function) and

couldn't figure out a way to tell the engine to perform the query-ops in a specific order.

This is also confirmed by the following, where I paste the actual SQL queries (different PKs):

%%timeit
qb = QueryBuilder() 
qb.append(Node, project=['label'], filters={'id': 638}, tag='cif') 
qb.append(Node, project=['id'], with_ancestors='cif') 
qb.distinct().all() 

output of print(qb):

WITH RECURSIVE anon_2(ancestor_id, descendant_id, depth) AS 
(SELECT db_dblink_1.input_id AS ancestor_id, db_dblink_1.output_id AS descendant_id, CAST(0 AS INTEGER) AS depth 
FROM db_dbnode AS db_dbnode_3 JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_3.id 
WHERE CAST(db_dbnode_3.node_type AS VARCHAR) LIKE '%%' AND db_dbnode_3.id = 638 AND db_dblink_1.type IN ('create', 'input_calc') UNION ALL SELECT anon_2.ancestor_id AS ancestor_id, db_dblink_2.output_id AS descendant_id, anon_2.depth + CAST(1 AS INTEGER) AS current_depth 
FROM anon_2 JOIN db_dblink AS db_dblink_2 ON db_dblink_2.input_id = anon_2.descendant_id 
WHERE db_dblink_2.type IN ('create', 'input_calc'))
 SELECT DISTINCT db_dbnode_1.label, db_dbnode_2.id 
FROM db_dbnode AS db_dbnode_1 JOIN anon_2 AS anon_1 ON anon_1.ancestor_id = db_dbnode_1.id JOIN db_dbnode AS db_dbnode_2 ON anon_1.descendant_id = db_dbnode_2.id 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE '%%' AND db_dbnode_1.id = 638 AND CAST(db_dbnode_2.node_type AS VARCHAR) LIKE '%%'

With this instead:

%%timeit
qb = QueryBuilder()
qb.append(Group, filters={'id': '25'}, tag='group')
qb.append(Node, project=['label'], with_group='group', tag='cif')
qb.append(Node, project=['id'], with_ancestors='cif')
qb.distinct().all()

we get

WITH RECURSIVE anon_2(ancestor_id, descendant_id, depth) AS 
(SELECT db_dblink_1.input_id AS ancestor_id, db_dblink_1.output_id AS descendant_id, CAST(0 AS INTEGER) AS depth 
FROM db_dbnode AS db_dbnode_3 JOIN db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_3.id 
WHERE CAST(db_dbnode_3.node_type AS VARCHAR) LIKE '%%' AND db_dblink_1.type IN ('create', 'input_calc') UNION ALL SELECT anon_2.ancestor_id AS ancestor_id, db_dblink_2.output_id AS descendant_id, anon_2.depth + CAST(1 AS INTEGER) AS current_depth 
FROM anon_2 JOIN db_dblink AS db_dblink_2 ON db_dblink_2.input_id = anon_2.descendant_id 
WHERE db_dblink_2.type IN ('create', 'input_calc'))
 SELECT DISTINCT db_dbnode_1.label, db_dbnode_2.id 
FROM db_dbgroup AS db_dbgroup_1 JOIN db_dbgroup_dbnodes AS db_dbgroup_dbnodes_1 ON db_dbgroup_dbnodes_1.dbgroup_id = db_dbgroup_1.id JOIN db_dbnode AS db_dbnode_1 ON db_dbnode_1.id = db_dbgroup_dbnodes_1.dbnode_id JOIN anon_2 AS anon_1 ON anon_1.ancestor_id = db_dbnode_1.id JOIN db_dbnode AS db_dbnode_2 ON anon_1.descendant_id = db_dbnode_2.id 
WHERE db_dbgroup_1.id = '25' AND CAST(db_dbnode_1.node_type AS VARCHAR) LIKE '%%' AND CAST(db_dbnode_2.node_type AS VARCHAR) LIKE '%%'

Note the difference: in the first query, the filter AND db_dbnode_3.id = 638 is applied while creating the recursive expression; in the second, the filter on the group is applied only after the full transitive closure of the whole DB has been created. Suggestions welcome!

So, I think there are 2 aspects, independent and both relevant:

giovannipizzi commented 4 years ago

As a note, if you try this on a small, disconnected DB the timings will be very similar (e.g. I get ~20ms for both on a DB of only 2000 nodes).

giovannipizzi commented 4 years ago

One comment for @danieleongari: for performance, it would be interesting to check if you can rewrite your query using with_descendants instead of with_ancestors. I think that with_descendants will traverse the provenance upwards, where you typically have many fewer branches (since a piece of data can only have ONE creator), while going downwards can be much more expensive (a piece of data can be used by very many calculations). One should check whether the generated query is able to embed the filter inside itself; otherwise we get the same problem that you see with groups.
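A sketch of what such a rewrite might look like, reusing the filters from the earlier examples (untested; whether the filter actually ends up embedded in the recursive expression is exactly what needs to be checked):

from aiida.orm import QueryBuilder, Node, Dict

# Start from the property Dict nodes and traverse the provenance upwards:
# the appended Node must have the Dict nodes among its descendants.
qb = QueryBuilder()
qb.append(Dict, project=['attributes.is_porous'],
          filters={'extras.curated-cof_tag': 'pe_out'}, tag='result')
qb.append(Node, project=['label'], with_descendants='result',
          filters={'extras.curated-cof_tag': 'opt_cif_ddec'})
qb.distinct().all()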

A second comment, also for @danieleongari: I was looking at how to avoid getting duplicates. I simplified the query a bit (removing the last part, which is not really interesting for the purpose of this discussion) and tried to remove the depth and unify the results. Here is the result (I changed SELECT to SELECT DISTINCT, even if this is probably not needed; most importantly, I replaced UNION ALL with UNION, which removes duplicates).

One needs to replace the PK of the node in the query, but it would be interesting to know if this is (much?) faster than the query with the depth in your case (note that I also removed some of the filters here, so the query below may need some adaptation).

WITH RECURSIVE anon_2(ancestor_id, descendant_id) AS 
(
    SELECT DISTINCT db_dblink_1.input_id AS ancestor_id, db_dblink_1.output_id AS descendant_id
    FROM db_dbnode AS db_dbnode_3 
    JOIN 
        db_dblink AS db_dblink_1 ON db_dblink_1.input_id = db_dbnode_3.id 
    WHERE 
        CAST(db_dbnode_3.node_type AS VARCHAR) LIKE '%%' 
        AND db_dbnode_3.id = 133
        AND db_dblink_1.type IN ('create', 'input_calc') 
    UNION 
        SELECT DISTINCT anon_2.ancestor_id AS ancestor_id, db_dblink_2.output_id AS descendant_id
        FROM anon_2 JOIN db_dblink AS db_dblink_2 ON db_dblink_2.input_id = anon_2.descendant_id 
        WHERE
            db_dblink_2.type IN ('create', 'input_calc')
)
SELECT db_dbnode_1.label, db_dbnode_2.id 
FROM db_dbnode AS db_dbnode_1 
JOIN
    anon_2 AS anon_1 ON anon_1.ancestor_id = db_dbnode_1.id
    JOIN
        db_dbnode AS db_dbnode_2 ON anon_1.descendant_id = db_dbnode_2.id 
    WHERE
        CAST(db_dbnode_1.node_type AS VARCHAR) LIKE '%%'
        AND db_dbnode_1.id = 133
        AND CAST(db_dbnode_2.node_type AS VARCHAR) LIKE '%%';

EDIT: next optimisation strategy - see if it's possible in some way to 'stop' the query as soon as a relevant node is found? Not sure it's possible, though.

ltalirz commented 4 years ago

Just for reference, we had a discussion with @danieleongari, @ramirezfranciscof & @giovannipizzi; notes can be found here

Recommendations:

Todo

zhubonan commented 3 years ago

I think I also found a performance issue with with_ancestors, and it is related to the use of a Group in the query.

If I have a group of nodes and I query their children using:

q = QueryBuilder()
q.append(Group, filters={'label': 'mygroup'}, tag='group')
q.append(StructureData, with_group='group', project=['id', 'label'], tag='root')
q.append(StructureData, with_ancestors='root')
q.count()

is extremely slow and almost never finishes. Here, each structure can have been relaxed multiple times (producing intermediate structures). I can see lots of disk activity from the postgres process in the process manager.

On the other hand, search in the reverse direction seems to work better:

q = QueryBuilder()
q.append(StructureData, tag='child')
q.append(StructureData, with_descendants='child', tag='root')
q.append(Group, filters={'label':'mygroup' }, with_node='root')
q.count()

and actually finishes. There are about 40 root nodes and 2000 child structures; on my computer, it took about 16s.

So far, the best performance is achieved by querying each Node member of the group separately:

counts = 0
q = QueryBuilder()
q.append(Group, filters={'label': 'mygroup'}, tag='group')
q.append(Node, with_group='group', project=['id', 'label'], tag='root')
for (pk, label) in q.all():
    qb = QueryBuilder()
    qb.append(StructureData, filters={'id': pk}, project=['*', 'label'], tag='root')
    qb.append(StructureData, project=['*', 'label'], with_ancestors='root')
    counts += qb.count()

as it takes less than a second to finish and gives the same results.

I suppose the difference lies in how the SQL queries are constructed internally, which may be difficult to change. Perhaps it is worth letting the QueryBuilder issue a warning when the user requests a potentially freezing query?

giovannipizzi commented 3 years ago

Yes, unfortunately with 'standard' PostgreSQL syntax, and without injecting postgres functions, it's hard to make this very efficient (or at least I didn't manage, and I'm getting convinced it's not possible). In your case, you are not making a recursive query from a single node but from a group of nodes. In practice, the query ends up creating the transitive closure of the whole DB in memory, only to then query for the required relation and discard the rest...

Could you try to rewrite your last query using the AiiDA Graph Explorer (AGE)? Maybe @ramirezfranciscof can help you. If this is very efficient, maybe the best option becomes adding a flag so that with_ancestors internally converts the query into an AGE query (assuming that's possible?).
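For reference, a heavily hedged sketch of such a programmatic traversal using the traverse_graph helper (built on top of the AGE machinery, if I recall correctly; the module path and signature may differ between AiiDA versions):

from aiida.common.links import LinkType
from aiida.orm import QueryBuilder, Group, Node
from aiida.tools.graph.graph_traversers import traverse_graph

# First, collect the PKs of the group members with a cheap query.
qb = QueryBuilder()
qb.append(Group, filters={'label': 'mygroup'}, tag='group')
qb.append(Node, with_group='group', project=['id'])
starting_pks = [row[0] for row in qb.all()]

# Then traverse only forward (create/input_calc) links from those PKs,
# instead of building the transitive closure of the whole DB.
result = traverse_graph(starting_pks, links_forward=(LinkType.CREATE, LinkType.INPUT_CALC))
descendant_pks = result['nodes']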