aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Improve the efficiency of link validation #3088

Closed sphuber closed 5 years ago

sphuber commented 5 years ago

The link validation in Node.add_incoming was written for correctness when the Node interface was rewritten, and efficiency was not given priority. However, in certain use cases the validation becomes too slow and forms a bottleneck. The problem arises when there are so-called "god" nodes in the database. These are typically input nodes that are used over and over again for certain calculations. This happens either by design, e.g. with UpfData nodes, or because the user is trying to work around another inefficiency of AiiDA: duplicate node content is not automatically deduplicated, so if a node is stored with the exact same content as a node already in the database, the content is duplicated.

The problem with "god" nodes is that when a new link is added from or to one, the link validation needs to consider all the links the node already has. This operation becomes progressively heavier as more links are added.
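The scaling can be illustrated with a toy model (plain Python, not AiiDA code; the function names are made up for illustration): if every link addition scans all existing links of the node to check for a label clash, adding M links to a single god node costs on the order of M² comparisons in total, whereas keeping a per-node set of used labels makes each check constant time.

```python
# Toy model (NOT AiiDA code) of why link validation on "god" nodes gets slow:
# a per-link scan makes the total cost of adding M links quadratic in M.

def add_link_naive(existing_links, label):
    """Scan all existing links for a duplicate label: O(N) per call."""
    comparisons = 0
    for _, existing_label in existing_links:
        comparisons += 1
        if existing_label == label:
            raise ValueError(f'duplicate link label {label!r}')
    existing_links.append(('node', label))
    return comparisons

def add_link_counted(used_labels, label):
    """Track used labels in a set instead: O(1) per call."""
    if label in used_labels:
        raise ValueError(f'duplicate link label {label!r}')
    used_labels.add(label)

links = []
# Total comparisons for 1000 additions: 0 + 1 + ... + 999 = 499500,
# i.e. the cost grows quadratically with the number of links on the node.
total = sum(add_link_naive(links, f'label_{i}') for i in range(1000))
```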

If possible, the implementation of validate_incoming and validate_outgoing at the ORM level should be made more efficient. Alternatively, solutions that do not rely on the ORM, for example performing the checks at the database level (which is potentially a lot faster), may have to be considered. See issue #2238 for an example.
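As a sketch of what a database-level check might look like (using sqlite3 and a made-up minimal link table; AiiDA's actual schema and backend differ), a "label already taken" constraint can be answered with a single COUNT query instead of loading every neighbour node into memory:

```python
import sqlite3

# Hypothetical minimal link table for illustration; AiiDA's real db_dblink
# table and backend session handling are different.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE links (input_id INT, output_id INT, label TEXT, type TEXT)')
# Simulate a god node (pk 1) with ~2400 outgoing calculation links.
conn.executemany(
    'INSERT INTO links VALUES (?, ?, ?, ?)',
    [(1, pk, f'calc_{pk}', 'input_calc') for pk in range(2, 2402)],
)

def label_is_taken(conn, source_pk, label, link_type):
    """Check a unique-label-per-source constraint with one COUNT query."""
    (count,) = conn.execute(
        'SELECT COUNT(*) FROM links WHERE input_id = ? AND label = ? AND type = ?',
        (source_pk, label, link_type),
    ).fetchone()
    return count > 0

taken = label_is_taken(conn, 1, 'calc_5', 'input_calc')      # existing label
free = label_is_taken(conn, 1, 'calc_9999', 'input_calc')    # unused label
```

With an index on (input_id, label, type) such a query stays fast regardless of how many links the god node accumulates.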

zhubonan commented 5 years ago

This is indeed a problem. The situation is worse in my case as I run CASTEP calculations with on-the-fly generated pseudopotentials. My calculations are linked to a few god-like nodes representing the built-in libraries. Launching a process becomes very slow (~10 s) once 15K nodes are stored in the database. There is a single pseudopotential node with ~2400 outgoing links to CalcJobNodes.

I think a fix can be made without moving the check to the database level (in case that is difficult to do).

I find that the throughput-limiting check is in the validate_link function:

https://github.com/aiidateam/aiida-core/blob/8447c6bfc9e55bdde5843310f3d66a1d77b0e8cf/aiida/orm/utils/links.py#L126-L141

The LinkManager is slow to instantiate because all the nodes connected to the source are loaded. The connected nodes are not needed in most cases (only when an error message has to be printed).

Instead, the link types, the pks of the targets and the link labels can be retrieved relatively fast using the QueryBuilder:

from aiida.orm import Node, QueryBuilder

# `source` is the node whose outgoing links are being validated
q = QueryBuilder().append(Node, filters={'id': source.id})
q.append(Node, edge_project=['type', 'output_id', 'label'])
q.all()
...

The results can be used to check the different outdegree constraints, replacing the need for get_outgoing.
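The suggested check could be sketched in plain Python (the function name and the exact triple format are assumptions based on the projection above, not the actual aiida-core implementation): given the (link_type, target_pk, label) rows returned by the QueryBuilder, a unique-label constraint can be verified without instantiating a single Node.

```python
from collections import Counter

def validate_outgoing_triples(triples, new_label, new_link_type):
    """Validate a prospective outgoing link against projected existing links.

    `triples` are (link_type, target_pk, label) rows as projected by the
    QueryBuilder query above; only types and labels are inspected, so no
    neighbour nodes ever need to be loaded from the database.
    """
    # Count existing labels per link type in a single pass over the rows.
    labels_per_type = Counter((link_type, label) for link_type, _, label in triples)
    if labels_per_type[(new_link_type, new_label)] > 0:
        raise ValueError(
            f'node already has an outgoing {new_link_type} link labelled {new_label!r}'
        )

# Example: existing links of a god node, as (type, target pk, label) rows.
existing = [('input_calc', pk, f'calc_{pk}') for pk in range(2, 100)]
validate_outgoing_triples(existing, 'calc_200', 'input_calc')  # new label: passes
```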

@sphuber What do you think? Should I make a pull request, or is it better for one of the core developers to fix this? Obviously a database-level check would potentially be even faster.

giovannipizzi commented 5 years ago

Thanks a lot @zhubonan for this! You can definitely make a PR if you are willing to! Maybe before that, would you be able to provide some timings before and after your suggested change, so we can briefly discuss whether this fix is enough? Thanks!!!

sphuber commented 5 years ago

I second @giovannipizzi, thanks a lot for the input. The most important thing for solving this issue is to have proper benchmarks, so we know to what extent the fix addresses the problem. If you can provide some initial tests on your own database to show the improvement, we can go from there.