aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

`set_extra` gets slower and slower in iterations of loop #1456

Closed ltalirz closed 1 year ago

ltalirz commented 6 years ago

Consider the following script for setting extras on a set of ParameterData nodes.

It first performs a join query (which runs fine) and then starts adding extras, looping over the results of the query:

from aiida.orm.data.cif import CifData
from aiida.orm.data.parameter import ParameterData
from aiida.orm.querybuilder import QueryBuilder

# Query for CifData uuids and the corresponding ParameterData nodes
qb = QueryBuilder()
qb.append(CifData, project=['uuid'], tag='cifs')
qb.append(ParameterData, project=['*'], descendant_of='cifs')
print("Performing query")
qb.limit(7000)
results = qb.all()
nresults = len(results)

# change this to 350, 700, 1400 to see the problem
results = results[:350] 

print("Adding extras")
for cif_uuid, pm_node in results:
    pm_node.set_extra('cif_uuid', str(cif_uuid))

# alternative without loop using map
#map(lambda r: r[1].set_extra('cif_uuid', str(r[0])), results)

The problem is that the runtime of the loop increases superlinearly with the number of items:

#items  time spent in loop
 350 6.6s
 700 20.1s
1400 70s

We need this to work for tens or hundreds of thousands of items, i.e. this problem needs to be addressed.

szoupanos commented 6 years ago

A small & fast test: SQLA is about 4 times slower than Django, but it doesn't get slower over time.

(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p sqla_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 3.3376698494, Throughput (time per 100 nodes): 3.3376698494
Counter: 200, Time elspsed in ms: 6.52040481567, Throughput (time per 100 nodes): 3.18273496628
Counter: 300, Time elspsed in ms: 9.71689081192, Throughput (time per 100 nodes): 3.19648599625
Counter: 400, Time elspsed in ms: 12.9863698483, Throughput (time per 100 nodes): 3.26947903633
Counter: 500, Time elspsed in ms: 16.3025808334, Throughput (time per 100 nodes): 3.31621098518
Counter: 600, Time elspsed in ms: 19.6147549152, Throughput (time per 100 nodes): 3.3121740818
Counter: 700, Time elspsed in ms: 22.9152059555, Throughput (time per 100 nodes): 3.30045104027
Counter: 800, Time elspsed in ms: 26.2304990292, Throughput (time per 100 nodes): 3.31529307365
Counter: 900, Time elspsed in ms: 29.4893889427, Throughput (time per 100 nodes): 3.25888991356
Counter: 1000, Time elspsed in ms: 32.7418749332, Throughput (time per 100 nodes): 3.25248599052
Counter: 1100, Time elspsed in ms: 35.9874498844, Throughput (time per 100 nodes): 3.24557495117
Counter: 1200, Time elspsed in ms: 39.1917059422, Throughput (time per 100 nodes): 3.20425605774
Counter: 1300, Time elspsed in ms: 42.5265669823, Throughput (time per 100 nodes): 3.33486104012
Counter: 1400, Time elspsed in ms: 45.8009498119, Throughput (time per 100 nodes): 3.27438282967
Counter: 1500, Time elspsed in ms: 49.0933759212, Throughput (time per 100 nodes): 3.29242610931
Time elapsed in ms 50.9183118343
(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p dj_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 0.876977920532, Throughput (time per 100 nodes): 0.876977920532
Counter: 200, Time elspsed in ms: 1.72307682037, Throughput (time per 100 nodes): 0.846098899841
Counter: 300, Time elspsed in ms: 2.47800397873, Throughput (time per 100 nodes): 0.754927158356
Counter: 400, Time elspsed in ms: 3.23580598831, Throughput (time per 100 nodes): 0.757802009583
Counter: 500, Time elspsed in ms: 4.03150391579, Throughput (time per 100 nodes): 0.795697927475
Counter: 600, Time elspsed in ms: 4.83035802841, Throughput (time per 100 nodes): 0.798854112625
Counter: 700, Time elspsed in ms: 5.60914301872, Throughput (time per 100 nodes): 0.778784990311
Counter: 800, Time elspsed in ms: 6.47481393814, Throughput (time per 100 nodes): 0.865670919418
Counter: 900, Time elspsed in ms: 7.33511090279, Throughput (time per 100 nodes): 0.860296964645
Counter: 1000, Time elspsed in ms: 8.20029091835, Throughput (time per 100 nodes): 0.865180015564
Counter: 1100, Time elspsed in ms: 9.10501289368, Throughput (time per 100 nodes): 0.904721975327
Counter: 1200, Time elspsed in ms: 9.9322988987, Throughput (time per 100 nodes): 0.82728600502
Counter: 1300, Time elspsed in ms: 10.8640909195, Throughput (time per 100 nodes): 0.931792020798
Time elapsed in ms 11.7664499283
szoupanos commented 6 years ago

And by limiting the cifs to 700, I get the following results

(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p sqla_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 2.21686291695, Throughput (time per 100 nodes): 2.21686291695
Counter: 200, Time elspsed in ms: 4.34723496437, Throughput (time per 100 nodes): 2.13037204742
Counter: 300, Time elspsed in ms: 6.46241307259, Throughput (time per 100 nodes): 2.11517810822
Counter: 400, Time elspsed in ms: 8.53719997406, Throughput (time per 100 nodes): 2.07478690147
Counter: 500, Time elspsed in ms: 10.5712640285, Throughput (time per 100 nodes): 2.03406405449
Counter: 600, Time elspsed in ms: 12.6843640804, Throughput (time per 100 nodes): 2.11310005188
Counter: 700, Time elspsed in ms: 14.8037650585, Throughput (time per 100 nodes): 2.11940097809
Time elapsed in ms 14.8044171333
(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p dj_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 1.01815199852, Throughput (time per 100 nodes): 1.01815199852
Counter: 200, Time elspsed in ms: 1.96495509148, Throughput (time per 100 nodes): 0.946803092957
Counter: 300, Time elspsed in ms: 2.91608095169, Throughput (time per 100 nodes): 0.951125860214
Counter: 400, Time elspsed in ms: 3.85885596275, Throughput (time per 100 nodes): 0.942775011063
Counter: 500, Time elspsed in ms: 4.79002594948, Throughput (time per 100 nodes): 0.931169986725
Counter: 600, Time elspsed in ms: 5.74456310272, Throughput (time per 100 nodes): 0.954537153244
Counter: 700, Time elspsed in ms: 6.774269104, Throughput (time per 100 nodes): 1.02970600128
Time elapsed in ms 6.77494692802
szoupanos commented 6 years ago

Leo, I am looking at this and I am unable to reproduce it... Moreover, for no "obvious reason" the numbers look good, i.e. SQLA has comparable (or even better) times than Django for 400 as well as for 1300 results.

400 updates

(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p dj_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 1.61309695244, Throughput (time per 100 nodes): 1.61309695244
Counter: 200, Time elspsed in ms: 3.13849496841, Throughput (time per 100 nodes): 1.52539801598
Counter: 300, Time elspsed in ms: 4.85553193092, Throughput (time per 100 nodes): 1.71703696251
Counter: 400, Time elspsed in ms: 6.48597598076, Throughput (time per 100 nodes): 1.63044404984
Time elapsed in ms 6.48603892326
(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p sqla_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 1.55413198471, Throughput (time per 100 nodes): 1.55413198471
Counter: 200, Time elspsed in ms: 3.05705595016, Throughput (time per 100 nodes): 1.50292396545
Counter: 300, Time elspsed in ms: 4.61017894745, Throughput (time per 100 nodes): 1.55312299728
Counter: 400, Time elspsed in ms: 6.11737394333, Throughput (time per 100 nodes): 1.50719499588
Time elapsed in ms 6.11741995811

1300 updates

(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p dj_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 1.73415493965, Throughput (time per 100 nodes): 1.73415493965
Counter: 200, Time elspsed in ms: 3.43853402138, Throughput (time per 100 nodes): 1.70437908173
Counter: 300, Time elspsed in ms: 5.20282387733, Throughput (time per 100 nodes): 1.76428985596
Counter: 400, Time elspsed in ms: 6.819480896, Throughput (time per 100 nodes): 1.61665701866
Counter: 500, Time elspsed in ms: 8.45319795609, Throughput (time per 100 nodes): 1.63371706009
Counter: 600, Time elspsed in ms: 10.1259660721, Throughput (time per 100 nodes): 1.672768116
Counter: 700, Time elspsed in ms: 11.9293949604, Throughput (time per 100 nodes): 1.80342888832
Counter: 800, Time elspsed in ms: 13.8698630333, Throughput (time per 100 nodes): 1.94046807289
Counter: 900, Time elspsed in ms: 15.7028999329, Throughput (time per 100 nodes): 1.83303689957
Counter: 1000, Time elspsed in ms: 17.604598999, Throughput (time per 100 nodes): 1.90169906616
Counter: 1100, Time elspsed in ms: 19.4574260712, Throughput (time per 100 nodes): 1.85282707214
Counter: 1200, Time elspsed in ms: 21.3064129353, Throughput (time per 100 nodes): 1.84898686409
Counter: 1300, Time elspsed in ms: 23.1734988689, Throughput (time per 100 nodes): 1.86708593369
Time elapsed in ms 23.1735398769
(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p sqla_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 1.53062701225, Throughput (time per 100 nodes): 1.53062701225
Counter: 200, Time elspsed in ms: 3.03121781349, Throughput (time per 100 nodes): 1.50059080124
Counter: 300, Time elspsed in ms: 4.57060885429, Throughput (time per 100 nodes): 1.5393910408
Counter: 400, Time elspsed in ms: 6.09904384613, Throughput (time per 100 nodes): 1.52843499184
Counter: 500, Time elspsed in ms: 7.65300893784, Throughput (time per 100 nodes): 1.55396509171
Counter: 600, Time elspsed in ms: 9.18871998787, Throughput (time per 100 nodes): 1.53571105003
Counter: 700, Time elspsed in ms: 10.7138359547, Throughput (time per 100 nodes): 1.5251159668
Counter: 800, Time elspsed in ms: 12.2518479824, Throughput (time per 100 nodes): 1.53801202774
Counter: 900, Time elspsed in ms: 13.7863049507, Throughput (time per 100 nodes): 1.53445696831
Counter: 1000, Time elspsed in ms: 15.224752903, Throughput (time per 100 nodes): 1.43844795227
Counter: 1100, Time elspsed in ms: 16.5713739395, Throughput (time per 100 nodes): 1.34662103653
Counter: 1200, Time elspsed in ms: 18.0671589375, Throughput (time per 100 nodes): 1.49578499794
Counter: 1300, Time elspsed in ms: 19.5408229828, Throughput (time per 100 nodes): 1.47366404533
Time elapsed in ms 19.5408809185

I used the following scripts to populate the databases: extra_update.py.txt, structure.cif.txt, upload_cif.py.txt

I will also try with more CIF files. I am using the release branch release_v0.12.0.

ltalirz commented 6 years ago

Very strange... so let me check a few things:

Please check whether the numbers stay the same when you increase to 5000 results. If they do, I will repeat the test on my machine. (By the way, 0.12 is released now, i.e. the release branch has been deleted, so we can just use master.)

ltalirz commented 6 years ago

@szoupanos Ah wait: you are not pre-loading the nodes anymore! You are just iterating over uuids and then loading the nodes inside the loop. That will be the reason.
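For concreteness, a minimal sketch of the two variants being compared here, written against the 0.x API used in the script above (the `load_node` import is an assumption; in a `verdi run` environment it is available anyway):

from aiida.orm.data.cif import CifData
from aiida.orm.data.parameter import ParameterData
from aiida.orm.querybuilder import QueryBuilder
from aiida.orm import load_node

# Variant 1 (original script): project '*' so the ParameterData nodes are
# pre-loaded by the query itself; the loop only calls set_extra.
qb = QueryBuilder()
qb.append(CifData, project=['uuid'], tag='cifs')
qb.append(ParameterData, project=['*'], descendant_of='cifs')
for cif_uuid, pm_node in qb.all():
    pm_node.set_extra('cif_uuid', str(cif_uuid))

# Variant 2 (apparently what the benchmark script did): project only the uuid
# and load every node inside the loop, adding one extra SELECT per iteration.
qb = QueryBuilder()
qb.append(CifData, project=['uuid'], tag='cifs')
qb.append(ParameterData, project=['uuid'], descendant_of='cifs')
for cif_uuid, pm_uuid in qb.all():
    pm_node = load_node(uuid=pm_uuid)
    pm_node.set_extra('cif_uuid', str(cif_uuid))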

szoupanos commented 6 years ago

Yes.... This is the reason... (thanks!) Let me look a bit more then

ltalirz commented 6 years ago

P.S. I realize that pre-loading the nodes may not even be the best strategy when you want to set extras, since that currently triggers an immediate write to the DB for each extra. But it can certainly make sense to load a large number of nodes in one go if you later want to do other operations on them that don't each require a write to the DB.

More generally speaking, though, what we from the LSMO really want is a way to do bulk operations on nodes that often don't require the creation of the Python objects at all. Such as:

In these operations, there is really no need for the Python objects ever to be created in memory, and I'm sure that running a bulk insert query on the DB is orders of magnitude faster than what we are doing at the moment.
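As an illustration of the kind of operation meant here, a rough sketch of a single-statement bulk extra update that bypasses the ORM entirely. This is not an existing AiiDA API; the db_dbnode table and its JSONB extras column of the SQLA backend are assumptions of this sketch:

import json
from sqlalchemy import text

def bulk_set_extra(session, pks, key, value):
    """Set one extra on many nodes with a single UPDATE (illustrative sketch only)."""
    session.execute(
        text(
            'UPDATE db_dbnode '
            'SET extras = extras || CAST(:patch AS jsonb) '
            'WHERE id = ANY(:pks)'
        ),
        {'patch': json.dumps({key: value}), 'pks': list(pks)},
    )
    session.commit()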

szoupanos commented 6 years ago

This is a good point, but it has to be done in a "clever way". Theoretically (but also practically) speaking, at the AiiDA level we should be unaware of the backend used. Moreover, bypassing the SQLA ORM and dealing directly with the database may lead to inconsistencies (e.g. directly changing the value of a DB object, which is then overwritten by the SQLA ORM because such changes are not synced).

But such an idea might be a step in the right direction, or can serve as inspiration for bulk updates or group additions (i.e. not loading unneeded data that may slow down the update procedure).

szoupanos commented 6 years ago

So, here a session with expire_on_commit=False helps a lot to improve the SQLA time.

(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p dj_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 1.1341240406, Throughput (time per 100 nodes): 1.1341240406
Counter: 200, Time elspsed in ms: 2.20994710922, Throughput (time per 100 nodes): 1.07582306862
Counter: 300, Time elspsed in ms: 3.23935413361, Throughput (time per 100 nodes): 1.02940702438
Counter: 400, Time elspsed in ms: 4.28932714462, Throughput (time per 100 nodes): 1.04997301102
Counter: 500, Time elspsed in ms: 5.33309102058, Throughput (time per 100 nodes): 1.04376387596
Counter: 600, Time elspsed in ms: 6.32132196426, Throughput (time per 100 nodes): 0.98823094368
Counter: 700, Time elspsed in ms: 7.25477910042, Throughput (time per 100 nodes): 0.933457136154
Counter: 800, Time elspsed in ms: 8.17565703392, Throughput (time per 100 nodes): 0.920877933502
Counter: 900, Time elspsed in ms: 9.10686802864, Throughput (time per 100 nodes): 0.93121099472
Counter: 1000, Time elspsed in ms: 10.1082999706, Throughput (time per 100 nodes): 1.00143194199
Counter: 1100, Time elspsed in ms: 11.1209850311, Throughput (time per 100 nodes): 1.0126850605
Counter: 1200, Time elspsed in ms: 12.0926411152, Throughput (time per 100 nodes): 0.971656084061
Counter: 1300, Time elspsed in ms: 13.1262869835, Throughput (time per 100 nodes): 1.0336458683
Time elapsed in ms 13.1263301373
(aiidapy) aiida@ubuntu-aiida-vm1:~/cif_leo_2$ verdi -p sqla_set_update run extra_update.py
Performing query
Adding extras
Counter: 100, Time elspsed in ms: 1.05093502998, Throughput (time per 100 nodes): 1.05093502998
Counter: 200, Time elspsed in ms: 2.12162089348, Throughput (time per 100 nodes): 1.07068586349
Counter: 300, Time elspsed in ms: 3.0334148407, Throughput (time per 100 nodes): 0.91179394722
Counter: 400, Time elspsed in ms: 3.91820883751, Throughput (time per 100 nodes): 0.884793996811
Counter: 500, Time elspsed in ms: 4.90498399734, Throughput (time per 100 nodes): 0.986775159836
Counter: 600, Time elspsed in ms: 5.93195605278, Throughput (time per 100 nodes): 1.02697205544
Counter: 700, Time elspsed in ms: 6.99793696404, Throughput (time per 100 nodes): 1.06598091125
Counter: 800, Time elspsed in ms: 7.98858785629, Throughput (time per 100 nodes): 0.990650892258
Counter: 900, Time elspsed in ms: 8.99393391609, Throughput (time per 100 nodes): 1.0053460598
Counter: 1000, Time elspsed in ms: 9.99040985107, Throughput (time per 100 nodes): 0.996475934982
Counter: 1100, Time elspsed in ms: 10.9346079826, Throughput (time per 100 nodes): 0.944198131561
Counter: 1200, Time elspsed in ms: 11.9228379726, Throughput (time per 100 nodes): 0.988229990005
Counter: 1300, Time elspsed in ms: 12.8135650158, Throughput (time per 100 nodes): 0.890727043152
Time elapsed in ms 12.8142528534

According to the SQLA documentation http://docs.sqlalchemy.org/en/latest/orm/session_basics.html#what-does-the-session-do

Another behavior of commit() is that by default it expires the state of all instances present after the commit is complete. This is so that when the instances are next accessed, either through attribute access or by them being present in a Query result set, they receive the most recent state. To disable this behavior, configure sessionmaker with expire_on_commit=False.

On every set_extra() we have a commit(), forcing the session to expire all the objects in that session, so that the ones accessed afterwards are refreshed with the latest DB value.

We should consider whether this is the behaviour that we really want, though. P.S.: all the tests pass with expire_on_commit=False.
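For reference, this is how the flag is set at the SQLAlchemy level (a generic sketch with a placeholder connection string, not the actual aiida-core session factory):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('postgresql://user:password@localhost/aiidadb')  # placeholder DSN

# With expire_on_commit=False the session keeps the loaded state of its objects
# after each commit, instead of expiring them and re-fetching from the DB on
# the next attribute access.
Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()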

szoupanos commented 6 years ago

Moreover, given the fact that in SQLA we have one indefinite session, I don't see many problems at first glance. A problem could arise if something we are trying to manipulate is changed outside that session (e.g. the daemon updates the value of some data or calculation). But how often is this the case, and why can't we wrap it with Martin's ModelWrapper (aiida_core/aiida/orm/implementation/sqlalchemy/utils.py)?

Let's also invite in this discussion @giovannipizzi @muhrin @sphuber

ltalirz commented 6 years ago

@szoupanos Thanks for investigating this and getting to the source of the problem!

What we definitely need is a straightforward way to disable expire_on_commit in certain scenarios. Would that be possible, even if you keep expire_on_commit=True as the default?
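One possible shape for such an escape hatch is a context manager that temporarily toggles the flag on the existing session (just a sketch with a hypothetical name; a very similar helper, disable_expire_on_commit, is imported from aiida.storage.psql_dos.orm.utils in the script further down this thread):

from contextlib import contextmanager

@contextmanager
def temporarily_disable_expire_on_commit(session):
    """Turn off expire-on-commit for the duration of the block, then restore the previous setting."""
    original = session.expire_on_commit
    try:
        session.expire_on_commit = False
        yield session
    finally:
        session.expire_on_commit = original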

As for what should be the default: I guess this is a trade-off between being sure to always get the latest data and performance. The downside of keeping it turned on is that inexperienced users will run into performance bottlenecks and not know what the problem is (even for us it was really difficult to figure out what was going on!). Also, we would need to go through the code and figure out all the places where we need to disable it (e.g. verdi export might be one...).

So, I'd be in favor of disabling it completely for the moment.

Edit: Talking about backend independence - does Django have a similar feature? I guess not...

szoupanos commented 6 years ago

I am trying to think of cases where expire_on_commit=False is bad and would result in problems for the user, but I have not found one. Keep in mind that from the moment you call verdi until you exit, you use one session and you are the only user of that session.

Cases that I can imagine involve objects changed by different sessions (e.g. verdi shell & daemon), but these seem to be specific scenarios that we can work on.

I am tempted to set it to False by default and make it possible to set it to True by passing a flag to verdi. @giovannipizzi and @huber are setting up a server for workflow tests, so I suppose that would be a good way to stress-check whether expire_on_commit=False could cause any problems that we cannot imagine.

sphuber commented 1 year ago

I revisited this issue. Since the last discussion, a lot has happened; for example, the Django backend no longer exists and the default storage backend PsqlDosStorage uses SqlAlchemy. I think the main cause of the slowdown is that by default no transaction is used, so a database commit is issued after each extra is set, and all objects in the session are expired after every commit since expire_on_commit=True by default.

I tested this with the following script on a profile containing ~50K nodes:

#!/usr/bin/env python
import click
from aiida.cmdline.utils.decorators import with_dbenv

@click.command()
@click.option('--with-transaction', is_flag=True)
@click.option('--disable-eoc', is_flag=True)
@click.option('--repeat', type=int, default=5)
@click.option('--number', type=int, default=5)
@click.argument('count', type=int)
@with_dbenv()
def main(with_transaction, disable_eoc, repeat, number, count):
    import contextlib
    import timeit
    from aiida.manage import get_manager
    from aiida.orm import QueryBuilder, Node
    from aiida.storage.psql_dos.orm.utils import disable_expire_on_commit

    if with_transaction:
        transaction_context = get_manager().get_backend().transaction
    else:
        transaction_context = contextlib.nullcontext

    def set_extras():
        for node, in QueryBuilder().append(Node).limit(count).all():
            node.base.extras.set('_testing_this', True)

    with transaction_context() as session:
        if not with_transaction:
            session = get_manager().get_backend().get_session()

        if disable_eoc:
            with disable_expire_on_commit(session):
                result = timeit.repeat(set_extras, repeat=repeat, number=number)
        else:
            result = timeit.repeat(set_extras, repeat=repeat, number=number)

    print(f'{(min(result) / number):.3f}')

if __name__ == '__main__':
    main()

It's a simple CLI that allows specifying the number of nodes for which to set an extra (an arbitrary test key) in a loop. With the flags, it allows using a transaction and/or disabling expire-on-commit (EOC). The results are as follows (times in seconds):

| count | (no flags) | `--with-transaction` | `--with-transaction --disable-eoc` | `--disable-eoc` |
|------:|------:|------:|------:|------:|
| 100 | 0.943 | 0.189 | 0.158 | 0.825 |
| 200 | 1.786 | 0.264 | 0.534 | 1.262 |
| 400 | 3.689 | 0.373 | 0.446 | 2.982 |
| 800 | 8.082 | 0.691 | 0.698 | 5.904 |
| 1600 | 18.224 | 1.319 | 1.346 | 11.237 |
| 3200 | 55.374 | 3.564 | 3.063 | 23.261 |
| 6400 | 176.556 | 5.202 | 8.822 | 43.955 |

Takeaways:

A big problem of the script is that it uses .all(), just as the original example in the OP. This reads all the results into memory and so makes the effect of the EOC worse. Really, one should be using .iterall(). Since the original discussion, the code has been updated so that .iterall() automatically opens a transaction. If we update the script to use .iterall() instead of .all() and run the exact same test again, we get the following (times in seconds):

| count | (no flags) | `--with-transaction` | `--with-transaction --disable-eoc` | `--disable-eoc` |
|------:|------:|------:|------:|------:|
| 100 | 0.104 | 0.089 | 0.092 | 0.121 |
| 200 | 0.200 | 0.168 | 0.159 | 0.213 |
| 400 | 0.400 | 0.324 | 0.302 | 0.380 |
| 800 | 0.833 | 0.570 | 0.531 | 0.731 |
| 1600 | 1.510 | 1.181 | 1.077 | 1.393 |
| 3200 | 2.769 | 2.835 | 2.644 | 2.833 |
| 6400 | 5.598 | 4.999 | 5.026 | 5.723 |

Takeaways:

Final conclusions

Action

I would suggest that there is not really anything to change in the code; rather, we should make it very clear in the QueryBuilder docs that .iterall() should be preferred over .all() when performance is important. I would definitely not disable EOC by default, as its effect is nullified when transactions are used, and it can cause unexpected behavior otherwise: parallel processes (notebooks, interactive shells, daemon workers) will get out of sync when they are modifying the same data.
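For the docs, the recommended pattern would be something along these lines (a sketch using the current API from the script above; the extra key and batch size are arbitrary):

from aiida.orm import Node, QueryBuilder

# Preferred: iterall() streams the results in batches inside a single
# transaction, so the session is not committed (and expired) once per node.
for node, in QueryBuilder().append(Node).iterall(batch_size=100):
    node.base.extras.set('some_key', 'some_value')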

ltalirz commented 1 year ago

Thanks @sphuber for looking into this in detail, this is a viable solution.

The only question in my mind is: how are we going to get AiiDA users to actually take advantage of this? Is there a way for us to detect cases where they are using all but should be using iterall instead?

Are we actually misusing it ourselves anywhere in aiida-core?

aiida-core$ git grep "\.all(" | grep -v test_
CHANGELOG.md:- `QueryBuilder`: add the `flat` argument to the `.all()` method [[#3945]](https://github.com/aiidateam/aiida-core/pull/3945)
aiida/cmdline/commands/cmd_code.py:        table.append(['Calculations', len(code.base.links.get_outgoing().all())])
aiida/cmdline/commands/cmd_computer.py:        return next(zip(*builder.all()))  # return the first entry
aiida/cmdline/commands/cmd_computer.py:    computers = Computer.collection.all()
aiida/cmdline/commands/cmd_data/cmd_list.py:    return object_list.all()
aiida/cmdline/commands/cmd_group.py:            group_node_pks = query.all(flat=True)
aiida/cmdline/commands/cmd_group.py:        source_group_node_pks = query.all(flat=True)
aiida/cmdline/commands/cmd_group.py:    target_group_node_pks = query.all(flat=True)
aiida/cmdline/commands/cmd_group.py:    result = builder.all()
aiida/cmdline/commands/cmd_node.py:        to_hash = builder.all()
aiida/cmdline/commands/cmd_node.py:        all_comments = node.base.comments.all()
aiida/cmdline/commands/cmd_rabbitmq.py:    ).all(flat=True)
aiida/cmdline/commands/cmd_user.py:    echo.echo_formatted_list(User.collection.all(), attributes, sort=sort, highlight=highlight)
aiida/cmdline/utils/ascii_vis.py:            for triple in caller.base.links.get_outgoing(link_type=(LinkType.CALL_CALC, LinkType.CALL_WORK)).all()
aiida/cmdline/utils/common.py:        links = sorted(nodes_caller.all(), key=lambda x: x.node.ctime)
aiida/cmdline/utils/common.py:        links = sorted(nodes_called.all(), key=lambda x: x.node.ctime)
aiida/cmdline/utils/common.py:        result = builder.all(flat=True)
aiida/cmdline/utils/echo.py:        echo.echo(tabulate.tabulate(qb.all()))
aiida/engine/processes/control.py:    return builder.all(flat=True)
aiida/engine/processes/functions.py:    >>> r.base.links.get_incoming().all() # doctest: +SKIP
aiida/engine/processes/functions.py:    >>> r.base.links.get_incoming().all() # doctest: +SKIP
aiida/orm/autogroup.py:        existing_group_labels = [res[0][len(label_prefix):] for res in queryb.all()]
aiida/orm/entities.py:        return cast(List[EntityType], query.all(flat=True))
aiida/orm/entities.py:        return cast(List[EntityType], self.query().all(flat=True))  # pylint: disable=no-member
aiida/orm/nodes/caching.py:            node for node in builder.all(flat=True) if node.base.caching.is_valid_cache
aiida/orm/nodes/data/array/bands.py:    bands_list_data = q_build.all()
aiida/orm/nodes/data/array/bands.py:    list_data = q_build.distinct().all()
aiida/orm/nodes/data/cif.py:        if atoms.pbc.all():
aiida/orm/nodes/data/cif.py:        return builder.all(flat=True)
aiida/orm/nodes/data/code/legacy.py:            codes = query.all(flat=True)
aiida/orm/nodes/data/code/legacy.py:        valid_codes = query.all(flat=True)
aiida/orm/nodes/data/upf.py:        return builder.all(flat=True)
aiida/orm/nodes/data/upf.py:        return query.all(flat=True)
aiida/orm/nodes/data/upf.py:        return builder.all(flat=True)
aiida/orm/nodes/links.py:        return [LinkTriple(entry[0], LinkType(entry[1]), entry[2]) for entry in builder.all()]
aiida/orm/nodes/node.py:        if node.base.links.get_incoming().all():
aiida/orm/nodes/node.py:        if node.base.links.get_outgoing().all():
aiida/orm/nodes/node.py:        if cache_node.base.links.get_outgoing(link_type=LinkType.RETURN).all():
aiida/orm/querybuilder.py:        results = qb.all()
aiida/orm/querybuilder.py:            qb.distinct().all() #or
aiida/orm/querybuilder.py:            res = self.all()
aiida/orm/utils/links.py:    if outdegree == 'unique' and source.base.links.get_outgoing(link_type=link_type, only_uuid=True).all():
aiida/orm/utils/links.py:            link_type=link_type, only_uuid=True, link_label_filter=link_label).all():
aiida/orm/utils/links.py:    if indegree == 'unique' and target.base.links.get_incoming(link_type=link_type, only_uuid=True).all():
aiida/orm/utils/links.py:            link_type=link_type, link_label_filter=link_label, only_uuid=True).all():
aiida/orm/utils/links.py:        LinkManager.all(): returns all entries from list
aiida/orm/utils/links.py:        return [entry.node for entry in self.all()]
aiida/orm/utils/links.py:        return [LinkPair(entry.link_type, entry.link_label) for entry in self.all()]
aiida/orm/utils/links.py:        return [entry.link_label for entry in self.all()]
aiida/orm/utils/loaders.py:        return builder.all()
aiida/orm/utils/remote.py:    for remote_data, computer_uuid in query.all():
aiida/restapi/common/identifiers.py:    unique_types = {(node_type, process_type if process_type else '') for node_type, process_type in builder.all()}
aiida/restapi/translator/nodes/node.py:        comments = node.base.comments.all()
aiida/storage/psql_dos/migrations/versions/django_0026_trajectory_symbols_to_attribute.py:    ).all()
aiida/storage/psql_dos/migrations/versions/django_0027_delete_trajectory_symbols_array.py:    ).all()
aiida/storage/psql_dos/migrations/versions/django_0037_attributes_extras_settings_json.py:            for node in conn.execute(select(node_tbl)).all():
aiida/storage/psql_dos/migrations/versions/django_0037_attributes_extras_settings_json.py:                attr_list = conn.execute(select(attr_tbl).where(attr_tbl.c.dbnode_id == node.id)).all()
aiida/storage/psql_dos/migrations/versions/django_0037_attributes_extras_settings_json.py:                extra_list = conn.execute(select(extra_tbl).where(extra_tbl.c.dbnode_id == node.id)).all()
aiida/storage/psql_dos/migrations/versions/django_0037_attributes_extras_settings_json.py:            for setting in conn.execute(select(setting_tbl)).all():
aiida/storage/psql_dos/models/node.py:        return self.outputs_q.all()
aiida/storage/psql_dos/models/node.py:        return self.inputs_q.all()  # pylint: disable=no-member
aiida/storage/psql_dos/orm/comments.py:        entities_to_delete = builder.all(flat=True)
aiida/storage/psql_dos/orm/computers.py:        return session.query(self.ENTITY_CLASS.MODEL_CLASS.label).all()
aiida/storage/psql_dos/orm/logs.py:        entities_to_delete = builder.all(flat=True)
aiida/storage/psql_dos/orm/querybuilder/main.py:        retdict['types'] = dict(types_query.group_by('typestring').all())
aiida/storage/psql_dos/orm/querybuilder/main.py:        stat = stat_query.group_by('cday').order_by('cday').all()
aiida/tools/archive/create.py:                                  project='id').all(  # type: ignore[arg-type]
aiida/tools/archive/create.py:        link_data = {LinkQuadruple(*row) for row in qbuilder.all(batch_size=batch_size)}
aiida/tools/archive/create.py:            ).all(batch_size=batch_size, flat=True)
aiida/tools/archive/create.py:        group_nodes: List[Tuple[int, int]] = qbuilder.all(batch_size=batch_size)  # type: ignore[assignment]
aiida/tools/archive/create.py:            ).all(batch_size=batch_size, flat=True)
aiida/tools/archive/create.py:                ).all(batch_size=batch_size, flat=True)
aiida/tools/archive/create.py:                ).all(batch_size=batch_size, flat=True)
aiida/tools/archive/create.py:                ).all(batch_size=batch_size, flat=True)
aiida/tools/archive/create.py:            ).all(batch_size=batch_size, flat=True)
aiida/tools/archive/create.py:            group_nodes = qbuilder.all(batch_size=batch_size)  # type: ignore[assignment]
aiida/tools/archive/create.py:    unsealed_node_pks = qbuilder.all(batch_size=batch_size, flat=True)
aiida/tools/archive/imports.py:    input_id_email = dict(qbuilder.append(orm.User, project=['id', 'email']).all(batch_size=query_params.batch_size))
aiida/tools/archive/imports.py:            }, project=['email', 'id']).all(batch_size=query_params.batch_size)
aiida/tools/archive/imports.py:    input_id_uuid = dict(qbuilder.append(orm.Computer, project=['id', 'uuid']).all(batch_size=query_params.batch_size))
aiida/tools/archive/imports.py:            }, project=['uuid', 'id']).all(batch_size=query_params.batch_size)
aiida/tools/archive/imports.py:        ).all(batch_size=query_params.batch_size)
aiida/tools/archive/imports.py:                                for _, user_id, comp_id in qbuilder.all(batch_size=query_params.batch_size)
aiida/tools/archive/imports.py:    input_id_uuid = dict(qbuilder.append(orm.Node, project=['id', 'uuid']).all(batch_size=query_params.batch_size))
aiida/tools/archive/imports.py:            }, project=['uuid', 'id']).all(batch_size=query_params.batch_size)
aiida/tools/archive/imports.py:    input_id_uuid = dict(qbuilder.append(orm.Log, project=['id', 'uuid']).all(batch_size=query_params.batch_size))
aiida/tools/archive/imports.py:            }, project=['uuid', 'id']).all(batch_size=query_params.batch_size)
aiida/tools/archive/imports.py:    input_id_uuid = dict(qbuilder.append(orm.Comment, project=['id', 'uuid']).all(batch_size=query_params.batch_size))
aiida/tools/archive/imports.py:            }, project=['uuid', 'id']).all(batch_size=query_params.batch_size)
aiida/tools/archive/imports.py:    input_id_uuid = dict(qbuilder.append(orm.Group, project=['id', 'uuid']).all(batch_size=query_params.batch_size))
aiida/tools/archive/imports.py:            }, project=['uuid', 'id']).all(batch_size=query_params.batch_size)
aiida/tools/graph/deletions.py:    node_pks = group_node_query.all(flat=True)
aiida/tools/graph/graph_traversers.py:    existing_pks = set(query_nodes.all(flat=True))
aiida/tools/groups/paths.py:        return query.all(flat=True)
aiida/tools/groups/paths.py:        for (label, node) in query.iterall(query_batch) if query_batch else query.all():
aiida/tools/visualization/graph.py:        traversed_nodes = {query_result[0]: query_result[1] for query_result in query.all()}
aiida/tools/visualization/graph.py:        traversed_nodes = {query_result[0]: query_result[1] for query_result in query.all()}
aiida/tools/visualization/graph.py:        traversed_nodes = {query_result[0]: query_result[1] for query_result in query.all()}
aiida/tools/visualization/graph.py:        traversed_nodes = {query_result[0]: query_result[1] for query_result in query.all()}
docs/source/developer_guide/core/internals.rst:  In [2]: c.get_incoming().all()
docs/source/developer_guide/core/internals.rst:  In [3]: c.get_outgoing().all()
docs/source/howto/data.rst:    In [1]: QueryBuilder().append(Group, filters={'type_string': 'core'}).all(flat=True)
docs/source/howto/data.rst:    group.add_nodes(q.all(flat=True))
docs/source/howto/data.rst:    for structure in qb.all(flat=True):
docs/source/howto/data.rst:The return value of ``qb.all(flat=True)`` would contain all the structures matching the above mentioned criteria.
docs/source/howto/query.rst:    all_results_l = qb.all()            # Returns a list of lists
docs/source/howto/query.rst:For the query above, ``qb.all()`` will return a list of lists, for which each element corresponds to one entity and contains two items: the PK of the ``Int`` node and its value.
docs/source/howto/query.rst:    Be aware that for consistency, ``QueryBuilder.all()`` / ``iterall()`` always returns a list of lists, even if you only project one property of a single entity.
docs/source/howto/query.rst:    Use ``QueryBuilder.all(flat=True)`` to return the query result as a flat list in this case.
docs/source/howto/query.rst:    for link_triple in node.get_incoming().all():
tests/tools/archive/utils.py:    return builder.all()
ltalirz commented 1 year ago

These look like potential candidates (although not very relevant ones):

aiida/orm/utils/links.py:        return [entry.node for entry in self.all()]
aiida/orm/utils/links.py:        return [LinkPair(entry.link_type, entry.link_label) for entry in self.all()]
aiida/orm/utils/links.py:        return [entry.link_label for entry in self.all()]

The difficulty I guess is in functions that return qb.all(); the question there is whether one would actually want to have an iterall version of that function as well...
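For illustration, one possible shape (with hypothetical helper names) would be to offer a generator-returning variant next to the list-returning one:

from aiida.orm import QueryBuilder

def get_node_pks(query: QueryBuilder) -> list:
    """List-returning variant: loads every result into memory at once."""
    return query.all(flat=True)

def iter_node_rows(query: QueryBuilder, batch_size: int = 100):
    """Generator variant: streams the projected rows in batches via iterall()."""
    yield from query.iterall(batch_size=batch_size)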

sphuber commented 1 year ago

Is there a way for us to detect cases where they are using all but should be using iterall instead?

Not sure. There are legitimate use cases for using .all(), so we cannot issue a warning by default. We would have to check whether it is being used in a for loop, but I don't think that is possible.

Are we actually misusing it ourselves anywhere in aiida-core?

I went through the codebase and most of it seems OK. I essentially checked whether the result was being iterated over and whether the entire content actually needed to be kept in memory. Of course this doesn't cover cases of functions that return the entire list instead of a generator, or code that could be improved with a larger refactoring. With the simple rule above, I found a few places that could be improved in a straightforward manner: https://github.com/aiidateam/aiida-core/pull/6130