zhubonan closed this issue 2 weeks ago
Hello @zhubonan, skimming through the issue, this looks like a bug we recently fixed (https://github.com/aiidateam/aiida-core/issues/6579) that was introduced in 2.6.2. We will release a new version this week that includes the fix, and I will update the issue once it is released. A temporary fix would be to downgrade to 2.6.1.
Thanks! Glad to know it is already fixed in the upcoming version. My temporary fix was to rerun with caching turned on, and in most cases the workchains finished fine (perhaps by chance) 😄.
I am not 100% sure this fix covers it, but what you describe sounds very much like the problem we faced in the bug I referenced. We just released a new version, 2.6.3, with the fix. Please let us know whether it resolves the problem; if not, we'll look into it further.
Describe the bug
My workchains inspect the outputs of previously launched workchains. When I run them under heavy load (~800 processes), some of them go into the `Excepted` state because the output link of a previously launched workchain is missing when they try to inspect the results.
However, the output of the previously launched workchain does exist; the daemon running the workchain simply did not find it at the time. I can rerun the same workchain with the same inputs using `run_get_node` and it finishes without error.
Such errors happen occasionally (20%). I suspect they are due to some issue when syncing the daemon workers. For example, perhaps the workchain whose outputs were missing is marked as Finished prematurely, before its output links are attached (and committed to the database).
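To make the suspected race concrete, here is a minimal, standard-library-only sketch of a retry-with-backoff lookup. It is not AiiDA code and not the fix adopted upstream: `fetch_with_retry`, `OutputNotReady`, and `FakeChild` are hypothetical names, and `FakeChild` only simulates a child process whose output becomes visible a few lookups after it is reported as finished.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class OutputNotReady(Exception):
    """Raised when the output is still missing after all attempts."""


def fetch_with_retry(fetch: Callable[[], Optional[T]], retries: int = 5, delay: float = 0.1) -> T:
    """Call `fetch` until it returns something other than None.

    `fetch` is a hypothetical callable; in a workchain step it could wrap the
    lookup of an output link on a finished child node.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries - 1:
            # Exponential backoff between attempts: delay, 2*delay, 4*delay, ...
            time.sleep(delay * 2 ** attempt)
    raise OutputNotReady(f"output still missing after {retries} attempts")


class FakeChild:
    """Simulated child whose output only becomes visible on a later lookup."""

    def __init__(self, visible_after: int):
        self.calls = 0
        self.visible_after = visible_after

    def get_output(self) -> Optional[dict]:
        self.calls += 1
        if self.calls >= self.visible_after:
            return {"bands": "placeholder"}
        return None


# The output is "missing" on the first two lookups and present on the third.
child = FakeChild(visible_after=3)
result = fetch_with_retry(child.get_output, retries=5, delay=0.01)
```

A blocking sleep like this would stall a daemon worker, so it is only meant to illustrate the timing problem, not to recommend a workaround inside a real workchain.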
Here is an example:
Output of `verdi process report`:
But the node of interest, 153997, does have the output link:
Steps to reproduce
Steps to reproduce the behavior:
Run many workchains with steps that inspect the outputs of launched sub-workchains.
Expected behavior
The workchain should not except, as the launched calculation finished without error.
Your environment
Other relevant software versions, e.g. Postgres & RabbitMQ:
RabbitMQ v3.9.13, psql (PostgreSQL) 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1)
Additional context
The workchain's source code can be found here:
https://github.com/aiida-vasp/aiida-vasp/blob/4a3fc121de57db29c8ebaa6ec586451626cdb739/src/aiida_vasp/workchains/v2/bands.py#L606