aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
436 stars 191 forks source link

Output link missing (transient) when running workflows under heavy load #6598

Closed zhubonan closed 2 weeks ago

zhubonan commented 3 weeks ago

Describe the bug

My workchains inspect the outputs of previously launched workchains. When I run them under heavy load (~800 processes), some of them went into Excepted state caused by the previously launched workchains' output link being missing when trying to inspect the results.

However, the output of the previously launched workchain does exists, it was just that the daemon running the workchain did not find it at the time. I can rerun the same workchain using the same input with run_get_node and it would finish without error.

Such error happen occaptionally (20%). I suspect it was due to some issue when syncing the daemons workers. For example, perhaps the workchain whose outputs were missing are marked as Finished prematurely before its output links are attached (and commited to the database).

Here is an example:

$ verdi process status 153962
VaspHybridBandsWorkChain<153962> Excepted [5:inspect_and_combine_bands]
    ├── VaspRelaxWorkChain<153965> Finished [0] [5:finalize]
    │   └── VaspWorkChain<153968> Finished [0] [4:results]
    │       └── VaspCalculation<153971> Finished [0]
    ├── kpath_from_sumo_v2<153987> Finished [0]
    ├── extract_kpoints_from_retrieved<153990> Finished [0]
    ├── split_kpoints<153993> Finished [0]
    └── VaspWorkChain<153997> Finished [0] [4:results]
        └── VaspCalculation<153999> Finished [0]

Output of the verdi process report:

$ verdi process report 153962
2024-11-02 15:43:01 [41104 | REPORT]:   [153965|VaspRelaxWorkChain|init_relaxed]: skipping structure relaxation and forwarding input to the next workchain.
2024-11-02 15:43:01 [41105 | REPORT]:   [153965|VaspRelaxWorkChain|run_static_calculation]: launching VaspWorkChain<153968> iterations #1
2024-11-02 15:43:01 [41106 | REPORT]:     [153968|VaspWorkChain|run_process]: launching VaspCalculation<153971> iteration #1
2024-11-02 15:46:11 [41128 | REPORT]:     [153968|VaspWorkChain|results]: work chain completed after 1 iterations
2024-11-02 15:46:11 [41129 | REPORT]: [153962|VaspHybridBandsWorkChain|generate_path]: The primitive structure is not the same as the input structure - using the former for all calculations from now.
2024-11-02 15:46:11 [41130 | REPORT]: [153962|VaspHybridBandsWorkChain|make_splitted_kpoints]: Extracted SCF kpoints from retrieved vasprun.xml of <uuid: 8286777e-8e0b-428c-b6cf-d765ff6e1d1f (pk: 153965) (aiida.workflows:vasp.relax)>.
2024-11-02 15:46:12 [41131 | REPORT]: [153962|VaspHybridBandsWorkChain|run_scf_multi]: Checking the magnetization of the relaxed structure.
2024-11-02 15:46:12 [41132 | REPORT]: [153962|VaspHybridBandsWorkChain|run_scf_multi]: Turnning off spin polarization for band structure calculation for non-magnetic system.
2024-11-02 15:46:12 [41133 | REPORT]: [153962|VaspHybridBandsWorkChain|run_scf_multi]: launching VaspWorkChain<153997> for split #0
2024-11-02 15:46:13 [41134 | REPORT]:   [153997|VaspWorkChain|run_process]: launching VaspCalculation<153999> iteration #1
2024-11-02 15:46:13 [41135 | REPORT]:     [153968|VaspWorkChain|on_terminated]: cleaned remote folders of calculations: 153971
2024-11-02 16:06:45 [41321 | REPORT]:   [153997|VaspWorkChain|results]: work chain completed after 1 iterations
2024-11-02 16:06:45 [41322 | REPORT]: [153962|VaspHybridBandsWorkChain|inspect_and_combine_bands]: Extracting output bandstructure from 1 workchains.
2024-11-02 16:06:45 [41323 | REPORT]: [153962|VaspHybridBandsWorkChain|on_except]: Traceback (most recent call last):
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/aiida/orm/utils/managers.py", line 88, in _get_node_by_link_label
    node = attribute_dict[label]
           ~~~~~~~~~~~~~~^^^^^^^
KeyError: 'bands'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/aiida/orm/utils/managers.py", line 133, in __getattr__
    return self._get_node_by_link_label(label=name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/aiida/orm/utils/managers.py", line 116, in _get_node_by_link_label
    raise NotExistent from exception
aiida.common.exceptions.NotExistent

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/plumpy/process_states.py", line 228, in execute
    result = self.run_fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/aiida/engine/processes/workchains/workchain.py", line 319, in _do_step
    finished, stepper_result = self._stepper.step()
                               ^^^^^^^^^^^^^^^^^^^^
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/plumpy/workchains.py", line 295, in step
    finished, result = self._child_stepper.step()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/plumpy/workchains.py", line 246, in step
    return True, self._fn(self._workchain)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bonan/aiida_env/aiida-2.0/aiida-vasp/src/aiida_vasp/workchains/v2/bands.py", line 801, in inspect_and_combine_bands
    kwargs[f'band_{link_idx:03d}'] = work.outputs.bands
                                     ^^^^^^^^^^^^^^^^^^
  File "/home/bonan/miniconda3/envs/aiida/lib/python3.12/site-packages/aiida/orm/utils/managers.py", line 139, in __getattr__
    raise NotExistentAttributeError(
aiida.common.exceptions.NotExistentAttributeError: Node<153997> does not have an output with link label 'bands'

But the node of the interest 153997 does have the output link:

$ verdi process show 153997
Property     Value
-----------  ----------------------------------------------
type         VaspWorkChain
state        Finished [0]
pk           153997
uuid         a09a773f-0f1f-40f6-8b43-b8dee013dc3e
label        XXXXXXXXXXXXXXXX
description
ctime        2024-11-02 15:46:12.253137+08:00
mtime        2024-11-02 16:06:45.928962+08:00

Inputs                 PK  Type
-----------------  ------  -------------
clean_workdir      153958  Bool
code                74047  InstalledCode
kpoints            153994  KpointsData
kpoints_spacing    153954  Float
max_iterations     153957  Int
options            153955  Dict
parameters         153995  Dict
potential_family   153952  Str
potential_mapping  153953  Dict
settings           153996  Dict
structure          153988  StructureData
verbose            153959  Bool

Outputs            PK  Type
-------------  ------  ----------
bands          154583  BandsData
misc           154582  Dict
remote_folder  154000  RemoteData
retrieved      154581  FolderData

Caller                       PK  Type
-----------------------  ------  ------------------------
bandstructure_split_000  153962  VaspHybridBandsWorkChain

Called            PK  Type
------------  ------  ---------------
iteration_01  153999  VaspCalculation

Log messages
---------------------------------------------
There are 3 log messages for this calculation
Run 'verdi process report 153997' to see them

Steps to reproduce

Steps to reproduce the behavior:

Try to run lots of workchain with steps that inspects the output of launched sub-workchains.

Expected behavior

The workchain should not except as the launched calculation finished outwith error.

Your environment

Other relevant software versions, e.g. Postres & RabbitMQ:

RabbitMQ v3.9.13 psql (PostgreSQL) 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1)

Additional context

The workchain's source code can be found here:

https://github.com/aiida-vasp/aiida-vasp/blob/4a3fc121de57db29c8ebaa6ec586451626cdb739/src/aiida_vasp/workchains/v2/bands.py#L606

agoscinski commented 3 weeks ago

Hello @zhubonan, skimming through the issue it looks like a bug we recently fixed https://github.com/aiidateam/aiida-core/issues/6579 that was introduced in 2.6.2. We will release a new version this week that includes the fix. I will update the issue once it is released. Temporary fix would be to downgrade to 2.6.1.

zhubonan commented 2 weeks ago

Thanks! Glad to know it is already fixed in the upcoming version. My temporary fix was to rerun with caching turned on, and in most cases they finished fine (perhaps by chance) 😄 .

agoscinski commented 2 weeks ago

I am not 100% sure if it is this fix, but what you describe sounds very like the problem we faced in the bug I referenced. We just released a new version 2.6.3 with the fix. Please let us know if it fixes it and if not we'll look more into it.