Closed sphuber closed 6 months ago
Just to report that I probably ran into this issue while running my work chain.
I am using AiiDA 1.6.5 on Ubuntu 20.04, with 6 daemon workers running at the same time.
I submitted an Isotherm work chain that failed with the following error message:
$ verdi process report 547
...
2021-10-27 07:40:11 [1005 | REPORT]: [2850|RaspaBaseWorkChain|run_process]: launching RaspaCalculation<2858> iteration #1
2021-10-27 08:01:16 [1137 | REPORT]: [2850|RaspaBaseWorkChain|results]: work chain completed after 1 iterations
2021-10-27 08:01:16 [1138 | REPORT]: [2850|RaspaBaseWorkChain|results]: work chain completed after 1 iterations
2021-10-27 08:01:16 [1139 | ERROR]: Traceback (most recent call last):
File "/home/xiaoqi/aiida1/aiida-core/aiida/manage/external/rmq.py", line 208, in _continue
result = await super()._continue(communicator, pid, nowait, tag)
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/process_comms.py", line 613, in _continue
await proc.step_until_terminated()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/processes.py", line 1230, in step_until_terminated
await self.step()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/processes.py", line 1216, in step
self.transition_to(next_state)
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
raise exception.with_traceback(trace)
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 320, in transition_to
self._enter_next_state(new_state)
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 386, in _enter_next_state
self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 299, in _fire_state_event
callback(self, hook, state)
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/processes.py", line 326, in <lambda>
lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/process.py", line 387, in on_entered
self.update_node_state(self._state)
File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/process.py", line 610, in update_node_state
self.update_outputs()
File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/process.py", line 633, in update_outputs
output.add_incoming(self.node, LinkType.RETURN, link_label)
File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/nodes/node.py", line 802, in add_incoming
self.validate_incoming(source, link_type, link_label)
File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/nodes/node.py", line 827, in validate_incoming
validate_link(source, self, link_type, link_label)
File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/links.py", line 165, in validate_link
raise ValueError(f'node<{source.uuid}> already has an outgoing {link_type} link with label "{link_label}"')
ValueError: node<7f185abb-cc9b-49b5-a1e4-f0011e655400> already has an outgoing LinkType.RETURN link with label "remote_folder"
2021-10-27 08:01:31 [1140 | REPORT]: [547|IsothermAccurateWorkChain|on_except]: Traceback (most recent call last):
File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 93, in _get_node_by_link_label
node = attribute_dict[label]
KeyError: 'output_parameters'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 141, in __getattr__
return self._get_node_by_link_label(label=name)
File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 120, in _get_node_by_link_label
raise NotExistent from exception
aiida.common.exceptions.NotExistent
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/process_states.py", line 231, in execute
result = self.run_fn(*self.args, **self.kwargs)
File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/workchains/workchain.py", line 214, in _do_step
finished, stepper_result = self._stepper.step()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 299, in step
finished, result = self._child_stepper.step()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 432, in step
finished, retval = self._child_stepper.step()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 299, in step
finished, result = self._child_stepper.step()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 432, in step
finished, retval = self._child_stepper.step()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 299, in step
finished, result = self._child_stepper.step()
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 527, in step
if self._while_instruction.is_true(self._workchain):
File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 395, in is_true
return self._predicate(workflow)
File "/home/xiaoqi/aiida1/aiida-lsmo/aiida_lsmo/workchains/isotherm_accurate.py", line 525, in should_run_another_gcmc_highp
self.ctx.gcmc_loading_average = self._get_last_loading_molkg()
File "/home/xiaoqi/aiida1/aiida-lsmo/aiida_lsmo/workchains/isotherm_accurate.py", line 470, in _get_last_loading_molkg
gcmc_out_param = self.ctx.raspa_gcmc[-1].outputs.output_parameters
File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 149, in __getattr__
) from exception
aiida.common.exceptions.NotExistentAttributeError: Node<2850> does not have an output with link label 'output_parameters'
It appears to me that two workers were trying to attach the same output link to the RaspaBaseWorkChain, and the second one failed with: ValueError: node<7f185abb-cc9b-49b5-a1e4-f0011e655400> already has an outgoing LinkType.RETURN link with label "remote_folder"
Note also that the message "work chain completed after 1 iterations" appears twice in the report, which should not happen.
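To make the failure mode concrete, here is a hypothetical toy sketch (these are not AiiDA's actual classes) of the uniqueness check that validate_link enforces: a node may carry at most one outgoing RETURN link per label, so whichever worker records the output second hits the ValueError:

```python
# Toy stand-in for an AiiDA node's link bookkeeping (hypothetical names;
# AiiDA's real Node/validate_link implementation is more involved).
class ToyNode:
    def __init__(self, uuid):
        self.uuid = uuid
        self.outgoing = {}  # (link_type, label) -> target node

    def add_outgoing(self, target, link_type, label):
        key = (link_type, label)
        if key in self.outgoing:
            # Mirrors the uniqueness check in aiida/orm/utils/links.py
            raise ValueError(
                f'node<{self.uuid}> already has an outgoing {link_type} '
                f'link with label "{label}"'
            )
        self.outgoing[key] = target


source = ToyNode('7f185abb')
workchain = ToyNode('2850')

# Worker 1 finishes the process and records the output link.
source.add_outgoing(workchain, 'RETURN', 'remote_folder')

# Worker 2, handed the requeued task for the same process, tries again.
try:
    source.add_outgoing(workchain, 'RETURN', 'remote_folder')
except ValueError as exc:
    print(exc)  # second attempt is rejected
```

The first worker's link survives; only the duplicate attempt fails, which matches the traceback above.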
Thanks for the report. Quite a bit has changed since I originally opened this issue to prevent this from happening, but although it should now be unlikely, I don't think we can exclude it entirely. Still, we see this very rarely: I have personally never run into it, and we hardly receive any reports, though maybe people simply aren't reporting it when they encounter it. It is a tricky problem. I have never gone down the route of implementing some kind of lock mechanism on our side (as suggested in this issue), because doing this properly is very tricky and I suspect we would just create a bunch of new, potentially worse, problems. So I am not sure how best to approach this.
Due to the limited heartbeat timeout set by the default configuration of the RabbitMQ server (see #1821), the connections of daemon workers are often closed by the server. RabbitMQ guarantees that each task it receives will be completed, so as soon as it considers a worker dead, it requeues the tasks that that worker had not yet acknowledged, and those tasks are eventually sent to another worker. If the original worker is in fact still alive and only missed its heartbeat because it was too busy to respond, it may still be working on the very tasks that get resent. Now two workers are working on the same process, leading to race conditions and all sorts of problems, since AiiDA is not designed with concurrent operations on a single node in mind.
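The requeue mechanism can be sketched in miniature. This is a hypothetical simulation, not RabbitMQ's actual implementation; the Broker and Worker classes and the tick-based clock are made up for illustration:

```python
# Toy simulation of the heartbeat/requeue race (hypothetical classes).
class Broker:
    """Requeues unacknowledged tasks once a worker misses its heartbeat."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.queue = []
        self.unacked = {}  # task -> (worker, tick of last heartbeat)

    def publish(self, task):
        self.queue.append(task)

    def deliver(self, worker, now):
        task = self.queue.pop(0)
        self.unacked[task] = (worker, now)
        worker.tasks.append(task)

    def check_heartbeats(self, now):
        # The broker's contract: tasks sent to a worker it presumes dead are
        # requeued, even if that worker is merely too busy to heartbeat.
        for task, (worker, last) in list(self.unacked.items()):
            if now - last > self.heartbeat_timeout:
                del self.unacked[task]
                self.queue.append(task)


class Worker:
    def __init__(self, name):
        self.name = name
        self.tasks = []  # tasks this worker believes it owns


broker = Broker(heartbeat_timeout=2)
broker.publish('continue<2850>')

w1, w2 = Worker('w1'), Worker('w2')
broker.deliver(w1, now=0)        # w1 starts stepping process 2850
broker.check_heartbeats(now=5)   # w1 too busy to heartbeat: task requeued
broker.deliver(w2, now=5)        # same task handed to w2

# Both workers now believe they own the process: the race condition.
print(w1.tasks, w2.tasks)
```

The key point is the last two lines: nothing tells w1 to stop, so both workers proceed to step the same process and both eventually try to attach its output links.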
Therefore, a locking principle needs to be introduced that locks a node for as long as a worker is operating on it. The lock needs to be released when the process finishes or when the worker shuts down or dies.
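One way the proposed lock could look is an advisory per-node lock with a time-to-live, so that a crashed worker cannot hold a node forever. This is purely a sketch of the idea under assumed semantics; the class and method names are invented and this is not how (or whether) AiiDA would implement it:

```python
# Hypothetical advisory lock registry for process nodes (invented names).
import threading
import time


class NodeLockRegistry:
    def __init__(self, ttl=30.0):
        self._ttl = ttl
        self._locks = {}  # node pk -> (owner, expiry timestamp)
        self._mutex = threading.Lock()

    def acquire(self, pk, owner):
        """Try to lock node `pk` for `owner`; re-acquiring renews the TTL."""
        with self._mutex:
            current = self._locks.get(pk)
            if current is not None:
                held_by, expires = current
                if held_by != owner and expires > time.monotonic():
                    return False  # another live worker holds the lock
            self._locks[pk] = (owner, time.monotonic() + self._ttl)
            return True

    def release(self, pk, owner):
        """Release the lock, but only if `owner` actually holds it."""
        with self._mutex:
            if self._locks.get(pk, (None, 0.0))[0] == owner:
                del self._locks[pk]


registry = NodeLockRegistry(ttl=30.0)
assert registry.acquire(2850, 'worker-1')      # first worker gets the lock
assert not registry.acquire(2850, 'worker-2')  # second worker is refused
registry.release(2850, 'worker-1')
assert registry.acquire(2850, 'worker-2')      # free again after release
```

The TTL stands in for the "released when the worker dies" requirement: a worker that stops renewing its lock loses it after the timeout. Doing this correctly across multiple daemon processes (where an in-memory registry like this would not suffice) is exactly the hard part alluded to above.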