aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Calculation nodes need to be locked to prevent multiple workers running it #1823

Status: closed by sphuber 6 months ago

sphuber commented 6 years ago

Due to the short heartbeat timeout set by the default configuration of the RabbitMQ server (see #1821), the server often closes the connections of daemon workers. RabbitMQ guarantees that each task it receives will be completed, so as soon as it considers a process dead, it requeues any tasks that process had not yet acknowledged, and those tasks will eventually be sent out to another worker. If the worker is in fact still alive and only missed its heartbeat because it was too busy to respond, it may still be working on the very tasks that get resent. The result is two workers operating on the same process, leading to race conditions and all sorts of problems, since AiiDA is not designed with concurrent operations on a single node in mind.
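The failure mode above can be shown with a minimal, self-contained simulation. This is not RabbitMQ code; the `Broker` and `Worker` classes are toy stand-ins for the broker's at-least-once delivery contract and its heartbeat-based liveness check:

```python
class Broker:
    """Toy stand-in for a broker that guarantees at-least-once delivery."""

    def __init__(self):
        self.unacked = {}  # task -> worker currently holding it
        self.queue = []    # tasks waiting for redelivery

    def deliver(self, task, worker):
        self.unacked[task] = worker
        worker.tasks.append(task)

    def heartbeat_timeout(self, worker):
        # The broker believes the worker is dead: requeue its unacked tasks.
        for task, holder in list(self.unacked.items()):
            if holder is worker:
                del self.unacked[task]
                self.queue.append(task)


class Worker:
    def __init__(self, name):
        self.name = name
        self.tasks = []


broker = Broker()
w1, w2 = Worker('w1'), Worker('w2')

broker.deliver('run process 2850', w1)
# w1 misses its heartbeat because it is too busy; the broker requeues...
broker.heartbeat_timeout(w1)
# ...and redelivers the task to w2, while w1 never stopped working on it.
broker.deliver(broker.queue.pop(), w2)

holders = [w.name for w in (w1, w2) if 'run process 2850' in w.tasks]
print(holders)  # ['w1', 'w2'] -- two workers now hold the same task
```

Nothing in this sketch is wrong on the broker's side: from its point of view, requeueing is the only safe response to a presumed-dead consumer. The race arises purely because the consumer was slow, not dead.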

Therefore, a locking mechanism needs to be introduced that keeps a node locked for as long as a worker is operating on it. The lock must be released when the process finishes or when the worker shuts down or dies.

XiaoqZhang commented 3 years ago

Just to report that I probably ran into this issue while running my work chain.

I am using AiiDA 1.6.5 on Ubuntu 20.04. I had 6 workers running at the same time.

I submitted an Isotherm work chain that failed with the following error message:

$ verdi process report 547

...
2021-10-27 07:40:11 [1005 | REPORT]:   [2850|RaspaBaseWorkChain|run_process]: launching RaspaCalculation<2858> iteration #1
2021-10-27 08:01:16 [1137 | REPORT]:   [2850|RaspaBaseWorkChain|results]: work chain completed after 1 iterations
2021-10-27 08:01:16 [1138 | REPORT]:   [2850|RaspaBaseWorkChain|results]: work chain completed after 1 iterations
2021-10-27 08:01:16 [1139 |  ERROR]:   Traceback (most recent call last):
  File "/home/xiaoqi/aiida1/aiida-core/aiida/manage/external/rmq.py", line 208, in _continue
    result = await super()._continue(communicator, pid, nowait, tag)
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/process_comms.py", line 613, in _continue
    await proc.step_until_terminated()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/processes.py", line 1230, in step_until_terminated
    await self.step()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/processes.py", line 1216, in step
    self.transition_to(next_state)
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 320, in transition_to
    self._enter_next_state(new_state)
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 386, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 299, in _fire_state_event
    callback(self, hook, state)
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/processes.py", line 326, in <lambda>
    lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
  File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/process.py", line 387, in on_entered
    self.update_node_state(self._state)
  File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/process.py", line 610, in update_node_state
    self.update_outputs()
  File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/process.py", line 633, in update_outputs
    output.add_incoming(self.node, LinkType.RETURN, link_label)
  File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/nodes/node.py", line 802, in add_incoming
    self.validate_incoming(source, link_type, link_label)
  File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/nodes/node.py", line 827, in validate_incoming
    validate_link(source, self, link_type, link_label)
  File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/links.py", line 165, in validate_link
    raise ValueError(f'node<{source.uuid}> already has an outgoing {link_type} link with label "{link_label}"')
ValueError: node<7f185abb-cc9b-49b5-a1e4-f0011e655400> already has an outgoing LinkType.RETURN link with label "remote_folder"

2021-10-27 08:01:31 [1140 | REPORT]: [547|IsothermAccurateWorkChain|on_except]: Traceback (most recent call last):
  File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 93, in _get_node_by_link_label
    node = attribute_dict[label]
KeyError: 'output_parameters'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 141, in __getattr__
    return self._get_node_by_link_label(label=name)
  File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 120, in _get_node_by_link_label
    raise NotExistent from exception
aiida.common.exceptions.NotExistent

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/process_states.py", line 231, in execute
    result = self.run_fn(*self.args, **self.kwargs)
  File "/home/xiaoqi/aiida1/aiida-core/aiida/engine/processes/workchains/workchain.py", line 214, in _do_step
    finished, stepper_result = self._stepper.step()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 299, in step
    finished, result = self._child_stepper.step()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 432, in step
    finished, retval = self._child_stepper.step()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 299, in step
    finished, result = self._child_stepper.step()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 432, in step
    finished, retval = self._child_stepper.step()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 299, in step
    finished, result = self._child_stepper.step()
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 527, in step
    if self._while_instruction.is_true(self._workchain):
  File "/home/xiaoqi/miniconda/envs/aiida1/lib/python3.7/site-packages/plumpy/workchains.py", line 395, in is_true
    return self._predicate(workflow)
  File "/home/xiaoqi/aiida1/aiida-lsmo/aiida_lsmo/workchains/isotherm_accurate.py", line 525, in should_run_another_gcmc_highp
    self.ctx.gcmc_loading_average = self._get_last_loading_molkg()
  File "/home/xiaoqi/aiida1/aiida-lsmo/aiida_lsmo/workchains/isotherm_accurate.py", line 470, in _get_last_loading_molkg
    gcmc_out_param = self.ctx.raspa_gcmc[-1].outputs.output_parameters
  File "/home/xiaoqi/aiida1/aiida-core/aiida/orm/utils/managers.py", line 149, in __getattr__
    ) from exception
aiida.common.exceptions.NotExistentAttributeError: Node<2850> does not have an output with link label 'output_parameters'

It appears to me that two workers were trying to attach the same output link to the RaspaBaseWorkChain and the second one failed: ValueError: node<7f185abb-cc9b-49b5-a1e4-f0011e655400> already has an outgoing LinkType.RETURN link with label "remote_folder"

Note also that the message "work chain completed after 1 iterations" appears twice in the report, which should not happen.

sphuber commented 3 years ago

Thanks for the report. Quite a bit has changed since I originally opened this issue to prevent this from happening, and although it should now be unlikely, I don't think we can exclude it. Still, it seems to occur very rarely: I have personally never seen it and we don't get reports of it, though maybe people simply aren't reporting it when they encounter it. It is a tricky problem. I have never gone down the route of implementing some kind of locking mechanism on our side (as suggested in this issue) because doing it properly is very tricky, and I expect we would just create a bunch of new (potentially worse) problems. So I am not sure how best to approach this.
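One concrete reason a lock mechanism can create new problems: any lock held by a worker that may die must carry an expiry, but an expiring lease reintroduces the very race it was meant to prevent. The sketch below (hypothetical names, not AiiDA code) shows a lease-based lock where a worker that is merely slow loses its lock to a second worker:

```python
class LeasedLock:
    """A per-node lock that is considered stale after `ttl` seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.owner = None
        self.acquired_at = None

    def acquire(self, worker_id, now):
        # The lock is free, or its lease has expired: grant it.
        expired = self.owner is not None and now - self.acquired_at > self.ttl
        if self.owner is None or expired:
            self.owner, self.acquired_at = worker_id, now
            return True
        return False


lock = LeasedLock(ttl=5.0)
assert lock.acquire('worker-1', now=0.0)

# worker-1 stalls past the TTL (long transport call, busy event loop, ...);
# the lease looks stale, so worker-2 takes the lock over:
assert lock.acquire('worker-2', now=10.0)

# But worker-1 is still alive and still believes it owns the node, so two
# workers again operate on the same process -- the original problem returns.
print(lock.owner)  # worker-2
```

Choosing the TTL faces the same dilemma as the RabbitMQ heartbeat timeout itself: too short and slow-but-alive workers lose their locks, too long and crashed workers leave nodes locked for a long time.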