aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

WorkChains get excepted because daemons get overwhelmed #5899

Open tsthakur opened 1 year ago

tsthakur commented 1 year ago

Describe the bug

I am not sure whether overwhelmed daemons are the cause, but when I launch ~200 calculations together, they get excepted with an aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed error. It is similar to this issue that I opened previously.

The full error report follows:

(aiida168) tthakur@theospc31:~$ verdi process report 918283
2023-02-03 18:05:18 [564668 | REPORT]: [918283|LinDiffusionWorkChain|setup]: launching WorkChain with pinball coefficients defined by <813418>
2023-02-03 18:05:18 [564669 | REPORT]: [918283|LinDiffusionWorkChain|run_process]: launching ReplayMDWorkChain<918287>
2023-02-03 18:05:21 [564673 | REPORT]:   [918287|ReplayMDWorkChain|run_process]: launching FlipperCalculation<918302> iteration #1
2023-02-09 03:27:35 [572361 | REPORT]:   [918287|ReplayMDWorkChain|report_error_handled]: FlipperCalculation<918302> failed with exit status 312: The stdout output file was incomplete probably because the calculation got interrupted.
2023-02-09 03:27:35 [572362 | REPORT]:   [918287|ReplayMDWorkChain|report_error_handled]: Action taken: Restarting calculation...
2023-02-09 03:27:35 [572363 | REPORT]:   [918287|ReplayMDWorkChain|inspect_process]: FlipperCalculation<918302> failed but a handler dealt with the problem, restarting
2023-02-09 03:27:35 [572364 | REPORT]:   [918287|ReplayMDWorkChain|check_energy_fluctuations]: FlipperCalculation<918302> [check_energy_fluctuations]: Total energy fluctuations = 0.004842710000957595 < threshold (uuid: dd4cc43d-935f-4abe-abc9-d2646a108927 (pk: 918276) value: 180.0) OK
2023-02-09 03:27:35 [572365 | REPORT]:   [918287|ReplayMDWorkChain|update_mdsteps]: FlipperCalculation<918302> ran 109190 steps (109190 done - 890810 to go).
2023-02-09 03:31:09 [572501 | REPORT]:   [918287|ReplayMDWorkChain|run_process]: launching FlipperCalculation<924759> iteration #2
2023-02-14 21:24:05 [637726 |  ERROR]:   Traceback (most recent call last):
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/manage/external/rmq.py", line 208, in _continue
    result = await super()._continue(communicator, pid, nowait, tag)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/process_comms.py", line 607, in _continue
    proc = cast('Process', saved_state.unbundle(self._load_context))
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/persistence.py", line 60, in unbundle
    return Savable.load(self, load_context)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/persistence.py", line 452, in load
    return load_cls.recreate_from(saved_state, load_context)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 239, in recreate_from
    call_with_super_check(process.init)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/utils.py", line 29, in call_with_super_check
    wrapped(*args, **kwargs)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/engine/processes/process.py", line 159, in init
    super().init()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/utils.py", line 16, in wrapper
    wrapped(self, *args, **kwargs)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 298, in init
    identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/communications.py", line 141, in add_rpc_subscriber
    return self._communicator.add_rpc_subscriber(converted, identifier)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 215, in add_rpc_subscriber
    return self._loop_scheduler.await_(
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 159, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 36, in done
    result = done_future.result()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 258, in __step
    result = coro.throw(exc)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in proxy
    return await awaitable
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 482, in add_rpc_subscriber
    identifier = await msg_subscriber.add_rpc_subscriber(subscriber, identifier)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 123, in add_rpc_subscriber
    rpc_queue = await self._channel.declare_queue(exclusive=True, arguments=self._rmq_queue_arguments)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/robust_channel.py", line 173, in declare_queue
    queue = await super().declare_queue(
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/channel.py", line 325, in declare_queue
    await queue.declare(timeout=timeout)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/queue.py", line 92, in declare
    self.declaration_result = await asyncio.wait_for(
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 703, in queue_declare
    return await self.rpc(
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/base.py", line 168, in wrap
    return await self.create_task(func(self, *args, **kwargs))
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/base.py", line 25, in __inner
    return await self.task
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
    future.result()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 121, in rpc
    raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None

2023-02-14 21:24:06 [637729 |  ERROR]: Traceback (most recent call last):
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/manage/external/rmq.py", line 208, in _continue
    result = await super()._continue(communicator, pid, nowait, tag)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/process_comms.py", line 613, in _continue
    await proc.step_until_terminated()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 1230, in step_until_terminated
    await self.step()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 1216, in step
    self.transition_to(next_state)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 320, in transition_to
    self._enter_next_state(new_state)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 386, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 299, in _fire_state_event
    callback(self, hook, state)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 326, in <lambda>
    lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/engine/processes/process.py", line 390, in on_entered
    super().on_entered(from_state)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 700, in on_entered
    self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/communications.py", line 175, in broadcast_send
    return self._communicator.broadcast_send(body, sender, subject, correlation_id)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 258, in broadcast_send
    result = self._loop_scheduler.await_(
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 159, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 36, in done
    result = done_future.result()
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in proxy
    return await awaitable
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 522, in broadcast_send
    result = await publisher.broadcast_send(body, sender, subject, correlation_id)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 66, in broadcast_send
    return await self.publish(message, routing_key=defaults.BROADCAST_TOPIC, mandatory=False)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py", line 209, in publish
    result = await self._exchange.publish(message, routing_key=routing_key, mandatory=mandatory)
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/exchange.py", line 233, in publish
    return await asyncio.wait_for(
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 508, in basic_publish
    async with self.lock:
  File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 90, in lock
    raise ChannelInvalidStateError("%r closed" % self)
aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed
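
Both tracebacks above end in the same exception type with different messages ("writer is None" and "<Channel: "4"> closed"). When many processes except at once, it can help to pull just those final lines out of each report. The following is a minimal sketch, assuming the report text has already been captured as a string (for example from `verdi process report <pk>`); the helper name `final_exceptions` is hypothetical and not part of AiiDA.

```python
import re
from typing import List, Tuple

# Python prints the last line of a traceback unindented, in the form
# "package.module.SomeError: message". Indented "File ..." and "raise ..."
# lines, and timestamped log lines, do not match this pattern.
_EXC_LINE = re.compile(
    r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<msg>.*)$"
)


def final_exceptions(report: str) -> List[Tuple[str, str]]:
    """Return (exception type, message) pairs found in a process report."""
    found = []
    for line in report.splitlines():
        match = _EXC_LINE.match(line)
        if match:
            found.append((match.group("type"), match.group("msg")))
    return found


# Shortened excerpt of the report above, used only to exercise the helper.
sample = (
    "2023-02-14 21:24:05 [637726 |  ERROR]:   Traceback (most recent call last):\n"
    '    raise ChannelInvalidStateError("writer is None")\n'
    "aiormq.exceptions.ChannelInvalidStateError: writer is None\n"
    '    raise ChannelInvalidStateError("%r closed" % self)\n'
    'aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed\n'
)

for exc_type, message in final_exceptions(sample):
    print(exc_type, "->", message)
```

Running the helper over the reports of all excepted processes would show whether every failure is this channel error or whether other exceptions are mixed in.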

Steps to reproduce

Steps to reproduce the behavior:

  1. Launch a lot of WorkChains (>100) within a few hours.
  2. Wait for all the processes to leave the Created state and start running properly.
  3. Optionally restart the daemon, but this is not strictly required.
  4. Most WorkChains will except at this point.
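
Steps 1 and 2 above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the script actually used for this report: `submit_many` is a hypothetical helper, and its `submit` argument is assumed to be `aiida.engine.submit` when run inside a loaded AiiDA profile. It is passed in as a parameter so that the loop can be exercised without a running daemon.

```python
from typing import Any, Callable, Iterable, List


def submit_many(submit: Callable[[Any], Any], builders: Iterable[Any]) -> List[Any]:
    """Submit every builder in turn and return whatever `submit` returns.

    `submit` is assumed to behave like `aiida.engine.submit`: it takes a
    process builder and returns the node of the submitted process.
    """
    nodes = []
    for builder in builders:
        nodes.append(submit(builder))
    return nodes


def fake_submit(builder: Any) -> dict:
    """Stand-in for `aiida.engine.submit` so the sketch runs on its own."""
    return {"submitted": builder}


# 200 placeholder inputs, matching the ~200 calculations mentioned in the report.
placeholder_builders = [f"builder-{i}" for i in range(200)]
submitted = submit_many(fake_submit, placeholder_builders)
print(len(submitted))
```

In a real setup the placeholder strings would be replaced by builders for the WorkChain in question, and `fake_submit` by the engine's own submit function.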

Expected behavior

Nothing unusual should happen: the WorkChains should run normally and not get excepted.

Your environment

Additional context

For some reason I am seeing this issue much more frequently now. It used to happen only rarely, and only if I restarted the daemons. The last time it happened, however, I did not do anything: the WorkChains simply got excepted after I left the machine alone over the weekend. My environment is still the same; only my AiiDA database has become bigger.

ireaml commented 1 year ago

Hi @tsthakur, I'm experiencing the same issue.

My environment details are:

I get a similar report message:

2023-02-21 15:56:14 [3967 | REPORT]: [14092|VaspWorkChain|run_process]: launching VaspCalculation<15027> iteration #1
2023-02-21 16:25:23 [4125 |  ERROR]: Traceback (most recent call last):
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiida/manage/external/rmq/launcher.py", line 90, in _continue
    result = await super()._continue(communicator, pid, nowait, tag)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/process_comms.py", line 604, in _continue
    proc = cast('Process', saved_state.unbundle(self._load_context))
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/persistence.py", line 58, in unbundle
    return Savable.load(self, load_context)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/persistence.py", line 450, in load
    return load_cls.recreate_from(saved_state, load_context)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/processes.py", line 244, in recreate_from
    call_with_super_check(process.init)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/base/utils.py", line 29, in call_with_super_check
    wrapped(*args, **kwargs)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/process.py", line 185, in init
    super().init()
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/base/utils.py", line 16, in wrapper
    wrapped(self, *args, **kwargs)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/processes.py", line 303, in init
    identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/communications.py", line 141, in add_rpc_subscriber
    return self._communicator.add_rpc_subscriber(converted, identifier)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/kiwipy/rmq/threadcomms.py", line 215, in add_rpc_subscriber
    return self._loop_scheduler.await_(
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/pytray/aiothreads.py", line 164, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 234, in __step
    result = coro.throw(exc)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/pytray/aiothreads.py", line 178, in coro
    res = await awaitable
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/kiwipy/rmq/communicator.py", line 482, in add_rpc_subscriber
    identifier = await msg_subscriber.add_rpc_subscriber(subscriber, identifier)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/kiwipy/rmq/communicator.py", line 123, in add_rpc_subscriber
    rpc_queue = await self._channel.declare_queue(exclusive=True, arguments=self._rmq_queue_arguments)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aio_pika/robust_channel.py", line 173, in declare_queue
    queue = await super().declare_queue(
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aio_pika/channel.py", line 325, in declare_queue
    await queue.declare(timeout=timeout)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aio_pika/queue.py", line 92, in declare
    self.declaration_result = await asyncio.wait_for(
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/channel.py", line 703, in queue_declare
    return await self.rpc(
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/base.py", line 168, in wrap
    return await self.create_task(func(self, *args, **kwargs))
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/base.py", line 25, in __inner
    return await self.task
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/futures.py", line 285, in __await__
    yield self  # This tells Task to wait for completion.
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/channel.py", line 121, in rpc
    raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None