Processes become unreachable when rebooting local machine

aiidateam / aiida-core

The official repository for the AiiDA code

https://aiida-core.readthedocs.io

Other

430 stars 186 forks source link

Processes become unreachable when rebooting local machine #5699

Closed tsthakur closed 1 year ago

tsthakur commented 1 year ago

Describe the bug

When there are a few hundred aiida jobs running on external cluster and the local machine is rebooted all the running jobs become unreachable on completion, i.e. they get stuck in waiting state and cannot be played or killed or salvaged in any manner. I have tried using get_manager().get_process_controller().continue_process(pk) but it doesn't do anything either.

Now there's a second level to this issue. All these unreachable processes still occupy the daemon slots, making aiida complain that there aren't enough daemons available even when there are zero jobs running. Since these stuck jobs cannot be killed, the only way to eliminate them is to delete the nodes, but that is not an option if one wants to salvage the calculations and use their output in future calculations. For example - if one is running a PwRelaxWorkChain and only the final scf on relaxed structure gets stuck, making the entire workchain unreachable, it would be desirable to only run the final scf instead of running the entire PwRelaxWorkChain. It is even more important to salvage the already completed calculations when the mother workchain is much more complex than PwRelaxWorkChain.

Steps to reproduce

Steps to reproduce the behavior:

Start a few hundred jobs on an external cluster.
While the jobs are still running, wait for the cluster to get shut down or have a problem that makes the cluster unavailable, which happens on Eiger and Lumi rather frequently.
Now that all the jobs are stuck in waiting state after multiple failed uploads due to downed cluster, reboot the machine. Sometimes but not always just restarting the daemon is sufficient instead of rebooting. Restarting the rabbitmq server is always sufficient to cause this.
Every single calculation including all their parent workflows become unreachable now.

Expected behavior

Reboot should have no impact on the state of calculations.
More importantly the uncreachable processes should not be hogging any resources.
There should be a way to salvage the part of the workchain that finished properly, instead of restarting from beginning.

Your environment

I have had this issue on 2 separate environments.

First environment

Operating system [e.g. Linux]: Ubuntu 20.04
Python version [e.g. 3.7.1]: 3.8.13
aiida-core version [e.g. 1.2.1]: 1.6.4
RabbitMQ: 3.7.28
PostgreSQL: 14.5

Second environment

Operating system [e.g. Linux]: Ubuntu 22.04
Python version [e.g. 3.7.1]: 3.9.13
aiida-core version [e.g. 1.2.1]: 1.6.8
RabbitMQ: 3.7.28
PostgreSQL: 14.5

Additional context

From what I understand it is the rabbitmq server that's causing the jobs to become unreachable. A similar issue seems to be still unresolved.

sphuber commented 1 year ago

Thanks for the report. Could you provide an answer to the following questions:

When you say processes are unreachable, based one what do you conclude that?
Output of verdi config get daemon.worker_process_slots
When you reboot the machine, how many daemon workers are you starting? Run verdi daemon status to check.

You say that all CalcJobs are paused after the upload failed 5 times. This makes sense, but you never mention that you "play" them again. This is a necessary step after the external computer comes back online. Do you actually play the paused processes again?

tsthakur commented 1 year ago

Thank you Sebastiaan for the questions.

When you say processes are unreachable, based one what do you conclude that? I conclude that based on aiida's ouput of Error: Process<510180> is unreachable when I run verdi process kill 510180
Output of verdi config get daemon.worker_process_slots (aiida168) tthakur@theospc31:~$ verdi config get daemon.worker_process_slots 200 I see that there are 128 calculations in RUNNING state right now.
When you reboot the machine, how many daemon workers are you starting? I always run 16 daemon workers, as I have 16 threads on my machine.

Please note that if I have something like 4 daemons running, aiida complains that 400% of the workers are occupied, even when there are no processes in RUNNING state. This is why I said that these unreachable processes were hogging resources. But the daemons' memory and CPU usage is usually around 0.1%.

You say that all CalcJobs are paused after the upload failed 5 times. This makes sense, but you never mention that you "play" them again. This is a necessary step after the external computer comes back online. Do you actually play the paused processes again?

I think I mentioned at the very beginning of the bug description that the process cannot be played or killed. But yes I could have been clearer. Running verdi process play pk is the first thing I try if a process seems to be "stuck" in WAITING state for no apparent reason. This usually results in 3 things -

Error: Process<pk> is unreachable
Success: played Process<pk> which is a complete lie as the process continues to remain in the WAITING state.
Nothing happens i.e. the bash command itself gets stuck.

Next I try get_manager().get_process_controller() from verdi shell to either continue or play process. In this case the play command returns an error while the continue command is executed but nothing happens, the process state continues to be stuck as WAITING.

tsthakur commented 1 year ago

Following is the error I receive when I run

pm = get_manager().get_process_controller() pm.play_process(510180)

PublishError Traceback (most recent call last) ~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py in rpc_send(self, recipient_id, msg) 514 publisher = await self.get_message_publisher() --> 515 response_future = await publisher.rpc_send(recipient_id, msg) 516 return response_future

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py in rpc_send(self, recipient_id, msg) 41 message = aio_pika.Message(body=self._encode(msg), reply_to=self._reply_queue.name) ---> 42 published, response_future = await self.publish_expect_response( 43 message, routing_key=routing_key, mandatory=True

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py in publish_expect_response(self, message, routing_key, mandatory) 219 self._awaiting_response[correlation_id] = response_future --> 220 result = await self.publish(message, routing_key=routing_key, mandatory=mandatory) 221 return result, response_future

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py in publish(self, message, routing_key, mandatory) 208 """ --> 209 result = await self._exchange.publish(message, routing_key=routing_key, mandatory=mandatory) 210 return result

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/exchange.py in publish(self, message, routing_key, mandatory, immediate, timeout) 232 --> 233 return await asyncio.wait_for( ... --> 518 raise kiwipy.UnroutableError(str(exception)) 519 520 async def broadcast_send(self, body, sender=None, subject=None, correlation_id=None):

UnroutableError: ('NO_ROUTE', '[rpc].510265')

sphuber commented 1 year ago

Thanks for the additional info.

I think I mentioned at the very beginning of the bug description that the process cannot be played or killed. But yes I could have been clearer.

You are right, it is just that the behavior you describe is quite unique and not something I have come across. The only cause so far that I have come across for a process not being reachable is that the corresponding RabbitMQ task is missing. So far doing the continue_process trick reliably fixes the problem as it recreates the task. The fact that you report this isn't working is really surprising.

Please note that if I have something like 4 daemons running, aiida complains that 400% of the workers are occupied, even when there are no processes in RUNNING state. This is why I said that these unreachable processes were hogging resources. But the daemons' memory and CPU usage is usually around 0.1%.

A daemon slot is occupied also by processes that are in the CREATED or WAITING state. When a CalcJob is WAITING for example, it is waiting for the job on the computer to finish. It still needs to be with the daemon during this time, because it needs to temporarily fetch the state from the scheduler to check whether it is done. So when you "play" a process, it can still be in a WAITING state. The WAITING state doesn't mean a process isn't being played.

It is really difficult to debug further over github, so I will contact you on slack.

sphuber commented 1 year ago

After some interactive debugging, the situation seemed to be the following:

@tsthakur is running many FlipperCalculations as part of heavily nested workchains, and these calculations can retrieve large amounts of data (molecular dynamics trajectories). As a result, the daemon workers are sometimes under heavy load with blocking IO operations and cannot always respond to RPCs. This results in actions like verdi process play to return Process is unreachable.
The user interpreted the last message as the task having been lost and recreated it with the continue_process trick. This however created a duplicate task causing the process to be excepted.

There is not much we can do about the daemon workers being unresponsive when the plugins they are running are IO-heavy. However, I have submitted a PR https://github.com/aiidateam/aiida-core/pull/5715 that will prevent the calculations from excepting if a duplicate task is erroneously created.

blokhin commented 11 months ago

@sphuber is there any general way to handle errors like UnroutableError: ('NO_ROUTE', '[rpc].XXXXXX')?

I have got a bunch of unreachable Waiting processes elsewhere (and unfortunately I even have no idea what's happened there, it was just too much time ago). Just wanted to get rid of them. Also, verdi process kill does not help.

sphuber commented 11 months ago

The problem is that these processes no longer have a task with rabbitmq. If you don't care about them anymore, nor the data, you can simply delete them using verdi node delete. If you want to keep them and try to revive them, stop the daemon, then run verdi devel rabbitmq tasks analyze --fix. Then start the daemon and they should start again