Thanks for the report. Could you provide an answer to the following questions:

- What is the output of verdi config get daemon.worker_process_slots?
- How loaded are the daemon workers? You can run verdi daemon status to check.
- You say that all CalcJobs are paused after the upload failed 5 times. This makes sense, but you never mention that you "play" them again. This is a necessary step after the external computer comes back online. Do you actually play the paused processes again?
Thank you Sebastiaan for the questions.
When you say processes are unreachable, based on what do you conclude that?

I conclude that based on AiiDA's output of Error: Process<510180> is unreachable when I run verdi process kill 510180.
Output of verdi config get daemon.worker_process_slots
(aiida168) tthakur@theospc31:~$ verdi config get daemon.worker_process_slots
200
I see that there are 128 calculations in the RUNNING state right now.
When you reboot the machine, how many daemon workers are you starting?

I always run 16 daemon workers, as I have 16 threads on my machine.
Please note that if I have something like 4 daemon workers running, AiiDA complains that 400% of the worker slots are occupied, even when there are no processes in the RUNNING state. This is why I said that these unreachable processes were hogging resources. But the daemon workers' memory and CPU usage is usually around 0.1%.
You say that all CalcJobs are paused after the upload failed 5 times. This makes sense, but you never mention that you "play" them again. This is a necessary step after the external computer comes back online. Do you actually play the paused processes again?
I think I mentioned at the very beginning of the bug description that the process cannot be played or killed. But yes I could have been clearer.
Running verdi process play <pk> is the first thing I try if a process seems to be "stuck" in the WAITING state for no apparent reason. This usually results in one of the following:

- Error: Process<pk> is unreachable
- Success: played Process<pk>, which is a complete lie, as the process continues to remain in the WAITING state.

Next I try get_manager().get_process_controller() from verdi shell to either continue or play the process. In this case the play command returns an error, while the continue command is executed but nothing happens: the process state continues to be stuck as WAITING.
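For reference, a minimal sketch of what these attempts look like from verdi shell (assuming AiiDA ~1.6; the import path, method signatures and return types may differ between versions, and the pk is just a placeholder):

```python
from aiida.manage.manager import get_manager

pk = 510180  # placeholder pk of a stuck process

controller = get_manager().get_process_controller()

# Ask the daemon worker that supposedly holds the process to play it.
# If no RabbitMQ task/queue exists for this pk, this surfaces as
# kiwipy.UnroutableError('NO_ROUTE', ...), i.e. "Process is unreachable".
controller.play_process(pk)

# Re-create the RabbitMQ task so a daemon worker picks the process up again.
# If a worker still holds the process in memory, this creates a duplicate
# task, which can cause the process to except (see the summary further down).
controller.continue_process(pk)
```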
Following is the error I receive when I run
pm = get_manager().get_process_controller()
pm.play_process(510180)
PublishError                              Traceback (most recent call last)
~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py in rpc_send(self, recipient_id, msg)
    514         publisher = await self.get_message_publisher()
--> 515         response_future = await publisher.rpc_send(recipient_id, msg)
    516         return response_future

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py in rpc_send(self, recipient_id, msg)
     41         message = aio_pika.Message(body=self._encode(msg), reply_to=self._reply_queue.name)
---> 42         published, response_future = await self.publish_expect_response(
     43             message, routing_key=routing_key, mandatory=True

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py in publish_expect_response(self, message, routing_key, mandatory)
    219         self._awaiting_response[correlation_id] = response_future
--> 220         result = await self.publish(message, routing_key=routing_key, mandatory=mandatory)
    221         return result, response_future

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py in publish(self, message, routing_key, mandatory)
    208         """
--> 209         result = await self._exchange.publish(message, routing_key=routing_key, mandatory=mandatory)
    210         return result

~/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/exchange.py in publish(self, message, routing_key, mandatory, immediate, timeout)
    232
--> 233     return await asyncio.wait_for(
...
--> 518             raise kiwipy.UnroutableError(str(exception))
    519
    520     async def broadcast_send(self, body, sender=None, subject=None, correlation_id=None):

UnroutableError: ('NO_ROUTE', '[rpc].510265')
Thanks for the additional info.
I think I mentioned at the very beginning of the bug description that the process cannot be played or killed. But yes I could have been clearer.
You are right, it is just that the behavior you describe is quite unique and not something I have come across before. So far the only cause I have seen for a process not being reachable is that the corresponding RabbitMQ task is missing, and the continue_process trick reliably fixes that because it recreates the task. The fact that you report this isn't working is really surprising.
Please note that if I have something like 4 daemons running, aiida complains that 400% of the workers are occupied, even when there are no processes in RUNNING state. This is why I said that these unreachable processes were hogging resources. But the daemons' memory and CPU usage is usually around 0.1%.
A daemon slot is also occupied by processes that are in the CREATED or WAITING state. When a CalcJob is WAITING, for example, it is waiting for the job on the remote computer to finish. It still needs to be loaded by a daemon worker during this time, because it has to periodically fetch the state from the scheduler to check whether the job is done. So when you "play" a process, it can still be in a WAITING state; the WAITING state doesn't mean a process isn't being played.
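To make the accounting concrete: the warning about worker-slot usage compares the number of active processes (CREATED, WAITING or RUNNING) against workers × worker_process_slots. Below is a hedged sketch of how one might inspect and relieve the pressure (command syntax may differ slightly between AiiDA versions; the numbers are placeholders):

```console
$ verdi daemon status                                # list the workers and their load
$ verdi config get daemon.worker_process_slots
200
$ verdi daemon incr 4                                # start 4 additional daemon workers
$ verdi config set daemon.worker_process_slots 400   # or raise the slots per worker
```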
It is really difficult to debug further over GitHub, so I will contact you on Slack.
After some interactive debugging, the situation seemed to be the following:

- The reporter runs FlipperCalculations as part of heavily nested workchains, and these calculations can retrieve large amounts of data (molecular dynamics trajectories). As a result, the daemon workers are sometimes under heavy load with blocking IO operations and cannot always respond to RPCs. This causes actions like verdi process play to return Process is unreachable.
- The unreachable processes were then revived with the continue_process trick. This however created a duplicate task, causing the process to except.

There is not much we can do about the daemon workers being unresponsive when the plugins they are running are IO-heavy. However, I have submitted a PR https://github.com/aiidateam/aiida-core/pull/5715 that will prevent the calculations from excepting if a duplicate task is erroneously created.
@sphuber is there any general way to handle errors like UnroutableError: ('NO_ROUTE', '[rpc].XXXXXX')?
I have got a bunch of unreachable WAITING processes elsewhere (and unfortunately I have no idea what happened there, it was just too long ago). I just want to get rid of them. Also, verdi process kill does not help.
The problem is that these processes no longer have a task with RabbitMQ. If you don't care about them anymore, nor about their data, you can simply delete them using verdi node delete. If you want to keep them and try to revive them, stop the daemon, then run verdi devel rabbitmq tasks analyze --fix. Then start the daemon and they should start running again.
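Spelled out, the recovery sequence suggested above would look roughly like this (assuming an aiida-core version recent enough to ship the verdi devel rabbitmq command group):

```console
$ verdi daemon stop
$ verdi devel rabbitmq tasks analyze          # report processes that have lost their task
$ verdi devel rabbitmq tasks analyze --fix    # re-create the missing tasks
$ verdi daemon start
```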
Describe the bug
When there are a few hundred AiiDA jobs running on an external cluster and the local machine is rebooted, all the running jobs become unreachable on completion, i.e. they get stuck in the WAITING state and cannot be played, killed or salvaged in any manner. I have tried using

get_manager().get_process_controller().continue_process(pk)

but it doesn't do anything either.

Now there's a second level to this issue. All these unreachable processes still occupy daemon slots, making AiiDA complain that there aren't enough slots available even when there are zero jobs running. Since these stuck jobs cannot be killed, the only way to eliminate them is to delete the nodes, but that is not an option if one wants to salvage the calculations and use their outputs in future calculations. For example, if one is running a PwRelaxWorkChain and only the final scf on the relaxed structure gets stuck, making the entire workchain unreachable, it would be desirable to only rerun the final scf instead of the entire PwRelaxWorkChain. It is even more important to salvage the already completed calculations when the parent workchain is much more complex than a PwRelaxWorkChain.
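For completeness, a hedged sketch of the deletion route mentioned above (the pk is a placeholder; deleting a process node also removes its descendants, so this discards the data and is only an option when nothing needs to be salvaged; the --dry-run flag may not be available in every version):

```console
$ verdi node delete 510180 --dry-run   # preview which nodes would be removed
$ verdi node delete 510180
```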
Steps to reproduce
Steps to reproduce the behavior:
Expected behavior
Your environment
I have had this issue on 2 separate environments.
First environment
Second environment
Additional context
From what I understand, it is the RabbitMQ server that is causing the jobs to become unreachable. A similar issue appears to still be unresolved.