aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

`verdi process kill` takes a long time for erroneous processes #6524

Open GeigerJ2 opened 2 months ago

GeigerJ2 commented 2 months ago

This was noted by @khsrali and me when trying out the new FirecREST implementation. It is maybe not the most general example, but this is where it occurred for me: say one installs aiida-firecrest and submits a job using it, but forgets to run `verdi daemon restart` beforehand, so the requested transport plugin is not available to the daemon workers. `verdi daemon logshow` shows the expected exception (though the exact exception doesn't actually matter):

```python
Error: iteration 1 of do_upload excepted, retrying after 20 seconds
Traceback (most recent call last):
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/orm/authinfos.py", line 175, in get_transport
    transport_class = TransportFactory(transport_type)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/factories.py", line 432, in TransportFactory
    entry_point = BaseFactory(entry_point_group, entry_point_name, load=load)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/factories.py", line 75, in BaseFactory
    return load_entry_point(group, name)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/entry_point.py", line 276, in load_entry_point
    entry_point = get_entry_point(group, name)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/entry_point.py", line 324, in get_entry_point
    raise MissingEntryPointError(f"Entry point '{name}' not found in group '{group}'")
aiida.common.exceptions.MissingEntryPointError: Entry point 'firecrest' not found in group 'aiida.transports'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py", line 203, in exponential_backoff_retry
    result = await coro()
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/calcjobs/tasks.py", line 85, in do_upload
    with transport_queue.request_transport(authinfo) as request:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/transports.py", line 78, in request_transport
    transport = authinfo.get_transport()
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/orm/authinfos.py", line 177, in get_transport
    raise exceptions.ConfigurationError(f'transport type `{transport_type}` could not be loaded: {exception}')
aiida.common.exceptions.ConfigurationError: transport type `firecrest` could not be loaded: Entry point 'firecrest' not found in group 'aiida.transports'
```

and the job is stuck in the state

⏵ Waiting Waiting for transport task: upload

as expected. (though, actually, here it could also show that the job excepted, rather than being stuck in the waiting state?)
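For what it's worth, the failing lookup can be reproduced in isolation, independently of the daemon: loading the transport class through the same factory that appears in the traceback above raises the same `MissingEntryPointError` when the `firecrest` entry point is not visible to the current environment. A minimal sketch (the `try`/`except` reporting is just for illustration):

```python
# Check that the `firecrest` transport entry point can be loaded in the
# current environment; this is the same lookup that fails inside the
# daemon worker in the traceback above.
from aiida.plugins import TransportFactory

try:
    transport_class = TransportFactory('firecrest')
    print(f'OK, loaded {transport_class}')
except Exception as exc:  # MissingEntryPointError if the plugin is not installed/registered
    print(f'Transport plugin not available: {exc}')
```

Alternatively, `verdi plugin list aiida.transports` shows which transport plugins are currently registered, although a daemon worker started before the installation still needs a restart to pick them up.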

Now, running `verdi process kill` leads to the command being stuck for minutes on end, while `verdi daemon logshow` gives the following output:

```python
plumpy.process_states.KillInterruption: Killed through `verdi process kill`
Task exception was never retrieved
future: .execute_coroutine() done, defined at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py:132> exception=KillInterruption('Killed through `verdi process kill`')>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py", line 142, in execute_coroutine
    future.result(),
plumpy.process_states.KillInterruption: Killed through `verdi process kill`
```

showing that plumpy reports that it has killed the process, but it's not being picked up by the daemon. Opening this issue for future reference. Please correct me if I got something wrong, @khsrali, or if you have another, simpler and more general example where this occurs.

EDIT: To add here, running `verdi daemon stop` then leads to an update in `verdi daemon logshow`:

```python
Exception in callback Process.kill..done(<_GatheringFu...celledError()>) at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py:369
handle: .done(<_GatheringFu...celledError()>) at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py:369>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py", line 370, in done
    is_all_killed = all(done_future.result())
asyncio.exceptions.CancelledError
```

and starting the daemon again then leads to the process actually entering the QUEUED state (rather than being killed), so the verdi process kill command is somewhat lost.

sphuber commented 2 months ago

> though, actually, here it could also show that the job excepted, rather than being stuck in the waiting state?

I think it should still say that it is in the waiting state, although it would be better if it was updated with the number of failures. So ideally it would show:

⏵ Waiting         Waiting for transport task: upload (failed 2 times)

or something like that. Once it hits the maximum number of retries (5 by default), it will go into the paused state. The reason for this functionality is that transient problems, such as temporary connection failures, won't fully except the calcjob. Once the problem is fixed, it can be resumed again.
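For readers less familiar with the engine internals: the transport tasks are wrapped in `exponential_backoff_retry` (visible in the traceback above, in `aiida/engine/utils.py`). A rough sketch of the idea, not the actual implementation (the interval handling and signature here are simplified assumptions):

```python
import asyncio

async def exponential_backoff_retry(coro_factory, initial_interval=20.0, max_attempts=5):
    """Sketch of the retry idea: re-run the task with doubling waits between
    attempts; only after the last attempt does the failure propagate (in
    AiiDA the calcjob is then paused rather than excepted)."""
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts:
                raise  # AiiDA pauses the process here instead of letting it except
            await asyncio.sleep(interval)  # this sleep is what a kill currently has to sit out
            interval *= 2
```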

> and starting the daemon again then leads to the process actually entering the QUEUED state (rather than being killed), so the verdi process kill command is somewhat lost.

The `verdi process kill` command performs a Remote Procedure Call (RPC), which means it is trying to execute a method of the `Process`. Since that process is running inside a daemon worker (which is a different system process), this needs to be done as an RPC over RabbitMQ. It only works if that `Process` is actually running inside a daemon worker, which is the reason why the daemon needs to be running for `verdi process kill` to work. This is why the kill command seems "to get lost" when the daemon is restarted.
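Conceptually, the client side does something like the following. This is a hedged sketch, not the exact `verdi process kill` code path; the method names (`get_process_controller`, `kill_process`) are assumptions based on plumpy's remote process controller and may differ between versions:

```python
# Sketch: send a kill RPC over RabbitMQ to whatever daemon worker currently
# runs the process. If no worker holds the process, or the worker never gets
# around to handling the RPC, the returned future is never resolved and the
# command appears to hang.
from aiida import load_profile
from aiida.manage import get_manager

load_profile()
controller = get_manager().get_process_controller()
future = controller.kill_process(1234, msg='Killed through `verdi process kill`')  # 1234: hypothetical pk
print(future.result(timeout=10))  # blocks until the worker acknowledges the kill
```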

The real problem I suspect here is in the handling of the Process.kill RPC call. https://github.com/aiidateam/aiida-core/blob/a5da4eda131f844c3639bdb01a256b9e9a7873a2/src/aiida/engine/processes/process.py#L332

Since the process is already waiting for the next try of the failed upload task, the RPC to call kill is not getting through. The `verdi process kill` command waits for the RPC to be completed by default, so if that call is not going through, it seems to be hanging. I think one thing we should already fix is that the kill procedure for `CalcJob` should be subclassed from the `Process` base class. If the `CalcJob` is in the waiting state with a failed attempt of one of the transport tasks and it gets a kill command, it should just kill immediately and not wait for the next attempt.
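To illustrate the suggestion with a toy example (plain asyncio, not aiida-core code): if the kill handler simply awaits the ongoing task, it sits out the remaining backoff interval, whereas cancelling the pending retry returns immediately.

```python
import asyncio

async def upload_with_retries():
    """Stand-in for the failed upload task: it is asleep until the next retry."""
    while True:
        await asyncio.sleep(20)  # backoff interval before the next attempt

async def kill_by_waiting(task):
    await task  # blocks until the retry loop finishes: effectively hangs

async def kill_by_interrupting(task):
    task.cancel()  # interrupt the pending retry right away
    try:
        await task
    except asyncio.CancelledError:
        pass

async def main():
    task = asyncio.create_task(upload_with_retries())
    await asyncio.sleep(0)  # let the task start and hit its first sleep
    # Using kill_by_waiting(task) here would block for the full backoff (and beyond);
    # interrupting the pending sleep kills the task immediately.
    await asyncio.wait_for(kill_by_interrupting(task), timeout=1)
    print('killed immediately')

asyncio.run(main())
```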

GeigerJ2 commented 1 month ago

Thanks for the clarification, @sphuber, and sorry for the slow reply.

> I think it should still say that it is in the waiting state, although it would be better if it was updated with the number of failures. So ideally it would show:
>
> ⏵ Waiting         Waiting for transport task: upload (failed 2 times)
>
> or something like that. Once it hits the maximum number of retries (5 by default), it will go into the paused state. The reason for this functionality is that transient problems, such as temporary connection failures, won't fully except the calcjob. Once the problem is fixed, it can be resumed again.

Actually, that makes sense, yes.

> The `verdi process kill` command performs a Remote Procedure Call (RPC), which means it is trying to execute a method of the `Process`. Since that process is running inside a daemon worker (which is a different system process), this needs to be done as an RPC over RabbitMQ. It only works if that `Process` is actually running inside a daemon worker, which is the reason why the daemon needs to be running for `verdi process kill` to work. This is why the kill command seems "to get lost" when the daemon is restarted.

Pinging @khsrali here as one example of how and why RMQ is used, interesting!

> The real problem I suspect here is in the handling of the Process.kill RPC call.
>
> https://github.com/aiidateam/aiida-core/blob/a5da4eda131f844c3639bdb01a256b9e9a7873a2/src/aiida/engine/processes/process.py#L332
>
> Since the process is already waiting for the next try of the failed upload task, the RPC to call kill is not getting through. The `verdi process kill` command waits for the RPC to be completed by default, so if that call is not going through, it seems to be hanging. I think one thing we should already fix is that the kill procedure for `CalcJob` should be subclassed from the `Process` base class. If the `CalcJob` is in the waiting state with a failed attempt of one of the transport tasks and it gets a kill command, it should just kill immediately and not wait for the next attempt.

Will look into this!