agoscinski opened 2 months ago
Thanks @agoscinski, I'm writing things down here just so I don't forget.
Agreed, the process doesn't seem to be completely removed from the concurrent queue; it stays there until its exponential back-off mechanism (EBM) finishes, even though it disappears from verdi process list.
One other problem:
Right now, if you first run verdi process kill <pk> without force kill, then press Ctrl+C because it is stuck, and then run verdi process kill <pk> -fk again, it cannot force kill, because a kill signal has already been queued (and is stuck) in the process:
09/30/2024 12:30:13 PM <4187692> aiida.engine.runners: [WARNING] runner received interrupt, process 72 already being killed
WARNING: aiida.engine.runners: runner received interrupt, process 72 already being killed
and says:
raise TransportTaskException(f'kill_calculation failed {max_attempts} times consecutively') from exception
aiida.common.exceptions.TransportTaskException: kill_calculation failed 5 times consecutively
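For context, here is a minimal sketch of how an exponential back-off retry loop of this kind behaves; the names and signature are illustrative only (aiida-core has its own implementation in aiida.engine.utils, and the real signature may differ). After max_attempts consecutive failures it gives up and raises, which is where the "failed 5 times consecutively" error above comes from, and each retry roughly doubles the wait, so a full cycle can take minutes.

```python
import asyncio


async def exponential_backoff_retry(coro_factory, initial_interval=10.0, max_attempts=5):
    """Retry a coroutine with exponentially growing waits between attempts.

    Illustrative sketch only, not the actual aiida-core implementation.
    """
    interval = initial_interval
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts - 1:
                # Analogous to the error above: give up after N consecutive failures.
                raise RuntimeError(f'task failed {max_attempts} times consecutively')
            await asyncio.sleep(interval)  # 10 s, 20 s, 40 s, ... with these defaults
            interval *= 2
```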
The other thing is that verdi process kill <pk> sometimes gets stuck, even though the process is already killed. (As agreed, this should be reported as a separate bug, not handled in this PR.)
raise TransportTaskException(f'kill_calculation failed {max_attempts} times consecutively') from exception
aiida.common.exceptions.TransportTaskException: kill_calculation failed 5 times consecutively
This part is key. If you want to kill a CalcJob that is stuck in the EBM because of a problem with the transport, the verdi process kill command will first call the kill command through the scheduler of the CalcJob, because it wants to actually kill the associated job on the computer (before it kills the CalcJob in AiiDA itself). But if the transport is not working, that kill command over the scheduler will fail, because it requires the transport.
The question is what the code should do in this case. Maybe the EBM should be disabled when calling the scheduler's kill command. But then again, what if the CalcJob was running just fine without transport issues and the user wants to kill it for another reason, and it just so happens that the transport has an issue when performing the kill? Ignoring the failure in that case would be a bad idea. So we would really have to check whether the job was already in the EBM before the kill call, and only then ignore any exceptions and not go into the EBM again.
> So we would really have to check whether the job was already in the EBM before the kill call, and only then ignore any exceptions and not go into the EBM again.
Yes, exactly! I was thinking about this as well. I think this idea makes a lot of sense; I'm going to change this PR in this way and push again.
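A rough sketch of that idea follows; all names here (kill_via_scheduler, TransportError, was_in_ebm) are hypothetical placeholders, not existing aiida-core attributes. The point is that transport failures are only swallowed when the job was already stuck in the EBM before the kill was requested; otherwise the normal retry behaviour is kept.

```python
import asyncio


class TransportError(Exception):
    """Placeholder for a transport failure (stands in for AiiDA's transport exceptions)."""


async def kill_via_scheduler(node):
    """Hypothetical helper: cancels the remote job through the scheduler (needs a transport)."""
    raise TransportError('could not open transport')


async def kill_calculation(node, was_in_ebm: bool):
    """Sketch of the proposed kill logic; all names are placeholders."""
    try:
        # Cancelling the job on the remote computer goes through the scheduler
        # and therefore needs an open transport.
        await kill_via_scheduler(node)
    except TransportError:
        if was_in_ebm:
            # The transport was already known to be broken before the kill was
            # requested: skip the EBM and go straight to killing the AiiDA process.
            return
        # The transport failed unexpectedly: re-raise so the normal retry (EBM)
        # behaviour applies and the remote job is not silently orphaned.
        raise


# Example: a job already stuck in the EBM is killed despite the broken transport.
asyncio.run(kill_calculation(node=None, was_in_ebm=True))
```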
Once done, this PR should address all possibilities:

- CalcJob got stuck in the EBM
  - verdi process kill also gets stuck (transport not available):
    - verdi process kill: TIMEOUT after 5 seconds, and echo "if you want to kill with EBM, do the next line: verdi process kill --timeout 0" (0 for infinite; theoretically it can take up to 5 minutes!)
    - verdi process kill -F: do not even try to open a transport, just kill. In this case the user accepts the consequences: the job might still be running on the HPC, but it is not monitored, retrieved, etc.
  - verdi process kill: transport becomes available, but it gets stuck because the CalcJob is sleeping (--timeout, default 5 seconds)
- CalcJob has no problem (e.g. it was submitted when the connection was available)
  - verdi process kill gets stuck (transport not available)
  - verdi process kill successful (transport is still available)

Don't worry about the tests failing. They require https://github.com/aiidateam/plumpy/pull/288 to function, which I couldn't figure out how to properly hook in as a dependency.
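To illustrate the timeout behaviour described in the list above, here is a minimal, hypothetical sketch (not the actual verdi implementation) of waiting a bounded time for the kill acknowledgement and hinting at the escalation options when it does not arrive; send_kill_message is a stand-in for the real RPC call.

```python
import asyncio


async def send_kill_message(pk: int) -> bool:
    """Stand-in for the RPC that asks the daemon to kill process <pk>."""
    await asyncio.sleep(3600)  # simulate a kill that hangs, e.g. stuck in the EBM
    return True


async def kill_with_timeout(pk: int, timeout: float = 5.0) -> None:
    """Wait at most `timeout` seconds for the kill to be acknowledged.

    A timeout of 0 is treated as 'wait forever', matching the proposed
    `verdi process kill --timeout 0` semantics described above.
    """
    try:
        await asyncio.wait_for(send_kill_message(pk), timeout or None)
        print(f'Process<{pk}> killed')
    except asyncio.TimeoutError:
        print(f'Process<{pk}> did not acknowledge the kill within {timeout} s.')
        print('If it is stuck in the EBM, retry with --timeout 0 or force-kill it.')


asyncio.run(kill_with_timeout(1234))
```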
This allows plumpy to ignore the callback, so it does not get stuck in it but directly transitions the process to killed. The disadvantage is that the job behind the process might not be killed, since we do not wait for the callback. But I think that can be clearly communicated to the user in the help message, and it follows a similar logic to kill -9 <PID>. This PR goes together with the plumpy PR https://github.com/aiidateam/plumpy/pull/288.
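As a toy illustration of that kill -9 analogy (purely illustrative, not the plumpy API): a "soft" kill awaits the clean-up callback and can therefore hang with it, while a "force" kill cancels the callback and transitions the state immediately, accepting that the remote job may be left behind.

```python
import asyncio


async def scheduler_cleanup():
    """Stand-in for the kill callback that talks to the scheduler and may hang."""
    await asyncio.sleep(3600)


async def force_kill(cleanup_task: asyncio.Task) -> str:
    # Do not wait for the clean-up: cancel it and transition straight to 'killed',
    # just as `kill -9 <PID>` gives the process no chance to clean up.
    cleanup_task.cancel()
    return 'killed'


async def main():
    cleanup = asyncio.create_task(scheduler_cleanup())
    state = await force_kill(cleanup)
    print(state)  # 'killed', even though the remote job may still be running


asyncio.run(main())
```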
TODO: As you feared @khsrali, with this solution the worker process will not be killed, so the upload async function still continues; maybe we still need to find a solution that actually sends the message back to aiida-core. It is really weird, I feel like the _stepping part in plumpy is buggy: only the transition function actually returns back to the sender of the message (where verdi process kill <PID> was sent), while only calling the callback calcjobs.tasks.task_kill_job in the _stepping part kills the process occupying the worker (when the exponential back-off is removed from the kill), but it should do both. I also see this bug on my system when just killing local jobs that sleep. Maybe we can check on your machine.
Co-author: @khsrali
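To make the suspected issue concrete, here is a toy sketch (not plumpy's actual code; the step function, the callback, and the reply are all simplified stand-ins) of what a kill handled in the stepping loop arguably should do: first run the kill callback so the task occupying the worker is cancelled, then perform the state transition and acknowledge the message back to the sender.

```python
import asyncio


async def task_kill_job():
    """Stand-in for calcjobs.tasks.task_kill_job: cancels whatever occupies the worker."""
    print('worker task cancelled')


async def reply_to_sender(message: str):
    """Stand-in for acknowledging the kill back to the `verdi process kill` caller."""
    print(f'reply sent: {message}')


async def step_kill():
    # Both actions are needed: without the callback, the upload/retrieve task keeps
    # occupying the worker; without the transition and reply, the `verdi process kill`
    # caller never gets an answer and appears to hang.
    await task_kill_job()
    state = 'killed'
    await reply_to_sender(state)
    return state


asyncio.run(step_kill())
```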