zindy closed this issue 6 years ago
Hi Egor,
Good catch! ;)
Here are my 2 cents:
According to the PHP posix_kill docs and the underlying kill(2) man page, an error (failed to kill the process) can be returned when:
EINVAL An invalid signal was specified.
EPERM The process does not have permission to send the signal to any
of the target processes.
ESRCH The process or process group does not exist. Note that an
existing process might be a zombie, a process that has
terminated execution, but has not yet been wait(2)ed for.
Additionally, the kill(2) man page notes:
In 2.6 kernels up to and including 2.6.7, there was a bug that meant
that when sending signals to a process group, kill() failed with the
error EPERM if the caller did not have permission to send the signal
to any (rather than all) of the members of the process group.
Notwithstanding this error return, the signal was still delivered to
all of the processes for which the caller had permission to signal.
Granted, those are old kernels! Still, even when an error is returned, the process may in fact have been killed.
I think we could then check whether the process still exists (e.g. via 'ps'). If it still exists, we may want to keep it in the DB. If it no longer exists, or it is a zombie process, we can remove it from the DB.
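To make the three error cases above concrete, here is a minimal sketch of how a failed posix_kill() could be interpreted. The helper name `interpretKillError` and the returned labels are hypothetical, and the errno numbers are the Linux values, hard-coded as an assumption so the sketch does not depend on any extra extension.

```php
<?php
// Linux errno values, hard-coded here as an assumption (they differ on
// some other platforms).
const ERRNO_EPERM  = 1;
const ERRNO_ESRCH  = 3;
const ERRNO_EINVAL = 22;

/**
 * Hypothetical helper: translate the errno left by a failed posix_kill()
 * into a label describing what the failure implies.
 */
function interpretKillError(int $errno): string
{
    switch ($errno) {
        case ERRNO_ESRCH:
            return 'gone';          // no such process (or process group)
        case ERRNO_EPERM:
            return 'no-permission'; // process exists, we may not signal it
        case ERRNO_EINVAL:
            return 'bad-signal';    // caller passed an invalid signal number
        default:
            return 'unknown';
    }
}

// Possible usage (needs the posix extension):
// if (!posix_kill($pid, 15)) {               // 15 = SIGTERM
//     $why = interpretKillError(posix_get_last_error());
// }
```

Only the 'gone' case would clearly justify dropping the job from the DB; the other cases probably deserve a log entry and a closer look.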
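A rough sketch of such a check, assuming a procps- or BSD-style `ps` is on the PATH and that `exec()` is allowed. The function name `processState` and its three return labels are made up for illustration and are not part of the HRM code.

```php
<?php
/**
 * Hypothetical helper: classify a PID as 'running', 'zombie' or 'gone'
 * by asking ps for the process state code.
 */
function processState(int $pid): string
{
    $output = [];
    $status = 0;
    // "-o stat=" prints only the state column, with no header line.
    exec('ps -o stat= -p ' . (int) $pid . ' 2>/dev/null', $output, $status);

    $stat = isset($output[0]) ? trim($output[0]) : '';
    if ($status !== 0 || $stat === '') {
        return 'gone';              // ps found no such process
    }
    // A state code starting with "Z" marks a zombie (defunct) process.
    return $stat[0] === 'Z' ? 'zombie' : 'running';
}
```

With this, the rule above would read: keep the DB entry when the state is 'running', remove it when it is 'zombie' or 'gone'.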
Also, by looking at your report I think I found a logic mistake in function 'killJobs()'. If 'result' is set to false for one of the jobs then all subsequent jobs in the loop will be set to false too, without even trying to kill them.
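As an illustration of the fix, here is a sketch of a loop that attempts every job and records one result per job instead of sharing a single flag. This is not the actual killJobs() code: `killJobsIndependently` and the `$killOne` callback are hypothetical stand-ins.

```php
<?php
/**
 * Hypothetical stand-in for killJobs(): try every job and report the
 * outcome per job.
 *
 * @param int[]    $ids     job ids to kill
 * @param callable $killOne function (int $id): bool, assumed to kill one job
 * @return array            map of job id => whether the kill succeeded
 */
function killJobsIndependently(array $ids, callable $killOne): array
{
    $results = [];
    foreach ($ids as $id) {
        // Always attempt the kill; an earlier failure must not skip this job.
        $results[$id] = (bool) $killOne($id);
    }
    return $results;
}

// Example with a fake killer that fails only for job 2.
$results = killJobsIndependently([1, 2, 3], function ($id) {
    return $id !== 2;
});
// The overall result can still be derived afterwards if a single flag is needed.
$allKilled = !in_array(false, $results, true);
```

Keeping the per-job results also makes it possible to remove only the jobs that were really killed.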
I really hope we can make this all more robust just in case the new Queue Manager gets delayed.
Thank you, Daniel
Hi Daniel,
Can we check whether the process's PID exists as a first step? That would answer "is it already dead?"
This bit of code on Stack Overflow looks like an easy way to check for the PID.
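The Stack Overflow snippet itself is not reproduced here, so the following is only one common way to do such a check, assuming the posix extension is loaded: send signal 0, which performs the existence and permission checks without delivering anything. The name `pidExists` is hypothetical, and the errno value 1 for EPERM is the Linux one.

```php
<?php
/**
 * Hypothetical helper: true if a process with this PID exists.
 */
function pidExists(int $pid): bool
{
    if ($pid <= 0) {
        return false; // 0 and negative values address process groups, not one PID
    }
    if (posix_kill($pid, 0)) {
        return true;  // signal 0: nothing is sent, only the checks are done
    }
    // EPERM (1 on Linux) means the process exists but we may not signal it.
    return posix_get_last_error() === 1;
}
```

Note that this also returns true for a zombie, since a zombie still occupies its PID, so on its own it may not fully answer "is it already dead?".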
Cheers, Egor
Hi Egor,
Yes, totally! It's rather strange that such a check is not in place yet.
Then, to be very defensive, we could check a second time after the call to posix_kill, simply because the call itself may misbehave (as in the kernel bug I referred to in my previous post).
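The check, kill, check-again idea could be sketched roughly as follows. The function `canRemoveFromQueue` and its two callbacks are hypothetical; in practice `$exists` would be some PID check and `$kill` a wrapper around posix_kill.

```php
<?php
/**
 * Hypothetical flow: decide whether a job's DB entry can be dropped.
 *
 * @param callable $exists function (int $pid): bool, a PID existence check
 * @param callable $kill   function (int $pid): bool, sends the kill signal
 */
function canRemoveFromQueue(int $pid, callable $exists, callable $kill): bool
{
    if (!$exists($pid)) {
        return true;        // already dead, nothing to kill
    }
    $kill($pid);            // return value deliberately not trusted
    return !$exists($pid);  // only the second check decides
}
```

A real implementation would probably need a short wait or a few retries before the second check, since a process may take a moment to exit after receiving the signal.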
Cheers, Daniel
Hi Aaron, Daniel :)
So, here is my bodge!
I had a look at the logs when restarting hrmd and noticed that for each "zombie" job in the queue, an error (Failed killing parent process) was logged. This log entry is generated in inc/shell/ExternalProcess.php.
So there is a question regarding what this actually means: 'is the parent process already dead or can it not be killed?' Being able to differentiate between these two conditions may be important, I don't know.
However, by setting `$noParent = True` (after we log the failure condition for future inspection), the call in inc/job/JobQueue.php line 208, `$killed = $proc->killHucoreProcess($pid);`, results in `$killed = True`: we have just successfully pretended that we did indeed kill the process, and thus it is removed from the queue.
I accept I do not understand the details of how this (should) work(s)!
Kind regards, Egor