aarpon / hrm

The Huygens Remote Manager is an open-source, efficient, multi-user web-based interface to the Huygens software by Scientific Volume Imaging for parallel batch deconvolutions.
http://huygens-rm.org

Dealing with zombie processes in the queue #2

Closed zindy closed 6 years ago

zindy commented 6 years ago

Hi Aaron, Daniel :)

So, here is my bodge!

I had a look at the logs when restarting hrmd and noticed that for each "zombie" job in the queue, an error (Failed killing parent process) was logged. This message is generated in inc/shell/ExternalProcess.php.

So there is a question regarding what this actually means: 'is the parent process already dead or can it not be killed?' Being able to differentiate between these two conditions may be important, I don't know.

However, by setting $noParent = True (after we log the failure condition for future inspection), the call in inc/job/JobQueue.php, line 208:

$killed = $proc->killHucoreProcess($pid); results in $killed = True. We have just successfully pretended that we did indeed kill the process, and the job is thus removed from the queue.

I accept I do not understand the details of how this (should) work(s)!

Kind regards, Egor

danielsevilla commented 6 years ago

Hi Egor,

Good catch! ;)

Here are my 2 cents:

According to the PHP posix_kill docs and the underlying kill(2) man page, an error (failed to kill the process) can be returned when:

   EINVAL An invalid signal was specified.

   EPERM  The process does not have permission to send the signal to any
          of the target processes.

   ESRCH  The process or process group does not exist.  Note that an
          existing process might be a zombie, a process that has
          terminated execution, but has not yet been wait(2)ed for.

Additionally,

   In 2.6 kernels up to and including 2.6.7, there was a bug that meant
   that when sending signals to a process group, kill() failed with the
   error EPERM if the caller did not have permission to send the signal
   to any (rather than all) of the members of the process group.
   Notwithstanding this error return, the signal was still delivered to
   all of the processes for which the caller had permission to signal.

Granted, those are old kernels! But anyway, even when an error is returned we could have the situation that the process has indeed been killed.
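To make the three error cases above concrete, here is a minimal sketch of how they could be told apart. It is written in Python rather than PHP purely for illustration; `try_kill` and its return strings are hypothetical names, not part of HRM, and the `kill` parameter exists only so the logic can be exercised without touching real processes. A PHP port would presumably inspect `posix_get_last_error()` after `posix_kill()` instead of catching an exception.

```python
import errno
import os
import signal


def try_kill(pid, sig=signal.SIGKILL, kill=os.kill):
    """Send `sig` to `pid` and report what happened as a short string."""
    try:
        kill(pid, sig)
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return "already-gone"    # no such process: nothing left to kill
        if exc.errno == errno.EPERM:
            return "not-permitted"   # process exists, but we may not signal it
        if exc.errno == errno.EINVAL:
            return "bad-signal"      # caller bug: invalid signal number
        raise                        # anything else is unexpected
    return "signalled"
```

With this split, only "already-gone" would justify dropping the job from the queue straight away; "not-permitted" means the process may well still be running.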

I think we could then check whether the process still exists (e.g. via 'ps'). If it still exists, we may want to keep it in the DB. If it no longer exists, or it is a zombie process, we can remove it from the DB.
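As a rough sketch of that check, the snippet below reads the process state from `/proc/<pid>/stat` instead of calling `ps`. This is Linux-only and written in Python for illustration, not in PHP. The names `parse_stat_state`, `read_stat` and `can_drop_job` are hypothetical and do not exist in HRM. On a system without `/proc`, every process would look "missing", so this must not be used there as-is.

```python
def parse_stat_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line (e.g. 'Z')."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    return stat_line.rsplit(")", 1)[1].split()[0]


def read_stat(pid):
    """Read /proc/<pid>/stat, or return None if the process is gone."""
    try:
        with open(f"/proc/{pid}/stat") as handle:
            return handle.read()
    except (FileNotFoundError, ProcessLookupError):
        return None


def can_drop_job(pid, reader=read_stat):
    """True if the process is missing or a zombie, i.e. safe to dequeue."""
    stat_line = reader(pid)
    if stat_line is None:
        return True                      # no such process any more
    return parse_stat_state(stat_line) == "Z"
```

A job whose process is in any other state (running, sleeping, stopped) would stay in the DB.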

Also, by looking at your report I think I found a logic mistake in function 'killJobs()'. If 'result' is set to false for one of the jobs then all subsequent jobs in the loop will be set to false too, without even trying to kill them.
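The actual code of killJobs() is not reproduced in this thread, so the following is only a hypothetical Python sketch of the intended behaviour: every job gets its own kill attempt, and the overall result is computed afterwards. `kill_jobs` and `kill_one` are made-up names for illustration.

```python
def kill_jobs(pids, kill_one):
    """Try to kill every PID; return (all_ok, per-PID results)."""
    results = {}
    for pid in pids:
        # Always attempt the kill; one failure must not skip the others.
        results[pid] = bool(kill_one(pid))
    return all(results.values()), results
```

Keeping the per-job results also makes it possible to report exactly which jobs could not be killed.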

I really hope we can make this all more robust just in case the new Queue Manager gets delayed.

Thank you, Daniel

zindy commented 6 years ago

Hi Daniel,

Can we check if the process's PID exists as a first step? That would answer "is it already dead?"

This bit of code on Stack Overflow looks like an easy way to check for the PID.

Cheers, Egor

danielsevilla commented 6 years ago

Hi Egor,

Yes, totally! It's rather strange that such a check is not in place yet.

Then, being very defensive, we could check a second time after the call to posix_kill, simply because it may well misbehave (as in the kernel bug I referred to in my previous post).
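A minimal sketch of that "check, kill, check again" idea, in Python for illustration (POSIX only). It probes the PID with signal 0, which performs the existence and permission checks without delivering a signal. `pid_exists` and `kill_and_verify` are hypothetical names, and the `kill` parameter is only there so the logic can be tested with a fake. Two caveats: a zombie still counts as existing for this probe, and a freshly killed process may need a moment to disappear, so real code would probably retry the second check after a short delay.

```python
import os
import signal


def pid_exists(pid, kill=os.kill):
    """Probe `pid` with signal 0, which checks existence without signalling."""
    try:
        kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # exists, but belongs to someone we may not signal
    return True


def kill_and_verify(pid, kill=os.kill):
    """Return True if `pid` is gone afterwards, whatever kill() reported."""
    if not pid_exists(pid, kill):
        return True               # already dead: nothing to do
    try:
        kill(pid, signal.SIGKILL)
    except OSError:
        pass                      # do not trust the error; re-check below
    return not pid_exists(pid, kill)
```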

Cheers, Daniel