It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Try harder to kill local processes when worker stops #576

Kobzol closed 1 year ago

Kobzol commented 1 year ago

With this PR, the worker attempts to send SIGINT (followed by SIGTERM, if they refuse to cooperate) to task (grandchild) processes when it receives a stop command or a catchable signal. Previously, the grandchild processes would simply continue executing.
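For illustration only, here is a minimal Python sketch of the "SIGINT first, then SIGTERM" escalation described above. It is not the HyperQueue implementation (which is written in Rust). The function name `stop_process_group`, the `grace_period` parameter, and the choice to reach grandchildren by signalling a whole process group are all assumptions made for this example.

```python
import os
import signal
import subprocess
import sys


def stop_process_group(proc, grace_period=2.0):
    """Ask a task's process group to stop, escalating if it does not comply.

    Assumes `proc` was started with `start_new_session=True`, so that its
    PID is also its process group ID and descendants that stay in the
    group receive the same signals.

    Returns the name of the last signal that was sent.
    """
    pgid = proc.pid  # equal to the PGID because of start_new_session=True
    os.killpg(pgid, signal.SIGINT)
    try:
        # Note: this waits only for the direct child, not for grandchildren.
        proc.wait(timeout=grace_period)
        return "SIGINT"
    except subprocess.TimeoutExpired:
        # The process did not cooperate, so escalate to SIGTERM.
        os.killpg(pgid, signal.SIGTERM)
        proc.wait()
        return "SIGTERM"


if __name__ == "__main__":
    # Hypothetical task: a child interpreter that just sleeps.
    task = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    print("stopped with", stop_process_group(task))
```

A process that ignores both SIGINT and SIGTERM would make the final `proc.wait()` block forever. A real worker would need a further fallback, which is left out of this sketch.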

This is not bulletproof: if the worker is killed with SIGKILL, the processes might continue to live, but AFAIK this is an unresolvable problem on Linux without elevated privileges (https://lists.kernelnewbies.org/pipermail/kernelnewbies/2012-August/005988.html).

In theory, we could get rid of PR_SET_PDEATHSIG with this PR. That would give us slightly better task startup performance, at the cost of not killing even the main (non-grandchild) processes when the worker is stopped with SIGKILL.
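For readers unfamiliar with PR_SET_PDEATHSIG, the following Linux-only Python sketch shows the general mechanism: the child asks the kernel, via `prctl(2)`, to deliver a signal when its parent dies. It illustrates the kernel feature and is not HyperQueue's code. The helper names `_set_pdeathsig` and `spawn_with_pdeathsig` and the choice of SIGKILL as the death signal are assumptions for this example.

```python
import ctypes
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>

# Handle to the C library already loaded into this process (Linux).
_libc = ctypes.CDLL(None, use_errno=True)


def _set_pdeathsig():
    """Runs in the child between fork() and exec().

    Asks the kernel to send SIGKILL to this process when its parent dies.
    """
    sig = ctypes.c_ulong(int(signal.SIGKILL))
    if _libc.prctl(PR_SET_PDEATHSIG, sig, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def spawn_with_pdeathsig(args, **popen_kwargs):
    """Start `args` with a parent-death signal installed (hypothetical helper)."""
    return subprocess.Popen(args, preexec_fn=_set_pdeathsig, **popen_kwargs)


if __name__ == "__main__":
    # Hypothetical task: a child interpreter that exits immediately.
    task = spawn_with_pdeathsig([sys.executable, "-c", "pass"])
    task.wait()
```

According to the `prctl(2)` manual page, this setting is cleared in the child of a `fork(2)`. That matches the PR's remark that it covers only the main task process, not grandchildren. Note also that Python documents `preexec_fn` as unsafe in multi-threaded programs.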

Best reviewed commit by commit.