On tuolumne, we're seeing sets of drained nodes with 'unkillable processes' even though there are no processes running when admins investigate after the fact.
In one instance, a job was canceled at 11:57 and nodes were drained after the job-exec timeout at 12:12. Note that this indicates the job-exec module still thought the sdexec-launched subprocesses were active at that time.
On one of the drained nodes, a log for the transient job service unit was obtained from journalctl (note that you run this as root, not as the flux user):
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Main process exited, code=exited, status=137/n/a
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308719 (flux-shell) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308720 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308721 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308722 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308723 (yyyy) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308724 (date) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group `, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308719 (flux-shell) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308720 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308721 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308722 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308723 (yyyy) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308724 (date) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed with result 'exit-code'.
It appears that the unit exited immediately (Failed with result 'exit-code'). Also, the cgroup /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service remains on the system, even though there are no processes in it.
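For anyone checking other drained nodes, the "cgroup remains but is empty" observation could be confirmed with something like the sketch below. It assumes a unified cgroup v2 hierarchy mounted at /sys/fs/cgroup; the helper name `inspect_cgroup` and its return format are made up for illustration and are not part of Flux or systemd.

```python
from pathlib import Path


def inspect_cgroup(cgroup_path, root="/sys/fs/cgroup"):
    """Report whether a cgroup v2 directory exists and which PIDs it holds.

    Assumption: cgroup v2 (unified hierarchy) mounted at `root`.
    """
    cg = Path(root) / cgroup_path.lstrip("/")
    if not cg.is_dir():
        return {"exists": False, "pids": [], "populated": None}

    # cgroup.procs lists only the PIDs that are directly in this cgroup.
    procs = cg / "cgroup.procs"
    pids = [int(p) for p in procs.read_text().split()] if procs.is_file() else []

    # cgroup.events carries "populated 0|1", which also covers descendants.
    populated = None
    events = cg / "cgroup.events"
    if events.is_file():
        for line in events.read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "populated":
                populated = value.strip() == "1"

    return {"exists": True, "pids": pids, "populated": populated}


if __name__ == "__main__":
    # Path taken from the log above; run as root on the affected node.
    unit_cgroup = (
        "/user.slice/user-767.slice/user@767.service/"
        "imp-shell-168-fD4g2D7SS5D.service"
    )
    print(inspect_cgroup(unit_cgroup))
```

A result with `exists` true, an empty `pids` list, and `populated` false would match what was seen here: the directory is left behind with nothing in it.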
This is probably related to #6011.
Also related: perhaps we need to set TimeoutStopSec to infinity so that systemd will wait until all processes in the cgroup exit before considering the unit stopped/exited.
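To try the TimeoutStopSec idea by hand outside of Flux, a transient user unit with that property could be started via systemd-run. The sketch below only builds the command line. The unit name and the sleep command are placeholders, and this is not how sdexec itself starts units, so treat it as a reproduction aid rather than a proposed fix.

```python
import shlex


def build_systemd_run_argv(unit, command, timeout_stop="infinity"):
    """Build a systemd-run command line for a transient user service.

    Assumption: run by hand as the instance owner; the helper name and
    defaults are hypothetical, chosen only for this experiment.
    """
    return [
        "systemd-run",
        "--user",
        f"--unit={unit}",
        f"--property=TimeoutStopSec={timeout_stop}",
        "--",
        *command,
    ]


if __name__ == "__main__":
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(build_systemd_run_argv("timeout-test", ["sleep", "600"])))
```

Whether this property changes the "Failed to kill control group ... Operation not permitted" behavior seen in the log is exactly what would need to be tested.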