flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

sdexec: error signaling job tasks causes transient job service unit to exit without removing cgroup #6377

Closed grondo closed 2 weeks ago

grondo commented 1 month ago

On tuolumne, we're seeing sets of drained nodes with 'unkillable processes' even though there are no processes running when admins investigate after the fact.

In one instance, a job was canceled at 11:57 and its nodes were drained after the job-exec timeout at 12:12. Note that this indicates the job-exec module still thought the sdexec-launched subprocesses were active at that time.

On one of the drained nodes, a log for the transient job service unit was obtained from journalctl (note that this has to be run as root, not as the flux user):

Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Main process exited, code=exited, status=137/n/a
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308719 (flux-shell) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308720 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308721 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308722 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308723 (yyyy) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308724 (date) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group `, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308719 (flux-shell) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308720 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308721 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308722 (xxx) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308723 (yyyy) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Killing process 308724 (date) with signal SIGKILL.
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed to kill control group /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service, ignoring: Operation not permitted
Oct 16 11:57:26 tuolumnexxx systemd[83247]: imp-shell-168-fD4g2D7SS5D.service: Failed with result 'exit-code'.
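To make the pattern in the log above easier to see, here is a minimal Python sketch that tallies the two message types in such an excerpt. The function name `summarize_unit_log` and the regular expressions are illustrative assumptions, written against the message text shown above and not taken from flux-core or systemd.

```python
import re
from collections import Counter

# Patterns written against the systemd messages in the excerpt above
# (assumption: the message wording is stable across the lines of interest).
KILL_RE = re.compile(r"Killing process (\d+) \((\S+)\) with signal (\w+)\.")
FAIL_RE = re.compile(r"Failed to kill control group .*: (.+)$")


def summarize_unit_log(lines):
    """Count kill attempts per PID and cgroup-kill failures per reason."""
    kills = Counter()
    failures = Counter()
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            kills[int(m.group(1))] += 1
            continue
        m = FAIL_RE.search(line)
        if m:
            failures[m.group(1).strip()] += 1
    return kills, failures
```

Run over the excerpt above, this would show each of the six PIDs (308719 through 308724) being sent SIGKILL twice, and the cgroup kill failing with "Operation not permitted" each time it was attempted.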

It appears that the unit exited immediately (Failed with result 'exit-code'). Also, the cgroup /user.slice/user-767.slice/user@767.service/imp-shell-168-fD4g2D7SS5D.service remains on the system, even though there are no processes in it.
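For anyone checking a drained node for this state, the following is a rough Python sketch of the "cgroup exists but is empty" test. It assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup; the helper name `cgroup_status` and the `root` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path


def cgroup_status(cgroup_path, root="/sys/fs/cgroup"):
    """Return (exists, pids) for a cgroup path relative to the cgroup root.

    Assumes cgroup v2, where member PIDs are listed one per line in the
    cgroup.procs file of the cgroup's directory.
    """
    cg_dir = Path(root) / cgroup_path.lstrip("/")
    if not cg_dir.is_dir():
        return False, []
    try:
        pids = [int(tok) for tok in (cg_dir / "cgroup.procs").read_text().split()]
    except FileNotFoundError:
        # Directory vanished or is not a real cgroup directory.
        pids = []
    return True, pids
```

A result of `(True, [])` for the unit's path would match the situation described here: the cgroup directory is still present but contains no processes. Reading another user's cgroup files may require root.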

This is probably related to #6011.

Also related: perhaps we need to set TimeoutStopSec to infinity so that systemd will wait until all processes in the cgroup exit before considering the unit stopped/exited.
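As an illustration only of what setting that property on a transient unit looks like, here is a small Python sketch that builds a `systemd-run` command line with `TimeoutStopSec=infinity`. This is not how sdexec starts its units; the helper name `systemd_run_argv`, the unit name, and the command are hypothetical, and whether this setting would actually resolve the problem is untested here.

```python
def systemd_run_argv(unit, command, timeout_stop="infinity"):
    """Build argv for a transient user unit with a custom TimeoutStopSec.

    Hypothetical helper for illustration; it only constructs the argument
    list and does not start anything.
    """
    return [
        "systemd-run",
        "--user",
        f"--unit={unit}",
        # TimeoutStopSec=infinity disables the stop timeout for the unit.
        f"--property=TimeoutStopSec={timeout_stop}",
        *command,
    ]
```

For example, `systemd_run_argv("demo.service", ["sleep", "60"])` yields an argument list that could be passed to `subprocess.run` on a test system to observe how stop behaves with the timeout disabled.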

garlick commented 2 weeks ago

Fixed by #6408 - reopen if not.