flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0
167 stars 50 forks source link

sdexec: take extra measures to ensure cleanup #6011

Open garlick opened 4 months ago

garlick commented 4 months ago

Problem: job-exec's mechanism for terminating a job could leave processes running.

When sdexec is configured, a job's processes are contained in a cgroup so we should be able to do a better job of cleaning up.

grondo commented 4 months ago

I'm wondering if sdexec will need assistance from the IMP here, since the processes running within the cgroup are owned by other users that the system instance user. Probably needs some experimentation to get the right approach.

garlick commented 4 months ago

I'm wondering if sdexec will need assistance from the IMP here, since the processes running within the cgroup are owned by other users that the system instance user. Probably needs some experimentation to get the right approach.

Yeah, I think that may be right.

One observation is that it appears systemd running as the flux user is able to successfully signal the IMP, since presumably it is still running with flux as its real user id. Given that, I wonder if we could arrange for the IMP to forward signals to all members of the control group?

It sounds like a nice cgroup.kill interface was proposed back in 2021 (see https://lwn.net/Articles/855049/). Not sure how far along that is. We could press that button from the IMP when it receives SIGTERM perhaps?

More study needed.

garlick commented 4 months ago

I wonder if we could arrange for the IMP to forward signals to all members of the control group?

Digging deeper, the IMP already forwards signals to the shell which forwards them to its children. The problem is that SIGKILL can't be forwarded. Confirmed by an experiment:

I found that cgroup.kill does exist on my test system. Interestingly, I can press this button manually as the flux user and all processes in the cgroup immediately get SIGKILL. So we don't even need the services of the imp to use it.

Unfortunately it is not present on our toss 4 systems :-(

garlick commented 4 months ago

I guess the IMP could mimic systemd's behavior.

It uses cgroup.kill if available, but falls back to iterating over cgroup.procs.

https://github.com/systemd/systemd/blob/main/src/basic/cgroup-util.c#L312

grondo commented 4 months ago

Yeah, I like that idea.

garlick commented 4 months ago

One note: it appears that the default KillMode on our transient units is control-group, so one would think that merely stopping the unit would sufficient to fully clean up the cgroup. E.g. systemd should be using the code referenced above to do it.

$ sudo flux exec -r 0 systemctl --user show --property KillMode imp-shell-0-f3ox6qkjs9h.service
KillMode=control-group

In reality I think it's sending the SIGTERM to the cgroup, then following up with SIGKILL after 90s to only the main process. We may need to dig a bit deeper into the unit config options described here to figure out if there is a better setup for us:

https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html

garlick commented 4 months ago

Oh,wait. I guess probably the version of systemd I have isn't yet using cgroup.kill so systemd running as flux cant signal the shell running as garlick. The SIGTERMs are forwarded by the IMP, and the SIGKILLs take the IMP out. So back to the original idea of having the IMP iterate over the cgroup. Sorry for the noise!