Open garlick opened 4 months ago
I'm wondering if sdexec will need assistance from the IMP here, since the processes running within the cgroup are owned by other users that the system instance user. Probably needs some experimentation to get the right approach.
I'm wondering if sdexec will need assistance from the IMP here, since the processes running within the cgroup are owned by other users that the system instance user. Probably needs some experimentation to get the right approach.
Yeah, I think that may be right.
One observation is that it appears systemd running as the flux
user is able to successfully signal the IMP, since presumably it is still running with flux
as its real user id. Given that, I wonder if we could arrange for the IMP to forward signals to all members of the control group?
It sounds like a nice cgroup.kill
interface was proposed back in 2021 (see https://lwn.net/Articles/855049/). Not sure how far along that is. We could press that button from the IMP when it receives SIGTERM perhaps?
More study needed.
I wonder if we could arrange for the IMP to forward signals to all members of the control group?
Digging deeper, the IMP already forwards signals to the shell which forwards them to its children. The problem is that SIGKILL can't be forwarded. Confirmed by an experiment:
systemctl --user stop <unit>
flux-shell
and the shell script continue to runI found that cgroup.kill
does exist on my test system. Interestingly, I can press this button manually as the flux user and all processes in the cgroup immediately get SIGKILL. So we don't even need the services of the imp to use it.
Unfortunately it is not present on our toss 4 systems :-(
I guess the IMP could mimic systemd's behavior.
It uses cgroup.kill
if available, but falls back to iterating over cgroup.procs
.
https://github.com/systemd/systemd/blob/main/src/basic/cgroup-util.c#L312
Yeah, I like that idea.
One note: it appears that the default KillMode
on our transient units is control-group
, so one would think that merely stopping the unit would sufficient to fully clean up the cgroup. E.g. systemd should be using the code referenced above to do it.
$ sudo flux exec -r 0 systemctl --user show --property KillMode imp-shell-0-f3ox6qkjs9h.service
KillMode=control-group
In reality I think it's sending the SIGTERM to the cgroup, then following up with SIGKILL after 90s to only the main process. We may need to dig a bit deeper into the unit config options described here to figure out if there is a better setup for us:
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html
Oh,wait. I guess probably the version of systemd I have isn't yet using cgroup.kill
so systemd running as flux
cant signal the shell running as garlick
. The SIGTERMs are forwarded by the IMP, and the SIGKILLs take the IMP out. So back to the original idea of having the IMP iterate over the cgroup. Sorry for the noise!
Problem: job-exec's mechanism for terminating a job could leave processes running.
When sdexec is configured, a job's processes are contained in a cgroup so we should be able to do a better job of cleaning up.