Open RuRo opened 3 months ago
nix-daemon
uses a forking model where the lifetime of the connection is coupled with long-running operations, both from the client as well as server's perspective of "operations"
So KillMode=process
is intentional, and while it's not the only possible design for a system Nix service, changing it would require a significant architectural change.
I don't quite understand your explanation. KillMode
is supposed to control what happens when somebody (or something) requests for the nix-daemon
service to be stopped. For example, this can happen if the user runs
sudo systemctl stop nix-daemon.service
or if the system is currently running out of memory and the user has configured systemd-oomd
in such a way that it decided that nix-daemon.service
should be killed to relieve memory pressure.
It seems to me that in these circumstances, terminating all current builds is exactly what should happen. nix-daemon
runs its local build processes as direct children, in the same slice and (by default) in the same cgroup.
I don't think that it makes sense for systemctl stop nix-daemon.service
to keep the builds going as orphan processes. Are the builders even capable of storing their results to /nix/store
after they finish if no nix-daemon
is running at that point?
Isn't the stop behavior also used when updating the unit, at least on NixOS? In those cases it's usually not required to stop everything; that would be disruptive, as I described, although it'd be wise to gracefully restart any clients. It's certainly not without flaws.
Are the builders even capable of storing their results to
/nix/store
after they finish if nonix-daemon
is running at that point?
I haven't encountered any problems that would suggest that this breaks. As far as I can tell, the kernel and/or systemd don't release resources until the last worker is gone.
Some small changes we could make to improve matters:
Isn't the stop behavior also used when updating the unit, at least on NixOS?
Hm, I was under the impression that there was some way to support reloading services without killing them by using ExecReload
and SIGHUP
, but that might only be relevant when the configuration of the program itself (ie nix.conf
) is changed (systemctl reload
), not when the service itself (ie nix-daemon.service
) is changed (systemctl restart/daemon-reload
).
Killing all builds due to a restart might indeed be undesirable, I haven't thought of that. I wonder if there is some way to distinguish between these two cases.
Nix could also do something systemd does to update the PID 1: it could serialize its state and re-execute itself (or let the service manager do that). For that, the socket will need to be kept alive by systemd and passed via a file descriptor. But I don't think this works with long-running connections.
We can't serialize the worker/socket bidirectional pipes, but the protocol could be updated to allow the daemon worker to initiate a reset. Clients that support this can then transparently reopen the connection, so that they get a freshly configured worker for their next operations.
For those connections to complete their current operation without error, it requires that pre-existing worker processes in the slice are not terminated on restart. This can be configured in systemd.
It does have the side effect that defunct process of some kind can stick around indefinitely, but this is actually a different problem that should be solved separately and not rely on regular systemctl restart
s.
Describe the bug
The
nix-daemon.service.in
file (which is used in NixOS to generate/etc/systemd/system/nix-daemon.service
) setsKillMode=process
.Quoting from the
systemd.kill
man page (emphasis mine):Is there a good reason, why
KillMode=process
is used here? I think that usingcontrol-group
ormixed
seems more appropriate fornix-daemon
.I suspect, that this misconfiguration might be the cause of some issues regarding
nix-daemon
not behaving well under OOM pressure on NixOS. Also, see systemd/systemd#32687.Priorities
Add :+1: to issues you find important.