NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

Activation script snippet "nix" failed #72372

Closed flokli closed 4 years ago

flokli commented 5 years ago

Describe the bug

On switching, NixOS stops the nix-daemon, then parts of the "nix" snippet of the activation script fail, and then it starts the nix-daemon again.

To Reproduce

Steps to reproduce the behavior:

  1. ...nixos-rebuild switch on master, with a changed nix-daemon

Expected behavior

nix-daemon updates are handled in a more graceful fashion.

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 5.3.7, NixOS, 20.03.git.64eab81 (Markhor)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.1`

Output:

stopping the following units: NetworkManager-wait-online.service, NetworkManager.service, accounts-daemon.service, alsa-store.service, audit.service, avahi-daemon.service, avahi-daemon.socket, bluetooth.service, colord.service, cups-browsed.service, cups.service, cups.socket, docker-prune.timer, home-manager-flokli.service, kmod-static-nodes.service, network-link-vboxnet0.service, network-local-commands.service, nix-daemon.service, nix-daemon.socket, nix-gc.timer, nscd.service, powertop.service, rngd.service, rtkit-daemon.service, systemd-binfmt.service, systemd-machined.service, systemd-modules-load.service, systemd-networkd-wait-online.service, systemd-networkd.service, systemd-resolved.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service, tlp.service, udisks2.service, upower.service, vboxnet0.service, wpa_supplicant.service
NOT restarting the following changed units: display-manager.service, getty@tty1.service, libvirt-guests.service, libvirtd.service, systemd-backlight@backlight:intel_backlight.service, systemd-backlight@leds:dell::kbd_backlight.service, systemd-fsck@dev-disk-by\x2duuid-027E\x2d4751.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-tmpfiles-setup.service, systemd-udev-settle.service, systemd-update-utmp.service, systemd-user-sessions.service, user-runtime-dir@1000.service, user@1000.service
activating the configuration...
setting up /etc...
error: cannot connect to daemon at '/nix/var/nix/daemon-socket/socket': Connection refused
Activation script snippet 'nix' failed (1)
restarting systemd...
reloading user units for flokli...
setting up tmpfiles
reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, sys-fs-fuse-connections.mount, sys-kernel-debug.mount, tmp.mount
restarting the following units: polkit.service, sshd.service, systemd-journald.service
starting the following units: NetworkManager-wait-online.service, NetworkManager.service, accounts-daemon.service, alsa-store.service, audit.service, avahi-daemon.socket, bluetooth.service, colord.service, cups-browsed.service, cups.socket, docker-prune.timer, home-manager-flokli.service, kmod-static-nodes.service, network-link-vboxnet0.service, network-local-commands.service, nix-daemon.socket, nix-gc.timer, nscd.service, powertop.service, rngd.service, rtkit-daemon.service, systemd-binfmt.service, systemd-machined.service, systemd-modules-load.service, systemd-networkd-wait-online.service, systemd-networkd.service, systemd-resolved.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, tlp.service, udisks2.service, upower.service, vboxnet0.service, wpa_supplicant.service
the following new units were started: docker.service, docker.socket, var-lib-docker-btrfs.mount
warning: error(s) occurred while switching to the new configuration

jluttine commented 4 years ago

If I reboot after this error, will the system be ok then? Is it just that the switching of the running system somehow doesn't work properly but if you reboot afterwards, it works fine? Or is there something broken in the system if I see this error and rebooting doesn't help?

mkg20001 commented 4 years ago

I found that this error usually goes away after re-running the rebuild command.
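A minimal sketch of that workaround, assuming the failure is transient and a second run succeeds. The `retry` helper is hypothetical (it is not part of NixOS); it simply re-runs a command a bounded number of times and stops at the first success.

```shell
# Hypothetical helper, not part of NixOS: run a command up to N times,
# returning 0 as soon as one attempt succeeds and 1 if all attempts fail.
retry() {
  _retry_max=$1
  shift
  _retry_i=1
  while [ "$_retry_i" -le "$_retry_max" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $_retry_i of $_retry_max failed" >&2
    _retry_i=$((_retry_i + 1))
  done
  return 1
}

# Usage (as root on a NixOS machine):
#   retry 2 nixos-rebuild switch
```

This only papers over the symptom; the ordering problem discussed in the rest of the thread remains.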

symphorien commented 4 years ago

This now also affects 19.09 (stable release). I think https://github.com/NixOS/nixpkgs/pull/76785 is the cause.

mkg20001 commented 4 years ago

I think it happens during large upgrades during which nix is upgraded. It would make sense to not restart nix mid-script but rather do it at the end, if that's what's causing it.

cleverca22 commented 4 years ago

The `${nix}/bin/nix ping-store --no-net` call within the activation script should probably be changed to `${nix}/bin/nix ping-store --no-net --store local`.

That tells nix to just open /nix directly, rather than reaching out to a nix-daemon to get things done.
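As a rough illustration of the suggested change, the call could be wrapped as below. The function name `ping_local_store` and passing the binary path as an argument are assumptions made for this sketch; the flags are exactly the ones quoted in the comment above.

```shell
# Hypothetical wrapper: $1 is the path to the nix binary
# (the activation script refers to it as ${nix}/bin/nix).
ping_local_store() {
  nix_bin=$1
  # "--store local" asks nix to open the store directly instead of
  # connecting to the daemon socket, which may be down during a switch.
  "$nix_bin" ping-store --no-net --store local
}

# Usage (assumes nix is on PATH and the caller may open the store directly):
#   ping_local_store "$(command -v nix)"
```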

flokli commented 4 years ago

@cleverca22 this will still fail parts of activation if you restart services somehow interacting with the nix store. An alternative would be to restart the nix daemon if it has changed before doing that for all other units.

DianaOlympos commented 4 years ago

Do we have a fix somewhere? It is happening in prod and breaking our deploys on 19.09.

I am happy to write a PR, I'm just not sure I understand what to touch.

flokli commented 4 years ago

@DianaOlympos I assume nixos/modules/system/activation/switch-to-configuration.pl needs to be updated to restart nix-daemon.service (if necessary), then restart the rest of the services.
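The ordering idea can be sketched in shell as follows. This is not the code in switch-to-configuration.pl (which is Perl and handles stopping and starting separately); `order_units` is a hypothetical helper that only shows the intended order: nix-daemon units first, everything else afterwards.

```shell
# Hypothetical helper: print the given unit names one per line,
# with nix-daemon.* units first and all other units after them,
# preserving the relative order within each group.
order_units() {
  for unit in "$@"; do
    case $unit in
      nix-daemon.*) printf '%s\n' "$unit" ;;
    esac
  done
  for unit in "$@"; do
    case $unit in
      nix-daemon.*) ;;
      *) printf '%s\n' "$unit" ;;
    esac
  done
}

# Usage sketch (as root): restart units one at a time in that order.
#   order_units cups.service nix-daemon.service sshd.service |
#     while read -r u; do systemctl restart "$u"; done
```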

DianaOlympos commented 4 years ago

Oh my, I will really write some Perl. Not sure I am the best person for this one :smile: Especially the switch, I had problems reading it before. OK, I will try to have a look if no one else can.

It does affect 19.09

DianaOlympos commented 4 years ago

So I think it is even worse.

https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/system/activation/switch-to-configuration.pl#L372-L386

This will stop the previously running nix-daemon.service, but the activation phase needs it. So I am not 100% sure what to do here. I can't restart the daemon because I am not activated yet... no? Or should we filter nix-daemon out of the stop list and then restart it at the end? But then we may have a nix version that is not the one used by the nix-daemon.

Or I missed something.

nixos-discourse commented 4 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/help-wanted-updating-nix-as-part-of-nixos-rebuild-switch/5785/1

shurrman commented 4 years ago

Facing same issue on 19.09 :-( Still no fix/crunch?

DianaOlympos commented 4 years ago

https://discourse.nixos.org/t/help-wanted-updating-nix-as-part-of-nixos-rebuild-switch/5785/2?u=dianaolympos

This is the best we have, from @mkg20001, but I have neither the time nor the brain power to do it right now.

If someone wants to do it and push it into 20.03, that would be nice. It will not solve the problem we face when going from 19.09 as released (AMI) to 19.09 current stable, but at least it would provide a path forward.

shurrman commented 4 years ago

I actually tried disabling restarting/reloading for both services; it did not help. (I use NixOps to deploy 19.09 to AWS EC2 running the NixOS 19.09 AMI.)

```nix
/*
systemd.services.nix = {
  reloadIfChanged = false;
  restartIfChanged = false;
  stopIfChanged = false;
};

systemd.services.nix-daemon = {
  reloadIfChanged = false;
  restartIfChanged = false;
  stopIfChanged = false;
};
*/
```

nixos-discourse commented 4 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/spurious-errors-while-rebuilding/4782/2

LnL7 commented 4 years ago

Fixed in #87182.

majewsky commented 4 years ago

IIUC, this fix is only on master, not on the 20.03 branch? Just want to confirm since I'm still seeing this issue on systems running 20.03.

Ekleog commented 4 years ago

I've just submitted https://github.com/NixOS/nixpkgs/pull/89191 which hopefully backports the relevant fixes from #87182, so as to fix this on 20.03 without breaking backwards-compatibility on the API of nixos-install.

Feel free to test and confirm this works as intended! Reopening as missing backport for the time being :)