NixOS / nixpkgs

nix offline detection is masking errors, breaking system.autoUpgrade #274146

Open dcarosone opened 7 months ago

dcarosone commented 7 months ago

Describe the bug

autoUpgrade service doesn't fail when steps within the process have errors. nixos-rebuild seems to be swallowing them.

As well as simply not doing the intended job of upgrading, this can actually cause configuration to go backwards.

Steps To Reproduce

Steps to reproduce the behaviour:

  1. Enable the service on a laptop using wifi, with a persistent timer (the default).
  2. Suspend the machine, and resume the following morning after the scheduled timer has expired (04:40 by default).
  3. The service can start immediately on resume, before network connectivity is available. It has a dependency on network-online.target, but unfortunately this is not meaningful after a resume.
  4. The upgrade has no network, and so does not fetch channel updates or update the specified flake, but this doesn't generate an error that systemd sees. The build proceeds anyway.
  5. Even if the --refresh argument is given with a flake, it will use the previously-cached fetch from the last run, which should be considered stale and invalid. The build proceeds anyway.
  6. If the system had been manually updated in the meantime (from a more recent checkout, /etc/nixos/flake.nix for example), the autoUpgrade service will build and switch to the older revision, effectively rolling back unexpectedly.
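
For reference, a configuration along these lines should be enough to hit the steps above. This is a minimal sketch rather than my exact config: the flake URL is the one that appears in the logs further down, and the commented-out values are just the module defaults as I understand them.

```nix
{
  system.autoUpgrade = {
    enable = true;
    # Flake input fetched over ssh; URL taken from the journal output in this issue.
    flake = "git+ssh://rocinante@soft-serve:23231/geek/nixos";
    # Extra arguments passed through to nixos-rebuild.
    flags = [ "--refresh" ];
    # Defaults, shown only for clarity:
    # dates = "04:40";
    # persistent = true;
  };
}
```

With persistent left enabled, a run missed during suspend is started as soon as the machine resumes, which is what lets the service race the wifi connection.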

Expected behaviour

Issues and errors, such as lack of network connectivity for an upgrade, should be considered as errors for the rebuild, and cause the service to fail (so it can optionally then be configured to retry with a delay).

The --refresh argument should consider cached copies of the flake source as invalid (as documented) and refuse to use them.

Screenshots

In the log below, wifi was disabled. The autoUpgrade service is configured with a git+ssh:// flake repo.

Without --refresh in the options list, the ssh errors don't appear, presumably because the 'network-dependent features' have been disabled. With --refresh they're tried anyway but the errors are ignored.

Dec 14 13:51:34 rocinante systemd[1]: Starting NixOS Upgrade...
Dec 14 13:51:34 rocinante nixos-upgrade-start[98047]: warning: you don't have Internet access; disabling some network-dependent features
Dec 14 13:51:34 rocinante nixos-upgrade-start[98047]: [4B blob data]
Dec 14 13:51:34 rocinante nixos-upgrade-start[98051]: ssh: connect to host soft-serve port 23231: Network is unreachable
Dec 14 13:51:34 rocinante nixos-upgrade-start[98050]: fatal: Could not read from remote repository.
Dec 14 13:51:34 rocinante nixos-upgrade-start[98050]: Please make sure you have the correct access rights
Dec 14 13:51:34 rocinante nixos-upgrade-start[98050]: and the repository exists.
Dec 14 13:51:34 rocinante nixos-upgrade-start[98047]: [148B blob data]
Dec 14 13:51:35 rocinante nixos-upgrade-start[98045]: building the system configuration...
Dec 14 13:51:35 rocinante nixos-upgrade-start[98058]: warning: you don't have Internet access; disabling some network-dependent features
Dec 14 13:51:35 rocinante nixos-upgrade-start[98058]: [4B blob data]
Dec 14 13:51:35 rocinante nixos-upgrade-start[98062]: ssh: connect to host soft-serve port 23231: Network is unreachable
Dec 14 13:51:35 rocinante nixos-upgrade-start[98061]: fatal: Could not read from remote repository.
Dec 14 13:51:35 rocinante nixos-upgrade-start[98061]: Please make sure you have the correct access rights
Dec 14 13:51:35 rocinante nixos-upgrade-start[98061]: and the repository exists.
Dec 14 13:51:35 rocinante nixos-upgrade-start[98058]: [148B blob data]
Dec 14 13:51:38 rocinante nixos-upgrade-start[98078]: updating GRUB 2 menu...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: NOT restarting the following changed units: nixos-upgrade.service
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: activating the configuration...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] creating new generation in /run/agenix.d/8
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] decrypting secrets...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: decrypting '/nix/store/hz41qqz5x88yk1jlwsj3shbqx74w904n-nm-geek-env.age' to '/run/agenix.d/8/nm-geek-env'...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] symlinking new secrets to /run/agenix (generation 8)...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] removing old secrets (generation 7)...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: [agenix] chowning...
Dec 14 13:51:39 rocinante nixos-upgrade-start[98078]: setting up /etc...
Dec 14 13:51:40 rocinante nixos-upgrade-start[98078]: reloading user units for dan...
Dec 14 13:51:41 rocinante nixos-upgrade-start[98078]: setting up tmpfiles
Dec 14 13:51:42 rocinante systemd[1]: nixos-upgrade.service: Deactivated successfully.
Dec 14 13:51:42 rocinante systemd[1]: Finished NixOS Upgrade.
Dec 14 13:51:42 rocinante systemd[1]: nixos-upgrade.service: Consumed 2.772s CPU time, no IP traffic.

Additional context

It also seems to rebuild and switch when there's full network connectivity but no new revisions are fetched, regardless of whether this is because (without --refresh) the content is still within TTL, or simply no new revisions are found on the git repo. I don't think this is necessary.

It might be helpful to have an option that's the inverse of the --offline mode that seems to be getting auto-detected, something like --require-online, so that it can bail out directly from this autodetection before even getting to the other steps. But it should still bail on those other errors, and it should definitely not roll back by building and switching to a stale revision.

nixos-discourse commented 7 months ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/deployment-tools-evaluating-nixops-deploy-rs-and-vanilla-nix-rebuild/36388/21

dcarosone commented 7 months ago

Hm, not all errors are getting squashed. From an earlier run, where I hadn't set the ?ref=branchname argument in the flake config:

Dec 12 17:02:02 rocinante systemd[1]: Starting NixOS Upgrade...
Dec 12 17:02:02 rocinante nixos-upgrade-start[27669]: [12B blob data]
Dec 12 17:02:02 rocinante nixos-upgrade-start[27683]: fatal: couldn't find remote ref refs/heads/0d0e27dfa3c393811ea9d2fc6f538e7f17b8772c
Dec 12 17:02:02 rocinante nixos-upgrade-start[27669]: [10B blob data]
Dec 12 17:02:02 rocinante nixos-upgrade-start[27669]:        … while fetching the input 'git+ssh://rocinante@soft-serve:23231/geek/nixos'
Dec 12 17:02:02 rocinante nixos-upgrade-start[27669]:        error: program 'git' failed with exit code 128
Dec 12 17:02:02 rocinante systemd[1]: nixos-upgrade.service: Main process exited, code=exited, status=1/FAILURE
Dec 12 17:02:02 rocinante systemd[1]: nixos-upgrade.service: Failed with result 'exit-code'.
Dec 12 17:02:02 rocinante systemd[1]: Failed to start NixOS Upgrade.
Dec 12 17:02:02 rocinante systemd[1]: nixos-upgrade.service: Consumed 82ms CPU time, received 3.1K IP traffic, sent 4.2K IP traffic.

So.. uhh..

Is the automatic offline detection causing errors from network-using tasks to be ignored? If so, that's terribly counterproductive in at least this case, and it either should be fixed or warrants the --require-online reverse option.

dcarosone commented 7 months ago

After pondering on this for a while, I'm becoming more convinced that the issue is nix itself:

I have masked this with a service preStart that checks ssh connectivity to the git repo server, which will fail and allow systemd retries. But that should not be necessary and these errors should be returned.
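
A rough sketch of that kind of workaround, in case it helps anyone else. This is illustrative only and not necessarily the exact check I run: the git ls-remote probe, the ssh options, and the retry interval are all assumptions to adapt, and the repo URL is the one from the logs above.

```nix
{ pkgs, ... }:
{
  systemd.services.nixos-upgrade = {
    # Runs before the upgrade script. A non-zero exit fails the unit,
    # so nixos-rebuild never gets a chance to build from a stale cached fetch.
    preStart = ''
      export GIT_SSH_COMMAND="${pkgs.openssh}/bin/ssh -o BatchMode=yes -o ConnectTimeout=10"
      if ! ${pkgs.gitMinimal}/bin/git ls-remote ssh://rocinante@soft-serve:23231/geek/nixos >/dev/null; then
        echo "flake repository unreachable, refusing to upgrade" >&2
        exit 1
      fi
    '';
    serviceConfig = {
      # Let systemd retry once the network has had time to come up.
      Restart = "on-failure";
      RestartSec = "5min";
    };
  };
}
```

This only covers the reachability of the one flake repo; it does nothing about other inputs or substituters, which is part of why the errors really ought to be propagated by nix and nixos-rebuild themselves.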