NixOS / nixops

NixOps is a tool for deploying to NixOS machines in a network or cloud.
https://nixos.org/nixops
GNU Lesser General Public License v3.0

systemd service failure inhibits system configuration activation #1535

Open PAI5REECHO opened 1 year ago

PAI5REECHO commented 1 year ago

Whenever a nixops deployment targets a system that has a systemd service in an activating (auto-restart) or failed state, the deployment fails. I don't understand why nixops is designed this way.

test.........> setting up tmpfiles
test.........> the following new units were started: systemd-coredump@194-238204-0.service
test.........> warning: the following units failed: restic-backups-external.service
test.........> 
test.........> ● test.service - test
test.........>      Loaded: loaded (/etc/systemd/system/test.service; linked; preset: enabled)
test.........>      Active: activating (auto-restart) since Sun 2022-08-14 12:00:08 UTC; 2h 9min ago
test.........> TriggeredBy: ● test.timer
test.........>    Main PID: 8780 (code=exited, status=1/FAILURE)
test.........>         CPU: 512ms
test.........> error: Traceback (most recent call last):
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 906, in worker
    raise Exception(
Exception: unable to activate new configuration (exit code 4)

Traceback (most recent call last):
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/bin/.nixops-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/__main__.py", line 56, in main
    args.op(args)
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/script_defs.py", line 715, in op_deploy
    depl.deploy(
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1365, in deploy
    self.run_with_notify("deploy", lambda: self._deploy(**kwargs))
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1354, in run_with_notify
    f()
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1365, in <lambda>
    self.run_with_notify("deploy", lambda: self._deploy(**kwargs))
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1300, in _deploy
    self.activate_configs(
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 947, in activate_configs
    raise Exception(
Exception: activation of 1 of 1 machines failed (namely on ‘test’)

roberth commented 1 year ago

Me neither, if what you're saying is that something was skipped because of the error.

Stopping a deployment halfway is incompatible with declarative deployments that do not specify dependencies (we don't), and it is also incompatible with the idea of letting the distributed system converge towards an acceptable (or fully) operational state. That said, using the deployment process for feedback about the system seems useful. Did your deployment skip anything because of the error? If so, that would be an issue that needs correcting.

Also, we shouldn't be emitting a stack trace for this type of error, and the log should be clear about what did and did not happen.

TODO

PAI5REECHO commented 1 year ago

> Did your deployment skip anything because of the error?

Yes, the system activation fails because of a failing or pending systemd service, so no changes to the system are applied, which is unexpected. Activation shouldn't depend on the health of systemd services.
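
To make the reporting idea from this thread concrete, here is a minimal sketch of how a deploy tool could turn the activation exit code into a plain log line instead of a stack trace. It is not the nixops implementation. The names `ActivationResult`, `classify_activation`, and `summarize` are hypothetical, and treating exit code 4 as "the switch ran, but some units failed" is an assumption based on the "warning: the following units failed" line that precedes "exit code 4" in the log above.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the activation script exits with 4 when the switch ran
# but one or more units failed afterwards (inferred from the log above).
UNITS_FAILED = 4


@dataclass
class ActivationResult:
    machine: str
    applied: bool  # whether we consider the new configuration applied
    message: str   # human-readable line for the deployment log


def classify_activation(machine: str, exit_code: int) -> ActivationResult:
    """Map an activation exit code to a result, without raising."""
    if exit_code == 0:
        return ActivationResult(machine, True, "configuration activated")
    if exit_code == UNITS_FAILED:
        return ActivationResult(
            machine,
            True,
            "configuration activated, but some units failed "
            "(see `systemctl --failed` on the machine)",
        )
    return ActivationResult(
        machine, False, f"activation failed (exit code {exit_code})"
    )


def summarize(results: List[ActivationResult]) -> str:
    """Build a log summary that states what did and did not happen."""
    lines = [f"{r.machine}: {r.message}" for r in results]
    failed = [r.machine for r in results if not r.applied]
    lines.append(f"{len(failed)} of {len(results)} machines failed activation")
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize([classify_activation("test", 4)]))
```

With this split, a unit stuck in auto-restart shows up as a warning in the summary, while only exit codes outside the assumed "units failed" case count as a failed activation. Whether exit code 4 should be tolerated at all is exactly the design question raised above.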