"Safe" auto-upgrades - Githubissues

djahandarie commented 6 years ago

We currently have an auto-upgrade mechanism which runs nixos-rebuild switch --upgrade on a systemd timer.

I think it'd be a massive selling point for NixOS if we could do safe auto-upgrades. Namely, give users an option to add in safety checks (e.g., is the httpd returning the correct response to this test request, etc.), which NixOS tries to run after an upgrade, and if any of them fail, it rolls back.

IMO, the lack of this functionality is why auto-upgrading is not a default for Linux boxes in the real world, and NixOS's ability to do clean rollbacks is what can make it possible.

peterhoeg commented 6 years ago

I must admit I'm not particularly hooked - doesn't your monitoring environment already look at this?

7c6f434c commented 6 years ago

@peterhoeg So you think it is better to include some modules that help the monitoring distinguish failures related to upgrades and also start an automatic rollback when an upgrade causes a problem?

I guess in the ideal world some of the checks could be done before finalising the upgrade…

peterhoeg commented 6 years ago

First of all, I'm not at all opposed to the idea of "safe upgrades" (once we agree on what that actually means).

If you care enough about something to want to define checks to deal with upgrades, you probably already care enough to put other things in place to ensure it stays up - the upgrade point is really only one in many.

vcunat commented 6 years ago

Maybe you'd prefer to have a Hydra instance that does the extra tests in a VM, and you only auto-upgrade after those checks pass. Rollbacks seem more suitable for emergency situations that weren't handled automatically. EDIT: oh, NixOS is being used in Akamai :-)

djahandarie commented 6 years ago

@peterhoeg As you say, there are various changes to a system which could result in something becoming unhealthy, and usually "monitoring tools" are the major component in a general control system that is responsible for repairing things (which usually involves a human in the control loop manually responding).

But that doesn't seem like a convincing argument as to why NixOS itself shouldn't have its own control system to keep machines healthy. Its "health check" could be a matter of asking some external service for health information, or we could configure that information per-service, or ..., but given that NixOS can initiate upgrades, it'd also be nice for it to check that they succeeded.

Giving people actuators but then expecting them to build out the rest of the safety control loop seems like a missed opportunity for a safer system.

pierrebeaucamp commented 6 years ago

I'd love to see auto-upgrades. It's one of the reasons why I'm still running CoreOS Container Linux on most of my production machines.

In regards to the health checks, I'd differentiate between system services and user defined applications. I don't think it is common to manually monitor most system services - it's out of scope for most companies. If we're talking about health checks for auto upgrades, I'd expect the OS to check if it can still boot and if it can still bring up my apps.

vcunat commented 6 years ago

@pierrebeaucamp: as written in the OP, there are auto-upgrades already. system.autoUpgrade.enable = true; (Perhaps I misunderstood you.)

pierrebeaucamp commented 6 years ago

I'm aware of that option, but I don't know how safe it would be to run on some production servers (let alone how well it would play with nixops). I just wanted to express that a safe and built-in upgrade mechanism would be welcome. I think OP summed it up pretty nicely:

> IMO, the lack of this functionality is why auto-upgrading is not a default for Linux boxes in the real world, and NixOS's ability to do clean rollbacks is what can make it possible.

Maybe I should have been more clear in my comment, sorry about that.

mogorman commented 6 years ago

It would be nice if we could easily define a script or command to be run at the end of the auto-upgrade, and if it exits with anything but 0, we roll back.
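As a rough illustration of that idea (not an existing NixOS feature), the flow could be sketched as a small wrapper. The function name upgrade_with_check and the REBUILD override are hypothetical, introduced only for this sketch; the only real commands assumed are nixos-rebuild switch --upgrade and nixos-rebuild switch --rollback.

```shell
#!/bin/sh
# Sketch only: upgrade, run a user-supplied check, roll back if it fails.
# REBUILD is a hypothetical override so the logic can be tried with a stub.
REBUILD="${REBUILD:-nixos-rebuild}"

# upgrade_with_check CHECK_CMD [ARGS...]
# Returns 0 if the upgrade and the check succeed,
#         1 if the upgrade itself fails,
#         2 if the check fails (a rollback is then attempted),
#        64 if no check command was given.
upgrade_with_check() {
    if [ "$#" -lt 1 ]; then
        echo "usage: upgrade_with_check CHECK_CMD [ARGS...]" >&2
        return 64
    fi
    if ! "$REBUILD" switch --upgrade; then
        echo "upgrade failed" >&2
        return 1
    fi
    # Run the user's check; any non-zero exit status counts as unhealthy.
    if "$@"; then
        echo "health check passed; keeping new generation"
        return 0
    fi
    echo "health check failed; rolling back" >&2
    "$REBUILD" switch --rollback || echo "rollback failed as well" >&2
    return 2
}
```

A call might look like upgrade_with_check curl -fsS http://localhost/ (the URL is just an example). The rollback can itself fail, so this narrows the risk rather than removing it.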

bjornfor commented 6 years ago

This "Automatic roll back" post hasn't been mentioned yet: https://groups.google.com/forum/#!topic/nix-devel/Th-544aQ8Jk

netixx commented 6 years ago

Maybe a first step could be having an option when doing a nixos-rebuild that enables rollback unless confirmed by the user afterwards. This is especially useful when ssh access is lost after a bad rebuild. Also, this rollback should be reboot persistent (i.e. rollback after reboot if confirmation is not given).

To be clearer, this could be a scenario like this:

nixos-rebuild switch --auto-rollback-in 5m
reboot # optional
nixos-rollback --abort

Such a feature exists in Cisco IOS:

configure terminal revert timer idle 5 ! roll back if no command is entered for 5 minutes, e.g. because ssh access is lost
<configure commands>
exit
configure confirm ! confirm modifications (abort rollback)
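For comparison, a confirm-or-rollback loop can be approximated today with a wrapper script. This is only a sketch under assumptions: switch_with_confirmation, REBUILD and CONFIRM_FILE are names made up for the example, the --auto-rollback-in flag above does not exist, and unlike the proposal this version is not reboot persistent.

```shell
#!/bin/sh
# Sketch only: switch, then wait for an explicit confirmation; roll back
# if none arrives in time. All names below are hypothetical.
REBUILD="${REBUILD:-nixos-rebuild}"
CONFIRM_FILE="${CONFIRM_FILE:-/run/upgrade-confirmed}"

# switch_with_confirmation TIMEOUT_SECONDS
# Returns 0 if confirmed, 1 if the switch fails, 2 after a rollback.
switch_with_confirmation() {
    timeout="$1"
    rm -f "$CONFIRM_FILE"          # a stale confirmation must not count
    "$REBUILD" switch || return 1
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if [ -e "$CONFIRM_FILE" ]; then
            echo "confirmed; keeping new generation"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "no confirmation within ${timeout}s; rolling back" >&2
    "$REBUILD" switch --rollback || echo "rollback failed as well" >&2
    return 2
}
```

The user would confirm from a fresh ssh session with touch /run/upgrade-confirmed, which doubles as proof that ssh still works. The wrapper has to run detached from the login session (for example from its own systemd unit), otherwise losing ssh would kill it before it can roll back.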

zimbatm commented 5 years ago

It wouldn't be too hard or controversial to add an --auto-rollback option to nixos-rebuild that rolls back if any of the systemd units fail to start, and then expose that as an option in the autoUpgrade module.

Then if you want anything more complicated, add the checks to the specific unit's postStart or create a new unit with your checks.

Note that the rollback might be failing as well so it isn't entirely fool-proof.
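Until such an option exists, the "roll back if any unit failed" check could be sketched outside nixos-rebuild roughly as follows. The function names and the REBUILD/SYSTEMCTL overrides are hypothetical; the sketch relies only on systemctl list-units --state=failed and nixos-rebuild switch --rollback.

```shell
#!/bin/sh
# Sketch only: after a switch, roll back if systemd reports failed units.
# REBUILD and SYSTEMCTL are hypothetical overrides for trying this with stubs.
REBUILD="${REBUILD:-nixos-rebuild}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print the names of units in the "failed" state, one per line.
failed_units() {
    "$SYSTEMCTL" list-units --state=failed --no-legend --plain |
        awk '{ print $1 }'
}

# Returns 0 if nothing has failed, 2 if a rollback was attempted.
rollback_if_units_failed() {
    failed="$(failed_units)"
    if [ -z "$failed" ]; then
        echo "no failed units"
        return 0
    fi
    echo "failed units after switch:" >&2
    echo "$failed" >&2
    "$REBUILD" switch --rollback || echo "rollback failed as well" >&2
    return 2
}
```

This version does not distinguish units that were already failed before the upgrade; a fuller version would record the failed set before switching and compare it afterwards.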

djahandarie commented 5 years ago

@zimbatm That seems like a great way to implement this!

coretemp commented 5 years ago

A truly safe upgrade would fully virtualize everything. For example, if ZFS is upgraded, I would like a VM to be started that tests whether a program that reads every file on the file system runs correctly. Unless the outcome of this issue is something like that, I don't see the point.

Similarly for changes like the change of naming schemes of networking devices.

If whatever solution you come up with doesn't account for these issues, I would consider it a useless feature.

symphorien commented 5 years ago

One can probably add NixOS tests to system.extraDependencies. Then it is "only" a matter of writing NixOS tests representative of a specific workflow.

memberbetty commented 5 years ago

Would this also work for https://github.com/NixOS/nixpkgs/issues/52644?

In that case, only when firmware is loaded (not sure whether that's only at boot), it can be determined whether a new configuration works.

Also, by the time a device has lost connectivity, you actually need to physically go to the device in certain environments. Remote management is not available nor practical in all environments.

aanderse commented 5 years ago

I'd like to mention another scenario here. On Debian I can modify my apache configuration files and then run a quick apachectl configtest to ensure I haven't made any mistakes before running systemctl reload apache2.service. This process is not as simple on NixOS (and straight up painful on NixOps) in that I have to make my configuration change (for example: services.httpd.extraConfig = "bad config which will break apache";), run nixos-rebuild build, get the name of the new apache configuration file, then run apachectl configtest new-config-filename and hope the changes to my system aren't dramatic enough to skew the result of apachectl.

So can the definition of a "safe" upgrade include simple file configuration change checks? Obviously it would be very valuable to prevent a system from switching to a new generation automatically if we can determine the new generation is not valid.
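A build-then-validate-then-switch flow along these lines could be sketched as below. It assumes only that nixos-rebuild build leaves a ./result symlink to the new system in the current directory; checked_switch and REBUILD are hypothetical names, and the check command (which would have to locate the generated apache configuration itself) is left to the user.

```shell
#!/bin/sh
# Sketch only: build the new generation, validate it, and switch only if
# the validation passes. REBUILD is a hypothetical override for stubbing.
REBUILD="${REBUILD:-nixos-rebuild}"

# checked_switch CHECK_CMD [ARGS...]
# The check is called with ./result appended as its last argument.
# Returns 1 if the build fails, 2 if the check fails, 64 on misuse,
# otherwise the exit status of the switch.
checked_switch() {
    if [ "$#" -lt 1 ]; then
        echo "usage: checked_switch CHECK_CMD [ARGS...]" >&2
        return 64
    fi
    "$REBUILD" build || return 1
    if ! "$@" ./result; then
        echo "pre-switch check failed; not switching" >&2
        return 2
    fi
    "$REBUILD" switch
}
```

Because the check runs before anything is activated, a bad configuration never becomes the running generation, so no rollback is needed in that case.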

zimbatm commented 5 years ago

One way is to move as many checks as possible to build time. I made a preliminary PR for the httpd one ^^

stale[bot] commented 4 years ago

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

Search for maintainers and people that previously touched the related code and @ mention them in a comment.
Ask on the NixOS Discourse.
Ask on the #nixos channel on irc.freenode.net.

djahandarie commented 4 years ago

Definitely still interested in this, but it requires someone's architecting time.

stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

Ekleog commented 3 years ago

Still important.

netixx commented 3 years ago

This may be of interest (as inspiration or workaround in the meantime): https://github.com/serokell/deploy-rs#magic-rollback

stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

bryanasdev000 commented 3 years ago

Up.

luizberti commented 3 years ago

I had this use case at a previous employer, and would like to pitch it to the discussion:

We had a large IoT fleet, with devices in all kinds of shoddy network conditions that we didn't control. We needed more than "auto-upgrade" in the sense of "the OS": any kind of upgrade ran the risk of severing the connection with the edge node, which would require sending out a field technician to manually flash the stranded device.

Fedora IoT (based on Silverblue, which has a very similar value proposition to NixOS) offers this "healthcheck probes with automated rollback" functionality, through their own framework called greenboot which integrates with systemd.

This kind of functionality is an absolute must for cases where a botched update would strand the node, and it would be amazing if NixOS had pervasive self-stabilizing functionality such as this.

Hopefully this helps elucidate the need for this feature and informs the design of a solution.

Edit: Other good references on this topic:

TUF: The Update Framework Specification
Uptane: Used to ship Over-the-Air updates to automobiles
How Android does OTA updates

stale[bot] commented 2 years ago

I marked this as stale due to inactivity.

magnetophon commented 2 years ago

Still important.

Djabx commented 2 years ago

Hi,

I'm using NixOS on my laptop, which I turn off when I have finished my work.

My problem with the current autoUpgrade feature is:

If I enable the reboot option, my computer may reboot when I don't intend it to.
If autoUpgrade is enabled without reboot (let's say at 1 PM), I may lose my Chrome/Firefox (or whatever) session because the update changes the "current" version.

I suppose a simple solution for my use case would be for autoUpgrade to run nixos-rebuild boot --upgrade instead of nixos-rebuild switch --upgrade.

Djabx commented 2 years ago

I've done a POC here: https://github.com/NixOS/nixpkgs/pull/183307

nixos-discourse commented 4 months ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/ci-cd-rebuilds-via-github/36059/16

NixOS / nixpkgs

"Safe" auto-upgrades #34902