AshleyYakeley / NixVirt

LibVirt domain management for Nix
MIT License
192 stars 21 forks source link

Add connection option `deleteOldIfChanged` and support non-destructive reloading of `nixvirt.service` #51

Closed 3442 closed 1 month ago

3442 commented 3 months ago

Right now, nixvirt-module-helper deletes all libvirt objects known to it and then recreates them. This results in domains getting suddenly shutdown and restarted if nixvirt.service changes in any way after a NixOS configuration switch.

With this PR, we now reload (rather than restart) nixvirt.service if the unit file changes due to nixos-rebuild switch and allow controlling the object deletion behavior per connection according to user preference. Default behavior (deleteOldIfChanged = true) is backwards compatible.

nixvirt.service is now marked reloadIfChanged = true (implying restartIfChanged = false) and RemainAfterExit=yes (in order for reloads to work). If deleteOldIfChanged = true for a given connection URI, both restarts and reloads will behave the same as restarts did before this change.

Currently, deleteOldIfChanged = false does not have any effect in Home Manager, since nixvirt-module-helper is run as an activation script. It will only work if NixVirt is used as a NixOS module.

AshleyYakeley commented 3 months ago

OK, I'm trying to figure this out...

nixos-rebuild switch can cause the nixvirt systemd service unit file to change, and in that event, objects such as domains should not be deleted and recreated, because their definitions have not changed, is that correct?

How does nixos-rebuild switch cause the nixvirt.service unit file to change? Is it because parts of it now point to different derivations in the nix store?

3442 commented 3 months ago

nixos-rebuild switch can cause the nixvirt systemd service unit file to change, and in that event, objects such as domains should not be deleted and recreated, because their definitions have not changed, is that correct?

Yeah, that's right, this is the default behavior for any enabled systemd unit. If the unit has changed by pointing to different derivations in the store, then nixos-rebuild will restart it. In the case of nixvirt.service, this means running the module helper again, and thus deleting every object managed by nixvirt, including domains and networks. I see several different scenarios on which this might be undesirable:

All of this happens because the dependencies of nixvirt.service change in some way (usually the json file passed to the module helper). The solution is setting restartIfChanged = false for the service. In nixpkgs, libvirtd.service is already setup this way due to similar reasons (see NixOS/nixpkgs/commit/15351c478046935ae4a5504d492c7db1ad0fa31e). We still want the script to run after a rebuild, so we set reloadIfChanged = true and don't pass --delete-old to the script in the case of a reload.

How does nixos-rebuild switch cause the nixvirt.service unit file to change? Is it because parts of it now point to different derivations in the nix store?

Exactly

AshleyYakeley commented 3 months ago
  1. When the service reloads, should nixvirt-module-helper be called at all?

  2. nixvirt-module-helper doesn't delete an object if it knows it hasn't changed. But for some domains, the definition looks different even when it isn't, and so NVMH deletes it and recreates it anyway. Is this what is happening here? You can switch on verbose to see the XML diff.

3442 commented 3 months ago
  1. When the service reloads, should nixvirt-module-helper be called at all?

Yes. If we don't call the module helper on reload, then we have two other options for restart:

  1. Preserve the current restart behavior, possibly impacting running VMs/containers.
  2. "Fix" restart. This still forces domain shutdowns at rebuild time if dom definitions change, because that's what the module helper does, even if undesirable to the user.

In my opinion, it is okay for restart to behave the way it does now (in some cases), as long as there's a way to disable the destroy+deactivate+recreate sequence.

2. `nixvirt-module-helper` doesn't delete an object if it knows it hasn't changed. But for some domains, the definition looks different even when it isn't, and so NVMH deletes it and recreates it anyway. Is this what is happening here? You can switch on `verbose` to see the XML diff.

This is what I get with verbose without changing anything else:

nixvirt-reload[14276]: Examining settings
nixvirt-reload[14276]: domain: reading /nix/store/...-NixVirt-domain-vm
nixvirt-reload[14276]: domain 127c0303-a577-42c0-b640-febb23fa06d1: assigning MAC address 52:54:00:11:04:4f
nixvirt-reload[14276]: Deleting old objects
nixvirt-reload[14276]: Defining new objects
nixvirt-reload[14276]: domain 127c0303-a577-42c0-b640-febb23fa06d1: redefine
nixvirt-reload[14276]: domain 127c0303-a577-42c0-b640-febb23fa06d1: changed:
nixvirt-reload[14276]: [move, /domain/devices/controller[2], /domain/devices[1], 8]
nixvirt-reload[14276]: [move, /domain/devices/controller[2], /domain/devices[1], 8]
nixvirt-reload[14276]: domain 127c0303-a577-42c0-b640-febb23fa06d1: deactivate
nixvirt-reload[14276]: Defining pool volumes
nixvirt-reload[14276]: Setting activation state
nixvirt-reload[14276]: domain 127c0303-a577-42c0-b640-febb23fa06d1: activate
nixvirt-reload[14276]: Done

The dom then reboots immediately. According to these log it does get deactivated, redefined and reactivated. I think maybe you're right in that it doesn't get destroyed? Not sure about that, but the end result is the same: unsolicited changes to the dom status. Even if I had change the dom's definition, I still want to prevent the helper from deactivating it right after rebuild. Also, libvirt is able to change an object's definition dynamically (e.g. virsh attach-device, filling unspecified target PCI addresses, automatically inserting controllers, etc). The [move]s in the diff may be related to that.

Regarding my use case, I expect to be able to disable the helper's ability to shutdown domains. Even if the helper were able to detect unchanged configs perfectly, I prefer to manage my VMs' power-off manually (or by triggering a nixvirt.service restart in some way). However, the current behavior leads to minor unplanned service interruptions.

AshleyYakeley commented 3 months ago

Regarding my use case, I expect to be able to disable the helper's ability to shutdown domains. Even if the helper were able to detect unchanged configs perfectly, I prefer to manage my VMs' power-off manually (or by triggering a nixvirt.service restart in some way). However, the current behavior leads to minor unplanned service interruptions.

OK, while I would like to improve NixVirt's ability to detect config changes regardless, I can understand the desirability of allowing disabling of domain deactivations.

Regarding this:

                  reload = script { reload = true; };
                  script = script { reload = false; };

Under what circumstances would script be called instead of reload?

3442 commented 3 months ago

Regarding this:

                  reload = script { reload = true; };
                  script = script { reload = false; };

Under what circumstances would script be called instead of reload?

script is the service's ExecStart (called by systemctl start/restart), while reload becomes ExecReload (called by systemctl reload). script will run after system boot and any time the unit is started after being stopped (stopping the unit is a nop for now, it just changes the unit's state from active to inactive). reload will run only if the service is explicitly reloaded, either manually by the user or automatically by the NixOS activation script. However, note that if deleteOldIfChanged = true for all libvirt connections, then script and reload will perform the same operation and both ExecStart/ExecReload will point to the same store path, this is equivalent to current behavior.

Here's how nixvirt.service looks like with deleteOldIfChanged = false:

[Unit]
After=libvirtd.service
Description=Configure libvirt objects
Requires=libvirtd.service

[Service]
Environment="LOCALE_ARCHIVE=/nix/store/naa6s54b4hc55740lblbs9phv7xk317h-glibc-locales-2.39-52/lib/locale/locale-archive"
Environment="PATH=/nix/store/bsc6xqbajrb1dxg8ck5vpqdshvkpqmvc-qemu-utils-8.2.4/bin:/nix/store/6rc2wy949pzd5bq2ni8r1psfl6dnz19h-swtpm-0.8.2/bin:/nix/store/php4qidg2bxzmm79vpri025bqi0fa889-coreutils-9.5/bin:/nix/store/jjcsr5gs4qanf7ln5c6wgcq4sn75a978-findutils-4.9.0/bin:/nix/store/28gpmx3z6ss3znd7fhmrzmvk3x5lnfbk-gnugrep-3.11/bin:/nix/store/5zjms21vpxlkbc0qyl5pmj2sidfmzmd7-gnused-4.9/bin:/nix/store/xjiifrz7ha6s29gp0p0j3w0155phxmia-systemd-255.6/bin:/nix/store/bsc6xqbajrb1dxg8ck5vpqdshvkpqmvc-qemu-utils-8.2.4/sbin:/nix/store/6rc2wy949pzd5bq2ni8r1psfl6dnz19h-swtpm-0.8.2/sbin:/nix/store/php4qidg2bxzmm79vpri025bqi0fa889-coreutils-9.5/sbin:/nix/store/jjcsr5gs4qanf7ln5c6wgcq4sn75a978-findutils-4.9.0/sbin:/nix/store/28gpmx3z6ss3znd7fhmrzmvk3x5lnfbk-gnugrep-3.11/sbin:/nix/store/5zjms21vpxlkbc0qyl5pmj2sidfmzmd7-gnused-4.9/sbin:/nix/store/xjiifrz7ha6s29gp0p0j3w0155phxmia-systemd-255.6/sbin"
Environment="TZDIR=/nix/store/y6hmqbmbwq0rmx1fzix5c5jszla2pzmp-tzdata-2024a/share/zoneinfo"
X-ReloadIfChanged=true
ExecReload=/nix/store/igal4hdkc9wj5gbjf8hnyl6xgap4f3ry-unit-script-nixvirt-reload/bin/nixvirt-reload
ExecStart=/nix/store/zb4b8drrindrj8dgxp45y4gaf0bmmpgj-unit-script-nixvirt-start/bin/nixvirt-start 
RemainAfterExit=true
Type=oneshot

[Install]
WantedBy=multi-user.target

X-ReloadIfChanged=true tells the activation script to reload rather than restart the unit if the service's derivation has changed (X- settings are ignored by systemd, they're only understood by NixOS activation). RemainAfterExit=true makes the service reloadable, even if it has type oneshot, by keeping it active after script has exited. It also opens up the possibility for NixVirt to implement some kind of shutdown/ExecStop script in the future.

AshleyYakeley commented 2 months ago

I've added a per-object restart flag that controls whether to restart (i.e., deactivate and reactivate) an object during NixOS/HM activation. It's either true or false or null, if null it attempts to detect whether the definition has changed. This detection has also been improved in the case of domains.

Not sure if this PR is still needed?

3442 commented 2 months ago

I've added a per-object restart flag that controls whether to restart (i.e., deactivate and reactivate) an object during NixOS/HM activation. It's either true or false or null, if null it attempts to detect whether the definition has changed. This detection has also been improved in the case of domains.

Not sure if this PR is still needed?

Hi, sorry for disappearing, I've been pretty busy lately and forgot about this.

I think the new restart flag might do it for me, I'll let you know asap and close the PR if that's the case.