Changing network causes domains using that network to disconnect

cannedmoose commented 9 months ago

Changing the network definition for an active network used in an active domain without changing the domain causes the network to be recreated without the domain and afterwards the domain does not seem to be able to reconnect to the network.

To get the domain to reconnect I have to shut it down and start it back up again manually.

To fix this, I was wondering what you thought of supplying definitions as Nix objects rather than as XML files? Then Nix can work out dependencies and reset domains when needed.

AshleyYakeley commented 9 months ago

I think this would make it impossible to provide definitions at build time? Also the XML is pretty established and I don't want to take that away from people.

I think ideally the module activation script would figure that out from all the XML, if possible.

AshleyYakeley commented 9 months ago

Correct behaviour for changing a network used by an active domain is to deactivate the domain, change the network, and then reactivate the domain, is that right?

AshleyYakeley commented 9 months ago

What if, whenever a network is changed or deleted, all active domains that use it are deactivated beforehand? Wouldn't that solve the problem?

cannedmoose commented 9 months ago

> What if, whenever a network is changed or deleted, all active domains that use it are deactivated beforehand? Wouldn't that solve the problem?

Yeah - I think that should work. It's just the network changing underneath the active domains that seems to be the problem

AshleyYakeley commented 9 months ago

OK, the simplest solution would be to always reactivate the domain afterwards, even if that means deactivating it again when it comes time to do domains. This is because active may be null, which means don't change the active state.

AshleyYakeley commented 9 months ago

Hmm, might be safer to record the active state beforehand somehow. I will have to think about this.
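
A minimal sketch of the "record the active state beforehand" idea, in Python. `FakeObject` and `change_network` are hypothetical names made up for illustration, not NixVirt's or libvirt's API; the sketch only shows the ordering of record, deactivate, change, restore.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeObject:
    """Hypothetical stand-in for a libvirt network or domain."""
    name: str
    active: bool = False
    log: List[str] = field(default_factory=list)

    def deactivate(self) -> None:
        self.active = False
        self.log.append("deactivate")

    def activate(self) -> None:
        self.active = True
        self.log.append("activate")


def change_network(
    network: FakeObject,
    dependents: List[FakeObject],
    apply_change: Callable[[FakeObject], None],
) -> Dict[str, bool]:
    """Change a network, then restore each dependent domain's prior state."""
    # Record the state before touching anything.
    was_active = {d.name: d.active for d in dependents}
    # Deactivate only the domains that are currently running.
    for d in dependents:
        if d.active:
            d.deactivate()
    apply_change(network)
    # Reactivate exactly the domains that were running before.
    for d in dependents:
        if was_active[d.name]:
            d.activate()
    return was_active
```

Because the prior state is recorded, a domain that was already inactive is never started as a side effect of the network change.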

cannedmoose commented 9 months ago

Did some googling and it seems that the "official" solution to arbitrary network changes is to restart the libvirtd service which doesn't sound ideal.

Not sure if there's a way to just reconnect the domains using something like brctl though - I will have a go at getting it reconnected without a domain restart when I run into the issue again.

AshleyYakeley commented 9 months ago

Restarting libvirtd doesn't actually affect active domains (which actually run under QEMU or whatever), does it?

cannedmoose commented 9 months ago

Hmm I'm not sure and have not tested it - but the network bridge itself isn't part of QEMU so libvirt has to do something already to connect it to the VM and it could do that at startup as well as on domain start.

There's a few methods people have come up with for reconnecting without VM restart though:

I'll see if I can get either of these to work plus restarting libvirtd tomorrow and let you know.

I'm experimenting with automatic static IP allocations for my domains, so my network is getting changed pretty often, especially while incorporating NixVirt into my config to replace some declarative QEMU VMs. I don't think it's an issue that will be run into in the common libvirt networking case, though.

cannedmoose commented 9 months ago

Okay so I figured out you can trigger this on every rebuild by declaring a network without a mac address - libvirt redefine then triggers a recreation even if everything else is the same.

For fixing I've tried a few methods for reconnecting and 2 worked - let me know your thoughts or if there's any other method you can think of.

Restart libvirtd

This works and is probably the simplest solution: sudo systemctl restart libvirtd.service

I think it would have to happen as the last step, as restarting would close the libvirt connection. So this is slightly annoying if you're connected with virt-manager.

`brctl` to reconnect

Would need to add a dependency on bridge-utils and do this per domain/network pair for every affected network but it seems like it would have the least side-effects.
Not sure how this would look in python but you should be able to pull the relevant names from the libvirt objects without too much difficulty. The steps I took to do it manually were:

# Find the network interface eg vnet01
virsh --connect qemu:///system domifaddr VM_NAME

# Find the bridge name eg virbr01
virsh --connect qemu:///system net-info NETWORK_NAME

# Connect them (brctl from the bridge-utils package).
# The second argument is the interface found in the first step, not the network name.
sudo brctl addif BRIDGE_NAME INTERFACE_NAME
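
As a rough idea of how the same steps might look in Python, here is a sketch that pulls the names out of libvirt XML and builds the `brctl` command lines without running them. It assumes the XML comes from the live definitions (where libvirt records `<target dev=.../>` on each interface) and that interfaces reference the network by name. The function names are hypothetical and this is not what NixVirt does.

```python
import xml.etree.ElementTree as ET
from typing import List


def bridge_name(network_xml: str) -> str:
    """Return the bridge device of a network, e.g. 'virbr0'."""
    bridge = ET.fromstring(network_xml).find("bridge")
    if bridge is None or not bridge.get("name"):
        raise ValueError("network XML has no <bridge name=...>")
    return bridge.get("name")


def interface_devices(domain_xml: str, network_name: str) -> List[str]:
    """Return host-side devices (e.g. 'vnet0') attached to the named network."""
    devices = []
    for iface in ET.fromstring(domain_xml).findall("./devices/interface"):
        source = iface.find("source")
        target = iface.find("target")
        if source is None or target is None:
            continue
        if source.get("network") == network_name and target.get("dev"):
            devices.append(target.get("dev"))
    return devices


def reconnect_commands(network_xml: str, domain_xml: str) -> List[List[str]]:
    """Build one 'brctl addif BRIDGE DEVICE' command per affected interface."""
    name = ET.fromstring(network_xml).findtext("name")
    if name is None:
        raise ValueError("network XML has no <name>")
    bridge = bridge_name(network_xml)
    return [["brctl", "addif", bridge, dev]
            for dev in interface_devices(domain_xml, name)]
```

Each returned list could then be passed to something like `subprocess.run` with root privileges, once per domain/network pair.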

Redefine the network interface in the domain (didn't work)

From this it looked like I might be able to just redefine the interface on the domain but it didn't reconnect.

virtnetworkd (didn't work)

Tried just restarting virtnetworkd which crashed libvirtd.

Hooks (could be helpful)

There are hooks for networking which could be helpful in using brctl to reconnect networks on an update. You can't call libvirtd functions from within them so I probably wouldn't use it to trigger a libvirtd restart.

AshleyYakeley commented 9 months ago

Just to clarify, what you want is to avoid domain reactivation altogether, right?

cannedmoose commented 9 months ago

Yeah now that I know it's possible my ideal would be the network can be updated without domain downtime.

I also understand this might be out of scope for what you want NixVirt to manage, so I'm happy to manage something in my own config!

AshleyYakeley commented 9 months ago

Yeah I think preventing domain reactivation is out of scope for the time being. That said, what I do want to ensure is:

- correctness: after NixOS/HM activation, the running configuration matches the requested configuration
- idempotency: activating with no changes to the requested configuration does not change the running configuration, and does not stop or start domains

cannedmoose commented 9 months ago

Hmm in this case I think then there might need to be some work to ensure both at once - libvirt docs say that not specifying a mac address for network definitions is recommended.

But that leads to the problem where the active config can never actually match the requested one without something extra to bridge the gap of elements that are added by libvirt and not in the original spec XML

AshleyYakeley commented 9 months ago

libvirt adds lots of elements: things like PCI addresses of devices in domains. That's actually not a problem.

NixVirt does this for all object types:

1. fetch the definition XML
2. push the requested definition XML
3. fetch the definition XML
4. compare the old and new fetched XML to see if anything has changed
5. if the XML has changed, and the object is active, then deactivate it (it might get reactivated later)
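
The steps above can be sketched roughly as follows. `FakeDefinition` is a hypothetical stand-in that mimics libvirt adding an element the request did not contain; the plain string comparison is an assumption made for brevity, and none of this is NixVirt's actual code.

```python
class FakeDefinition:
    """Hypothetical object whose stored XML gains an extra element on push."""

    def __init__(self, stored_xml: str, active: bool = True) -> None:
        self.stored_xml = stored_xml
        self.active = active

    def fetch(self) -> str:
        return self.stored_xml

    def push(self, requested_xml: str) -> None:
        # Pretend libvirt fills in an element the request did not specify.
        self.stored_xml = requested_xml.replace("</domain>", "<address/></domain>")

    def deactivate(self) -> None:
        self.active = False


def redefine(obj: FakeDefinition, requested_xml: str) -> bool:
    """Fetch, push, fetch again, and deactivate only if the stored XML moved."""
    before = obj.fetch()
    obj.push(requested_xml)
    after = obj.fetch()
    changed = before != after
    if changed and obj.active:
        obj.deactivate()  # it might get reactivated later
    return changed
```

Comparing the two fetched definitions, rather than the fetched one against the request, is what makes elements added by libvirt harmless as long as they are added the same way every time.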

cannedmoose commented 9 months ago

Ah sorry, I missed the first fetch when looking over the code and thought it was just comparing what's there to what was requested.

I tested for unspecified mac addresses on network interfaces in domain definitions and they will also change mac addresses causing a reset of the VM every activation.

I wonder why libvirt doesn't give stable mac addresses... I might do some investigation when I have some time.

I guess the question is how NixVirt deals with this and whether there are any other elements with a similar problem. My thoughts are:

1. Give a warning if mac addresses (or similar) are not specified that the config is under specified but allow it
2. Don't allow configs without mac addresses or other elements that libvirt will change underneath us
3. Do a three way comparison with the 2 fetched and definition configs - ignore differing elements between the fetched ones that aren't specified in the original definition.

For my use I would prefer 3, but I understand it would put a lot of extra logic into virtdeclare. Otherwise I think 1 is the way to go, just because they are still valid configs, but it's nice to be given a warning when you're going to shoot yourself in the foot.
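
A rough sketch of what the three way comparison could look like in Python, using only the standard library. The helper names are hypothetical, and matching elements by tag and position is a simplifying assumption that would break if libvirt reorders or inserts sibling elements.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def flatten(elem: ET.Element, path: Optional[str] = None) -> Dict[str, str]:
    """Map every attribute and text node to a positional path."""
    path = path or elem.tag
    out = {f"{path}@{key}": value for key, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        out[f"{path}#text"] = text
    seen: Dict[str, int] = {}
    for child in elem:
        index = seen.get(child.tag, 0)
        seen[child.tag] = index + 1
        out.update(flatten(child, f"{path}/{child.tag}[{index}]"))
    return out


def significant_change(spec_xml: str, before_xml: str, after_xml: str) -> bool:
    """True if the two fetched definitions differ in something the spec sets."""
    spec = flatten(ET.fromstring(spec_xml))
    before = flatten(ET.fromstring(before_xml))
    after = flatten(ET.fromstring(after_xml))
    differing = {key for key in before.keys() | after.keys()
                 if before.get(key) != after.get(key)}
    # Differences in values the spec never mentioned are treated as libvirt's own.
    return any(key in spec for key in differing)
```

With this, a mac address that libvirt regenerates would not count as a change unless the original definition specified one.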

cannedmoose commented 9 months ago

Ah also I just realized I was only thinking about the nix domain definitions - for the warning/not allow case. For raw XML definitions there will have to be extra logic in virtdeclare anyway to implement those...

My main use case is defining fully declarative Nix VMs similar to what the qemu-vm module does at the moment, so the (probably more common) case of supplying an XML file as input was not in my head at all.

AshleyYakeley commented 9 months ago

OK, see #9 for the mac address issue.

AshleyYakeley commented 9 months ago

> Changing the network definition for an active network used in an active domain without changing the domain causes the network to be recreated without the domain and afterwards the domain does not seem to be able to reconnect to the network.

Could you clarify what change you made to the network definition? Does the domain reconnect to the network if it requests a new DHCP lease?

AshleyYakeley commented 9 months ago

OK, I believe this is fixed in master. Now when an interface definition is changed, any domains that use it are deactivated, and reactivated (unless set to inactive). Reopen if it doesn't do this for you.

cannedmoose commented 9 months ago

> > Changing the network definition for an active network used in an active domain without changing the domain causes the network to be recreated without the domain and afterwards the domain does not seem to be able to reconnect to the network.
>
> Could you clarify what change you made to the network definition? Does the domain reconnect to the network if it requests a new DHCP lease?

I realized you're probably using a bridge network and I'm using a NAT as direct ethernet bridges don't work out of the box with libvirt on wifi.

Renewing DHCP doesn't work. From the guest's side it's a dead connection; the host needs to re-establish the connection to get anything on the network for the guest.

I've uploaded example config XML here: https://github.com/cannedmoose/nixvirt_example

Sorry, the VM will not run out of the box as it references a bunch of stuff in my store, but you should be able to just copy the network interface out of that into something that does. Or use a network snippet like this in one of your guests with the supplied network def:

interface =
        {
          type = "network";
          mac = { address = mac_address; };
          model =  { type = "virtio"; };
          source = { bridge = "virbr0"; };
        };

cannedmoose commented 9 months ago

> OK, I believe this is fixed in master. Now when an interface definition is changed, any domains that use it are deactivated, and reactivated (unless set to inactive). Reopen if it doesn't do this for you.

With the above in mind, it's not fixed for me, but it looks like that's just because you're only looking for bridge interfaces to decide whether to reset, and mine are type network.
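
To illustrate, a dependency check that covers both interface types might look something like this sketch. `domain_uses_network` is a hypothetical helper, not NixVirt's code; it assumes `type='network'` interfaces reference the network by its `<name>` text and `type='bridge'` interfaces reference the bridge device.

```python
import xml.etree.ElementTree as ET


def domain_uses_network(domain_xml: str, network_xml: str) -> bool:
    """True if any interface in the domain is attached to the given network."""
    network = ET.fromstring(network_xml)
    net_name = network.findtext("name")  # the element's text, not the element
    bridge = network.find("bridge")
    bridge_dev = bridge.get("name") if bridge is not None else None

    for iface in ET.fromstring(domain_xml).findall("./devices/interface"):
        source = iface.find("source")
        if source is None:
            continue
        kind = iface.get("type")
        # NAT/routed setups usually reference the libvirt network by name.
        if kind == "network" and net_name and source.get("network") == net_name:
            return True
        # Bridge interfaces reference the bridge device directly.
        if kind == "bridge" and bridge_dev and source.get("bridge") == bridge_dev:
            return True
    return False
```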

I can't seem to re-open, not sure I have the permission.

cannedmoose commented 9 months ago

I did a bit of hacking on alternative solutions for the mac address and reconnect issue you can see here: https://github.com/cannedmoose/NixVirt/pull/1/files

Let me know if you want me to clean up any of it to submit a pull request - the gist is:

- Using xmldiff to check for changed mac addresses. If that's the only change and a mac wasn't in the spec don't count it as a change to avoid resetting the machine/network.
- On network change use brctl to attempt a reconnect for affected domains instead of resetting them

There are some missing edge cases around when to attempt a reset, and I don't take bridge networks into account, but I think they would share similar logic to the NAT-based one I'm testing on.

AshleyYakeley commented 9 months ago

OK, I believe this should be fixed for you, your domain should now deactivate/reactivate when you change your network.

I've created #11 for reconnecting without deactivating instead.

cannedmoose commented 9 months ago

Still not fixed, with the same config as what I shared. When I change the MAC address of the network, the domain does not restart and the network is disconnected, but if I manually deactivate and reactivate, the network reconnects.

Systemd logs for nixvirt service:

Feb 24 16:40:52 beeper systemd[1]: Starting Configure libvirt objects...
Feb 24 16:40:52 beeper nixvirt-start[1420663]: network db5e67da-a24e-4c74-8f77-d1a87d962a66: redefine
Feb 24 16:40:52 beeper nixvirt-start[1420663]: network db5e67da-a24e-4c74-8f77-d1a87d962a66: changed
Feb 24 16:40:52 beeper nixvirt-start[1420663]: network db5e67da-a24e-4c74-8f77-d1a87d962a66: deactivate (temporary)
Feb 24 16:40:52 beeper nixvirt-start[1420663]: domain 2904419d-b283-4cfd-9f2c-7c3713ff809f: redefine
Feb 24 16:40:52 beeper nixvirt-start[1420663]: domain 2904419d-b283-4cfd-9f2c-7c3713ff809f: unchanged
Feb 24 16:40:52 beeper nixvirt-start[1420663]: network db5e67da-a24e-4c74-8f77-d1a87d962a66: activate
Feb 24 16:40:52 beeper systemd[1]: nixvirt.service: Deactivated successfully.
Feb 24 16:40:52 beeper systemd[1]: Finished Configure libvirt objects.

Looks like the domain isn't picked up as a dependency. I will have some time to look deeper into it next week.

cannedmoose commented 9 months ago

Just a small typo - needed to get the name element text - have opened a PR with a fix.

https://github.com/AshleyYakeley/NixVirt/pull/12
