canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

OVN NIC goes down after changing a device setting #11317

Closed simondeziel closed 1 year ago

simondeziel commented 1 year ago

Issue description

On an instance attached to an OVN network, changing a config key on the NIC causes the NIC to go 'DOWN'.

Steps to reproduce

  1. Check the state of eth0:
    $ lxc info hdc:mx2 | grep -A 2 'eth0:$'
    eth0:
      Type: broadcast
      State: UP
  2. Change a NIC config:
    $ lxc config device set hdc:mx2 eth0 ipv4.routes 172.24.0.0/16
  3. Check the state of eth0:
    $ lxc info hdc:mx2 | grep -A 2 'eth0:$'
    eth0:
      Type: broadcast
      State: DOWN

Note: unsetting the config key doesn't bring the NIC back 'UP'.
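
The before/after check in the steps above can be scripted. This is only a sketch: `nic_state` is a made-up helper name, not an LXD command, and it assumes the `lxc info` layout shown in this report, where a `State:` line follows the interface name.

```shell
# Hypothetical helper (not part of LXD): read `lxc info` output on stdin
# and print the State of the interface named in $1.
nic_state() {
    awk -v nic="$1" '
        # The interface header is the NIC name followed by a colon, e.g. "eth0:".
        $0 ~ ("^[ \t]*" nic ":$") { found = 1; next }
        # The first State line after that header belongs to the NIC.
        found && $1 == "State:"   { print $2; exit }
    '
}

# Usage, with the remote and instance names from this report:
#   lxc info hdc:mx2 | nic_state eth0
```

Running it before and after `lxc config device set` makes the UP/DOWN transition easy to spot without reading the full `lxc info` output.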

Additional information:

Instance config:

$ lxc config show hdc:mx2
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu jammy amd64 (20230112_07:42)
  image.os: Ubuntu
  image.release: jammy
  image.serial: "20230112_07:42"
  image.type: squashfs
  image.variant: default
  limits.memory: 1GiB
  volatile.base_image: e6d116cfb7844ffcf68dc48049e94f70c96f55cf08bd0b7901184b4481395d4b
  volatile.cloud-init.instance-id: b0e4d78a-5859-40b1-b93f-71c858b37f46
  volatile.eth0.host_name: vethb412afdc
  volatile.eth0.hwaddr: 00:16:3e:17:18:07
  volatile.idmap.base: "589824"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":589824,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":589824,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":589824,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":589824,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 0cdd45c4-2271-4273-a62e-92fdeb6f712d
devices:
  eth0:
    ipv4.routes: 172.24.0.0/16
    ipv4.routes.external: 192.0.2.1/32
    name: eth0
    network: default
    security.acls: gw-hdc
    type: nic
ephemeral: false
profiles:
- default
stateful: false
description: ""

Network definition:

$ lxc network show hdc:default
config: {}
description: Default network (no DHCP because only the /32 is statically assigned)
name: default
type: ovn
used_by:
- /1.0/instances/mx2?project=sdeziel
- /1.0/profiles/default?project=sdeziel
managed: true
status: Created
locations:
- abydos
- langara
- orilla

Profile:

$ lxc profile show hdc:default
config:
  limits.cpu: "1"
  limits.memory: 320MiB
  limits.processes: "500"
  security.devlxd: "false"
  security.idmap.isolated: "true"
  security.nesting: "true"
  security.privileged: "false"
  security.protection.delete: "true"
  security.syscalls.deny_compat: "true"
  snapshots.expiry: 3d
  snapshots.schedule: '@daily, @startup'
description: Hardening
devices:
  eth0:
    name: eth0
    network: default
    type: nic
  root:
    path: /
    pool: ssd
    size: 4GiB
    type: disk
name: default
used_by:
- /1.0/instances/mx2?project=sdeziel

tomponline commented 1 year ago

Is this for a container or a VM?

I just tried a similar thing here and couldn't reproduce it. You're correct that modifying ipv4.routes.* will cause the NIC to be removed and then re-added, as these particular settings are (currently) not considered "live updatable" (this is because the logic is all encapsulated inside the OVN port setup function).

But on re-adding, the guest should detect the new eth0 interface and bring it up; at least it does in my test.

lxc network show ovn1
config:
  bridge.mtu: "1500"
  ipv4.address: 10.29.208.1/24
  ipv4.nat: "true"
  ipv6.address: fd42:5747:891:c3b9::1/64
  ipv6.nat: "true"
  network: lxdbr0
  volatile.network.ipv4.address: 10.21.203.11
  volatile.network.ipv6.address: fd42:ffdb:caff:baf7:216:3eff:febe:538c
description: ""
name: ovn1
type: ovn
used_by: []
managed: true
status: Created

lxc launch images-us:ubuntu/jammy c1 -n ovn1
lxc ls c1
+------+---------+--------------------+----------------------------------------------+-----------+-----------+
| NAME |  STATE  |        IPV4        |                     IPV6                     |   TYPE    | SNAPSHOTS |
+------+---------+--------------------+----------------------------------------------+-----------+-----------+
| c1   | RUNNING | 10.29.208.2 (eth0) | fd42:5747:891:c3b9:216:3eff:fe27:7aaa (eth0) | CONTAINER | 0         |
+------+---------+--------------------+----------------------------------------------+-----------+-----------+

lxc config device set c1 eth0 ipv4.routes=10.0.0.1/32
lxc ls c1
+------+---------+--------------------+----------------------------------------------+-----------+-----------+
| NAME |  STATE  |        IPV4        |                     IPV6                     |   TYPE    | SNAPSHOTS |
+------+---------+--------------------+----------------------------------------------+-----------+-----------+
| c1   | RUNNING | 10.29.208.2 (eth0) | fd42:5747:891:c3b9:216:3eff:fe27:7aaa (eth0) | CONTAINER | 0         |
+------+---------+--------------------+----------------------------------------------+-----------+-----------+

simondeziel commented 1 year ago

> Is this for a container or a VM?

hdc:mx2 is an Ubuntu container running on Stéphane's cluster (the one at Hive DC).

tomponline commented 1 year ago

If there is no DHCP, how is the network interface configured inside the instance?

simondeziel commented 1 year ago

Static IPv4 and SLAAC for IPv6:

$ lxc exec hdc:mx2 -- grep -v '#' /etc/netplan/10-lxc.yaml
network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 45.45.148.177/32
      routes:
        - to: 0.0.0.0/0
          via: 10.121.160.1
          on-link: true

tomponline commented 1 year ago

And if you bring the interface up manually inside the guest does it work again? And is it up on the host side?

simondeziel commented 1 year ago

When I run lxc config device set hdc:mx2 eth0 ipv4.routes 172.24.0.0/16, I see eth0 vanish from the container and pop back, but in the DOWN state. Running lxc exec hdc:mx2 -- ip link set eth0 up then only restores IPv6, thanks to SLAAC.

I have no visibility on the host side of things but Stéphane does.

tomponline commented 1 year ago

OK, I tried this with Alpine connected to a bridge network and changed queue.tx.length, which would similarly cause the device to be removed and re-added, and it exhibited the same behaviour. I think this probably has more to do with the network management of the device inside the instance. Apparently, when using DHCP, it will reactivate the interface when it has been re-added and bring it back up.

tomponline commented 1 year ago

If bringing the device up manually works for SLAAC, that means the actual network connection is running, but the network config isn't being reapplied by the instance's operating system.

tomponline commented 1 year ago

I just tried images:ubuntu/jammy connected to an OVN network with this netplan config:

network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 10.29.208.2/32
      routes:
        - to: 0.0.0.0/0
          via: 10.29.208.1
          on-link: true

And changing the ipv4.routes setting still gets the IP re-applied.

After fresh boot:

networkctl status
●        State: routable                                     
  Online state: online                                       
       Address: 10.29.208.2 on eth0
                fd42:5747:891:c3b9:216:3eff:fe4f:71b8 on eth0
                fe80::216:3eff:fe4f:71b8 on eth0
       Gateway: 10.29.208.1 on eth0
                fe80::216:3eff:febe:538c on eth0

Jan 30 15:18:30 c1 systemd[1]: Starting Network Configuration...
Jan 30 15:18:30 c1 systemd-networkd[84]: Failed to increase receive buffer size for general netlink sock>
Jan 30 15:18:30 c1 systemd-networkd[84]: Failed to increase buffer size for device monitor, ignoring: Op>
Jan 30 15:18:30 c1 systemd-networkd[84]: eth0: Link UP
Jan 30 15:18:30 c1 systemd-networkd[84]: eth0: Gained carrier
Jan 30 15:18:30 c1 systemd-networkd[84]: lo: Link UP
Jan 30 15:18:30 c1 systemd-networkd[84]: lo: Gained carrier
Jan 30 15:18:30 c1 systemd-networkd[84]: Enumeration completed
Jan 30 15:18:30 c1 systemd[1]: Started Network Configuration.
Jan 30 15:18:32 c1 systemd-networkd[84]: eth0: Gained IPv6LL

Then:

lxc config device set c1 eth0 ipv4.routes=10.0.0.1/32

After:

lxc shell c1
root@c1:~# networkctl status
●        State: routable                                     
  Online state: online                                       
       Address: 10.29.208.2 on eth0
                fd42:5747:891:c3b9:216:3eff:fe4f:71b8 on eth0
                fe80::216:3eff:fe4f:71b8 on eth0
       Gateway: 10.29.208.1 on eth0
                fe80::216:3eff:febe:538c on eth0

Jan 30 15:18:30 c1 systemd-networkd[84]: Enumeration completed
Jan 30 15:18:30 c1 systemd[1]: Started Network Configuration.
Jan 30 15:18:32 c1 systemd-networkd[84]: eth0: Gained IPv6LL
Jan 30 15:19:05 c1 systemd-networkd[84]: eth0: Link DOWN
Jan 30 15:19:05 c1 systemd-networkd[84]: eth0: Lost carrier
Jan 30 15:19:05 c1 systemd-networkd[84]: eth0: DHCPv6 lease lost
Jan 30 15:19:05 c1 systemd-networkd[84]: veth6ba4063e: Interface name change detected, renamed to eth0.
Jan 30 15:19:05 c1 systemd-networkd[84]: eth0: Link UP
Jan 30 15:19:05 c1 systemd-networkd[84]: eth0: Gained carrier
Jan 30 15:19:07 c1 systemd-networkd[84]: eth0: Gained IPv6LL

So I cannot reproduce the issue with netplan.
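
In the journal above, `eth0: Link UP` is logged right after the rename, which is the sign that the guest reactivated the re-added NIC. A rough way to check a journal for that pattern is sketched below. `link_up_after_rename` is a made-up helper name, and matching on these exact systemd-networkd message strings is an assumption based only on the log lines quoted in this thread.

```shell
# Hypothetical helper: read systemd-networkd journal lines on stdin and
# succeed (exit 0) only if the interface named in $1 logged "Link UP"
# after its most recent "renamed to <nic>." message.
link_up_after_rename() {
    awk -v nic="$1" '
        # A rename marks the NIC being re-added; forget any earlier Link UP.
        index($0, "renamed to " nic ".")      { renamed = 1; up = 0; next }
        # Only a Link UP seen after the rename counts.
        renamed && index($0, nic ": Link UP") { up = 1 }
        END { exit ((renamed && up) ? 0 : 1) }
    '
}

# Usage inside the instance:
#   journalctl -u systemd-networkd --no-pager | link_up_after_rename eth0 \
#       && echo "eth0 came back up" || echo "eth0 was not brought back up"
```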

simondeziel commented 1 year ago

I always remove the networkd-dispatcher package, maybe it does something?

Here is the journalctl output I got after a fresh reboot:

root@mx2:~# journalctl -fu systemd-networkd
Jan 30 15:24:46 mx2 systemd-networkd[67]: lo: Gained carrier
Jan 30 15:24:46 mx2 systemd-networkd[67]: Enumeration completed
Jan 30 15:24:46 mx2 systemd[1]: Started Network Configuration.
Jan 30 15:24:46 mx2 systemd-networkd[67]: eth0: Gained IPv6LL
Jan 30 15:24:48 mx2 systemd-networkd[67]: wg-home: Link UP
Jan 30 15:24:48 mx2 systemd-networkd[67]: wg-home: Gained carrier

# $ lxc config device unset hdc:mx2 eth0 ipv4.routes

Jan 30 15:26:22 mx2 systemd-networkd[67]: eth0: Link DOWN
Jan 30 15:26:22 mx2 systemd-networkd[67]: eth0: Lost carrier
Jan 30 15:26:22 mx2 systemd-networkd[67]: eth0: DHCPv6 lease lost
Jan 30 15:26:24 mx2 systemd-networkd[67]: veth87dd6c1c: Interface name change detected, renamed to eth0.

# $ lxc exec hdc:mx2 -- ip link set eth0 up

Jan 30 15:27:41 mx2 systemd-networkd[67]: eth0: Link UP
Jan 30 15:27:41 mx2 systemd-networkd[67]: eth0: Gained carrier
Jan 30 15:27:42 mx2 systemd-networkd[67]: eth0: Gained IPv6LL

Then I only have the IPv6:

root@mx2:~# ip a show dev eth0
80: eth0@if81: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:17:18:07 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 2602:fc62:ff:1000:5e08:c4c:6809:30a/64 scope global temporary dynamic 
       valid_lft 604628sec preferred_lft 85978sec
    inet6 2602:fc62:ff:1000:216:3eff:fe17:1807/64 scope global dynamic mngtmpaddr 
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fe17:1807/64 scope link 
       valid_lft forever preferred_lft forever

simondeziel commented 1 year ago

Also, running netplan apply fixes it for me, of course.

tomponline commented 1 year ago

Can you test it with a vanilla container without modification, to check if it's something that has been added/removed/changed in your case?

simondeziel commented 1 year ago

Just created c1 (a fresh images:ubuntu/jammy container) and isolated the problem to systemd-udevd being disabled in the container:

root@c1:~# systemctl mask --now systemd-udevd.service
Created symlink /etc/systemd/system/systemd-udevd.service → /dev/null.

Since the NIC is yanked out and plugged back in, it sounds legitimate to need udevd to react to the change. I don't understand why the NIC shows as DOWN once plugged back in though. Probably not LXD's fault but my own so closing.

Thanks Tom and sorry for the noise.
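
For anyone hitting the same symptom, a quick check for a masked udevd might look like the sketch below. `udevd_hint` is a made-up helper name and the hint text is only a summary of the finding in this thread, not an LXD or systemd message.

```shell
# Hypothetical helper: turn the output of `systemctl is-enabled` for
# systemd-udevd into a short hint.
udevd_hint() {
    case "$1" in
        masked|masked-runtime)
            echo "systemd-udevd is masked; re-added NICs may not be brought back up" ;;
        *)
            echo "systemd-udevd is not masked ($1)" ;;
    esac
}

# Usage inside the container:
#   udevd_hint "$(systemctl is-enabled systemd-udevd.service)"
# To undo the mask applied above:
#   systemctl unmask systemd-udevd.service && systemctl start systemd-udevd.service
```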

tomponline commented 1 year ago

Ah, glad you found the issue.