canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd

networking fails to start after lxd upgrade #6733

Closed: grawlinson closed this issue 4 years ago

grawlinson commented 4 years ago

# Required information


# Issue description

After upgrading from LXD 3.18 to 3.19, there is no networking inside any of the containers. The kernel was also upgraded from 5.4.2 to 5.4.13. Rebooting does not solve the issue, and downgrading is not possible because the LXD database schema no longer matches.

# Information to attach

 - [x] Output of `journalctl -eu lxd`

```
Jan 19 17:24:59 r620 systemd[1]: Stopping REST API, command line tool and OpenStack integration plugin for LXC....
Jan 19 17:24:59 r620 systemd[1]: lxd.service: Succeeded.
Jan 19 17:24:59 r620 systemd[1]: Stopped REST API, command line tool and OpenStack integration plugin for LXC..
Jan 19 17:24:59 r620 systemd[1]: Started REST API, command line tool and OpenStack integration plugin for LXC..
Jan 19 17:24:59 r620 lxd[3208709]: t=2020-01-19T17:24:59+1300 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored"
Jan 19 17:26:56 r620 lxd[3208709]: t=2020-01-19T17:26:56+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:26:56 r620 lxd[3208709]: t=2020-01-19T17:26:56+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:26:56 r620 lxd[3208709]: t=2020-01-19T17:26:56+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:27:06 r620 lxd[3208709]: t=2020-01-19T17:27:06+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:30:11 r620 lxd[3208709]: t=2020-01-19T17:30:11+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:12 r620 systemd[1]: Stopping LXD - main daemon...
Jan 19 17:34:12 r620 lxd[3208709]: t=2020-01-19T17:34:12+1300 lvl=warn msg="Detected poll(POLLNVAL) event."
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:14 r620 lxd[3208709]: t=2020-01-19T17:34:14+1300 lvl=eror msg="Failed to remove nic filters: Failed to remove network filters for eth0: Failed to run: ebtables --concurrent -L --Lmac2 --Lx: "
Jan 19 17:34:18 r620 systemd[1]: lxd.service: Succeeded.
Jan 19 17:34:18 r620 systemd[1]: Stopped LXD - main daemon.
Jan 19 17:34:18 r620 systemd[1]: Starting LXD - main daemon...
Jan 19 17:34:18 r620 lxd[3234399]: t=2020-01-19T17:34:18+1300 lvl=warn msg="Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored."
Jan 19 17:34:19 r620 lxd[3234399]: t=2020-01-19T17:34:19+1300 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: schema version '23' is more recent than expected '18'"
Jan 19 17:34:19 r620 lxd[3234399]: Error: failed to open cluster database: failed to ensure schema: schema version '23' is more recent than expected '18'
Jan 19 17:34:19 r620 systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
Jan 19 17:44:18 r620 lxd[3234400]: Error: LXD still not running after 600s timeout (Get http://unix.socket/1.0: EOF)
Jan 19 17:44:18 r620 systemd[1]: lxd.service: Control process exited, code=exited, status=1/FAILURE
Jan 19 17:44:18 r620 systemd[1]: lxd.service: Failed with result 'exit-code'.
Jan 19 17:44:18 r620 systemd[1]: Failed to start LXD - main daemon.
Jan 19 17:44:18 r620 systemd[1]: lxd.service: Scheduled restart job, restart counter is at 1.
```
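For what it's worth, the failing `ebtables` invocation above can be run by hand on the host to see the underlying error (a hypothetical check, using the exact command from the journal):

```sh
# Hypothetical check: confirm the binary LXD shells out to exists, then run
# the same invocation that the journal shows failing.
command -v ebtables || echo "ebtables not found in PATH"
ebtables --concurrent -L --Lmac2 --Lx
```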


 - [ ] Any relevant kernel output (`dmesg`)
 - [x] Container log (`lxc info NAME --show-log`)

```
$ lxc info ns1 --show-log
Name: ns1
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/01/19 06:18 UTC
Status: Running
Type: container
Profiles: vlan40, limit
Pid: 26812
Ips:
  eth0: inet6  fe80::220:91ff:fe00:4004  vethe68e899c
  lo:   inet   127.0.0.1
  lo:   inet6  ::1
Resources:
  Processes: 93
  Disk usage:
    root: 198.87MB
  CPU usage:
    CPU usage (in seconds): 3
  Memory usage:
    Memory (current): 133.46MB
  Network usage:
    eth0:
      Bytes received: 8.48kB
      Bytes sent: 1.29kB
      Packets received: 126
      Packets sent: 16
    lo:
      Bytes received: 78B
      Bytes sent: 78B
      Packets received: 2
      Packets sent: 2
```

Log:

```
lxc ns1 20200120083531.117 WARN cgfsng - cgroups/cgfsng.c:chowmod:1527 - No such file or directory - Failed to chown(/sys/fs/cgroup/unified//lxc.payload/ns1/memory.oom.group, 65536, 0)
```

 - [x] Container configuration (`lxc config show NAME --expanded`)

```
$ lxc config show ns1 --expanded
architecture: x86_64
config:
  image.architecture: x86_64
  image.description: base deploy image
  image.name: archlinux-2019.12.01-x86_64-default-20191210_0310
  image.os: archlinux
  image.release: 2019.12.01
  image.serial: "20191210_0310"
  image.variant: default
  limits.cpu: "1"
  limits.memory: 512MB
  volatile.base_image: cb4fe4d26007ee3aa9532b478202d00c120b08d8b9ee93c2ef09d14fc4dc1d15
  volatile.eth0.host_name: vethe68e899c
  volatile.eth0.hwaddr: 00:20:91:00:40:04
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br-vlan40
    type: nic
  root:
    path: /
    pool: zfs-zetta
    type: disk
ephemeral: false
profiles:
```
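Since `eth0` is a bridged NIC with parent `br-vlan40`, a quick hypothetical sanity check of that wiring from the host side looks like this (device and interface names taken from the config above):

```sh
# Hypothetical host-side checks of the bridged NIC shown in the config above.
lxc config device show ns1              # eth0 should list parent: br-vlan40
ip link show br-vlan40                  # the host bridge should be UP
bridge link show | grep vethe68e899c    # the container's veth should be attached to the bridge
```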

```
location: none metadata: context: ip: '@' method: GET url: /1.0 user: "" level: dbug message: Handling timestamp: "2020-01-20T21:45:34.215238556+13:00" type: logging

location: none metadata: context: ip: '@' method: GET url: /1.0/events user: "" level: dbug message: Handling timestamp: "2020-01-20T21:45:34.220079873+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'New event listener: 0b094a4e-9990-463b-8508-0db1039e5e53' timestamp: "2020-01-20T21:45:34.220485048+13:00" type: logging

location: none metadata: context: ip: '@' method: PUT url: /1.0/instances/ns1/state user: "" level: dbug message: Handling timestamp: "2020-01-20T21:45:34.221016701+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'New task Operation: 4bf7b3d0-ba9d-4a2f-871f-bfc2826a1cae' timestamp: "2020-01-20T21:45:34.230718469+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Started task operation: 4bf7b3d0-ba9d-4a2f-871f-bfc2826a1cae' timestamp: "2020-01-20T21:45:34.231185402+13:00" type: logging

location: none metadata: class: task created_at: "2020-01-20T21:45:34.226029519+13:00" description: Restarting container err: "" id: 4bf7b3d0-ba9d-4a2f-871f-bfc2826a1cae location: none may_cancel: false metadata: null resources: containers:

location: none metadata: class: task created_at: "2020-01-20T21:45:34.226029519+13:00" description: Restarting container err: "" id: 4bf7b3d0-ba9d-4a2f-871f-bfc2826a1cae location: none may_cancel: false metadata: null resources: containers:

location: none metadata: context: action: shutdown created: 2020-01-19 19:18:59.929131886 +1300 NZDT ephemeral: "false" name: ns1 project: default timeout: -1s used: 2020-01-20 21:43:57.825480063 +1300 NZDT level: info message: Shutting down container timestamp: "2020-01-20T21:45:34.232054975+13:00" type: logging

location: none metadata: context: ip: '@' method: GET url: /1.0/operations/4bf7b3d0-ba9d-4a2f-871f-bfc2826a1cae user: "" level: dbug message: Handling timestamp: "2020-01-20T21:45:34.233865875+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Scheduler: network: eth0 has been added: updating network priorities' timestamp: "2020-01-20T21:45:34.729395819+13:00" type: logging

location: none metadata: context: ip: '@' method: GET url: /internal/containers/59/onstopns?target=stop&netns=/proc/33113/fd/4 user: "" level: dbug message: Handling timestamp: "2020-01-20T21:45:34.789692489+13:00" type: logging

location: none metadata: context: ip: '@' method: GET url: /internal/containers/59/onstop?target=stop user: "" level: dbug message: Handling timestamp: "2020-01-20T21:45:35.581790144+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: UnmountInstance started timestamp: "2020-01-20T21:45:35.589049795+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: UnmountInstance finished timestamp: "2020-01-20T21:45:36.701751617+13:00" type: logging

location: none metadata: context: action: shutdown created: 2020-01-19 19:18:59.929131886 +1300 NZDT ephemeral: "false" name: ns1 project: default timeout: -1s used: 2020-01-20 21:43:57.825480063 +1300 NZDT level: info message: Shut down container timestamp: "2020-01-20T21:45:36.928250564+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Scheduler: container ns1 stopped: re-balancing' timestamp: "2020-01-20T21:45:36.928296237+13:00" type: logging

location: none metadata: action: container-shutdown source: /1.0/containers/ns1 timestamp: "2020-01-20T21:45:36.928258395+13:00" type: lifecycle

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: MountInstance started timestamp: "2020-01-20T21:45:36.950761824+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Scheduler: network: vethea2222ee has been added: updating network priorities' timestamp: "2020-01-20T21:45:36.983636689+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: MountInstance finished timestamp: "2020-01-20T21:45:36.98584044+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: UpdateInstanceBackupFile started timestamp: "2020-01-20T21:45:36.986379696+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Scheduler: network: veth17fa5ab6 has been added: updating network priorities' timestamp: "2020-01-20T21:45:36.995106605+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: UpdateInstanceBackupFile finished timestamp: "2020-01-20T21:45:36.995416367+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: MountInstance started timestamp: "2020-01-20T21:45:36.99566445+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: MountInstance finished timestamp: "2020-01-20T21:45:36.999972461+13:00" type: logging

location: none metadata: context: action: start created: 2020-01-19 19:18:59.929131886 +1300 NZDT ephemeral: "false" name: ns1 project: default stateful: "false" used: 2020-01-20 21:43:57.825480063 +1300 NZDT level: info message: Starting container timestamp: "2020-01-20T21:45:37.000007677+13:00" type: logging

location: none metadata: context: ip: '@' method: GET url: /internal/containers/59/onstart user: "" level: dbug message: Handling timestamp: "2020-01-20T21:45:37.04782919+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: MountInstance started timestamp: "2020-01-20T21:45:37.051869394+13:00" type: logging

location: none metadata: context: driver: zfs instance: ns1 pool: zfs-zetta project: default level: dbug message: MountInstance finished timestamp: "2020-01-20T21:45:37.053302694+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Scheduler: container ns1 started: re-balancing' timestamp: "2020-01-20T21:45:37.063202693+13:00" type: logging

location: none metadata: action: container-started source: /1.0/containers/ns1 timestamp: "2020-01-20T21:45:37.329469254+13:00" type: lifecycle

location: none metadata: context: action: start created: 2020-01-19 19:18:59.929131886 +1300 NZDT ephemeral: "false" name: ns1 project: default stateful: "false" used: 2020-01-20 21:43:57.825480063 +1300 NZDT level: info message: Started container timestamp: "2020-01-20T21:45:37.329448597+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Success for task operation: 4bf7b3d0-ba9d-4a2f-871f-bfc2826a1cae' timestamp: "2020-01-20T21:45:37.329499518+13:00" type: logging

location: none metadata: class: task created_at: "2020-01-20T21:45:34.226029519+13:00" description: Restarting container err: "" id: 4bf7b3d0-ba9d-4a2f-871f-bfc2826a1cae location: none may_cancel: false metadata: null resources: containers:

location: none metadata: context: {} level: dbug message: 'Event listener finished: 0b094a4e-9990-463b-8508-0db1039e5e53' timestamp: "2020-01-20T21:45:37.331990754+13:00" type: logging

location: none metadata: context: {} level: dbug message: 'Disconnected event listener: 0b094a4e-9990-463b-8508-0db1039e5e53' timestamp: "2020-01-20T21:45:37.332517774+13:00" type: logging
```



I would appreciate any assistance, as my entire infrastructure is down and I'm extremely sleep-deprived, so I'm probably missing something obvious.
tomponline commented 4 years ago

@grawlinson can you confirm you have the ebtables command installed?
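On an Arch host that check is roughly (hypothetical commands):

```sh
# Hypothetical check on an Arch host: is the ebtables package installed and
# is the command on the daemon's PATH?
pacman -Qi ebtables
command -v ebtables
```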

stgraber commented 4 years ago

You'll want to confirm some of the basics:
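For a bridged setup like this, those basics typically include checks along these lines (a hypothetical sketch, not an exhaustive list):

```sh
# Hypothetical sketch of the usual host-side basics for bridged networking.
pgrep -a dnsmasq              # is dnsmasq running (if LXD manages the bridge)?
iptables -L -n -v             # are FORWARD/OUTPUT rules dropping container traffic?
ebtables --concurrent -L      # does ebtables itself work on the host?
sysctl net.ipv4.ip_forward    # is IP forwarding enabled?
```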

stgraber commented 4 years ago

Also, how is that LXD installed exactly? The journal output suggests that you upgraded to a new LXD and then downgraded it, which LXD detected and refused to start.

grawlinson commented 4 years ago

I didn't need ebtables installed with LXD 3.18; I only installed it with 3.19 when it showed up in the logs.

I don't have dnsmasq running, since my containers either have static IPs or obtain addresses via the router's DHCP. It has worked this way since LXD 3.14, and I only ran into this particular issue when upgrading to 3.19.

```
Chain FORWARD (policy ACCEPT 19 packets, 1434 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 7592 packets, 3180K bytes)
 pkts bytes target     prot opt in     out     source               destination
```


LXD is installed using [this](https://github.com/Foxboron/lxd-repo/) repository. I upgraded from 3.18 to 3.19, went back to 3.18, and am now running 3.19 again.
grawlinson commented 4 years ago

I may have found a cause. Restarting a container results in this showing up in the logs. I'll do some more poking around.

```
Jan 21 08:19:51 r620 systemd-networkd[1172]: veth52060197: rtnl: received neighbor message with invalid family, ignoring.
Jan 21 08:19:51 r620 systemd-networkd[1172]: veth52060197: rtnl: received neighbor message with invalid family, ignoring.
```
stgraber commented 4 years ago

Oh, hold on, did you also upgrade your containers? And are those running ArchLinux too?

If so, and you're also using networkd in those containers, then you just got hit by the systemd point-release regression in networkd.
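A hypothetical way to confirm that from the host, assuming the container can be reached with `lxc exec`:

```sh
# Hypothetical checks inside the container: which systemd is running, and is
# networkd the service managing eth0 there?
lxc exec ns1 -- systemctl --version
lxc exec ns1 -- networkctl status eth0
lxc exec ns1 -- journalctl -u systemd-networkd -n 20
```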

stgraber commented 4 years ago

If that's the case, then your way out of this issue is to apply the same fix we have in our current images: https://github.com/lxc/lxc-ci/blob/master/images/archlinux.yaml#L470
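Roughly, the workaround there amounts to a systemd-networkd drop-in that makes /sys read-only for networkd inside the container; the sketch below is illustrative (the drop-in file name is arbitrary), so check the linked archlinux.yaml for the exact content:

```sh
# Sketch of the workaround, run inside the affected container (verify against
# the linked archlinux.yaml; the drop-in file name here is illustrative).
mkdir -p /etc/systemd/system/systemd-networkd.service.d
cat > /etc/systemd/system/systemd-networkd.service.d/lxc.conf <<'EOF'
[Service]
BindReadOnlyPaths=/sys
EOF
systemctl daemon-reload
systemctl restart systemd-networkd
```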

grawlinson commented 4 years ago

Yup, that was the problem. Do you have a reference/article regarding the regression that I could peruse?

I really appreciate the help, @stgraber & @tomponline. I think that's two beverages I owe you now. :)

stgraber commented 4 years ago

https://github.com/systemd/systemd-stable/commit/96d7083c5499b264ecebd6a30a92e0e8fda14cd5

stgraber commented 4 years ago

This change is what broke things. A proper fix will need a kernel change; until then, distros either need to revert this commit or the workaround needs to be put in place.