flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

[Maybe Bug] possible link failure after switch reboot #659

Closed schmitch closed 2 years ago

schmitch commented 2 years ago

Description

We currently run three Dell 7525 servers (AMD EPYC 7313) with Mellanox ConnectX-5 EN 25GbE dual-port SFP28 adapters, plus some Meraki switches. This is a test cluster that should go live in a month or so, inside a much more volatile network. We use Cilium with hardware offloading (DSR, native acceleration), kube-proxy disabled, and kube-vip on k3s 1.23.4. It is a bare-metal install with static IP addresses, running the latest stable version: Flatcar Container Linux by Kinvolk stable 3033.2.2

Last night we had scheduled maintenance on our switch, and afterwards network connectivity was down on all NICs (we run on a single port). The ports were up according to ip addr, but even pinging the host's own IP address returned 100% packet loss (we could still connect over iDRAC, Dell's management engine).

We have no idea why this happened: I did not know about the switch reboot, so I rebooted the nodes before inspecting them. I found it extremely weird that I could not ping the machine's own static IP address, which has basically never happened to me before.

Impact

no connectivity

Expected behavior

get connectivity back after a short while

Additional information

we will retry the switch reboot experiment today

I have a systemd log but in the log it only says:

Mar 03 06:26:13 rancher1 kernel: mlx5_core 0000:63:00.0 eno12399np0: Link down
Mar 03 06:26:13 rancher1 systemd-networkd[1496]: eno12399np0: Lost carrier
Mar 03 06:26:13 rancher1 systemd-timesyncd[1612]: No network connectivity, watching for changes.
Mar 03 06:27:40 rancher1 systemd-networkd[1496]: lxc_health: Link DOWN
Mar 03 06:27:40 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier
Mar 03 06:27:40 rancher1 systemd-udevd[927123]: Using default interface naming scheme 'v249'.
Mar 03 06:27:40 rancher1 systemd-udevd[927124]: Using default interface naming scheme 'v249'.
Mar 03 06:27:40 rancher1 systemd-networkd[1496]: lxc_health: Link UP
Mar 03 06:27:40 rancher1 systemd-networkd[1496]: lxc_health: Gained carrier
Mar 03 06:27:40 rancher1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Mar 03 06:27:41 rancher1 systemd-networkd[1496]: lxc_health: Gained IPv6LL
Mar 03 06:27:45 rancher1 kernel: mlx5_core 0000:63:00.0 eno12399np0: Link up
Mar 03 06:27:45 rancher1 systemd-networkd[1496]: eno12399np0: Gained carrier
Mar 03 06:27:45 rancher1 systemd-timesyncd[1612]: Network configuration changed, trying to establish connection.
Mar 03 06:28:06 rancher1 systemd-resolved[1604]: Using degraded feature set UDP instead of UDP+EDNS0 for DNS server 1.1.1.1.
Mar 03 06:28:11 rancher1 systemd-resolved[1604]: Using degraded feature set UDP instead of UDP+EDNS0 for DNS server 8.8.4.4.
Mar 03 06:28:12 rancher1 systemd-networkd[1496]: lxc_health: Link DOWN
Mar 03 06:28:12 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier
Mar 03 06:28:12 rancher1 systemd-udevd[927500]: Using default interface naming scheme 'v249'.
Mar 03 06:28:12 rancher1 systemd-networkd[1496]: lxc_health: Link UP
Mar 03 06:28:12 rancher1 systemd-networkd[1496]: lxc_health: Gained carrier
Mar 03 06:28:12 rancher1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Mar 03 06:28:13 rancher1 systemd-networkd[1496]: lxc_health: Gained IPv6LL
Mar 03 06:28:27 rancher1 systemd-resolved[1604]: Using degraded feature set TCP instead of UDP for DNS server 1.1.1.1.
Mar 03 06:28:37 rancher1 systemd-resolved[1604]: Using degraded feature set TCP instead of UDP for DNS server 8.8.4.4.
Mar 03 06:28:45 rancher1 systemd-networkd[1496]: lxc_health: Link DOWN
Mar 03 06:28:45 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier
Mar 03 06:28:45 rancher1 systemd-udevd[927850]: Using default interface naming scheme 'v249'.
Mar 03 06:28:45 rancher1 systemd-networkd[1496]: lxc_health: Link UP
Mar 03 06:28:45 rancher1 systemd-udevd[927849]: Using default interface naming scheme 'v249'.
Mar 03 06:28:45 rancher1 systemd-networkd[1496]: lxc_health: Gained carrier
Mar 03 06:28:45 rancher1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Mar 03 06:28:47 rancher1 systemd-networkd[1496]: lxc_health: Gained IPv6LL
Mar 03 06:29:19 rancher1 systemd-networkd[1496]: lxc_health: Link DOWN
Mar 03 06:29:19 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier
Mar 03 06:29:19 rancher1 systemd-networkd[1496]: lxc_health: Link UP
Mar 03 06:29:19 rancher1 systemd-udevd[928429]: cilium: Could not set offload features, ignoring: No such device
Mar 03 06:29:19 rancher1 systemd-udevd[928429]: Using default interface naming scheme 'v249'.
Mar 03 06:29:19 rancher1 systemd-udevd[928428]: Using default interface naming scheme 'v249'.
Mar 03 06:29:19 rancher1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Mar 03 06:29:19 rancher1 systemd-networkd[1496]: lxc_health: Gained carrier
Mar 03 06:29:20 rancher1 systemd-networkd[1496]: lxc_health: Gained IPv6LL
Mar 03 06:29:29 rancher1 systemd-resolved[1604]: Using degraded feature set UDP instead of TCP for DNS server 1.1.1.1.
Mar 03 06:29:34 rancher1 systemd-resolved[1604]: Using degraded feature set UDP instead of TCP for DNS server 8.8.4.4.
Mar 03 06:29:54 rancher1 systemd-networkd[1496]: lxc_health: Link DOWN
Mar 03 06:29:54 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier
Mar 03 06:29:54 rancher1 systemd-udevd[928717]: Using default interface naming scheme 'v249'.
Mar 03 06:29:54 rancher1 systemd-networkd[1496]: lxc_health: Link UP
Mar 03 06:29:54 rancher1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Mar 03 06:29:54 rancher1 systemd-networkd[1496]: lxc_health: Gained carrier
Mar 03 06:29:56 rancher1 systemd-networkd[1496]: lxc_health: Gained IPv6LL
Mar 03 06:30:30 rancher1 systemd-networkd[1496]: lxc_health: Link DOWN
Mar 03 06:30:30 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier
Mar 03 06:30:30 rancher1 systemd-timesyncd[1612]: Network configuration changed, trying to establish connection.
Mar 03 06:30:30 rancher1 systemd-udevd[929213]: Using default interface naming scheme 'v249'.
Mar 03 06:30:30 rancher1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Mar 03 06:30:30 rancher1 systemd-networkd[1496]: lxc_health: Link UP
Mar 03 06:30:30 rancher1 systemd-networkd[1496]: lxc_health: Gained carrier
Mar 03 06:30:32 rancher1 systemd-networkd[1496]: lxc_health: Gained IPv6LL
Mar 03 06:30:35 rancher1 systemd-resolved[1604]: Using degraded feature set TCP instead of UDP for DNS server 1.1.1.1.
Mar 03 06:30:46 rancher1 systemd-resolved[1604]: Using degraded feature set TCP instead of UDP for DNS server 8.8.4.4.
Mar 03 06:30:49 rancher1 env[1790]: time="2022-03-03T06:30:49.325354552Z" level=error msg="failed to reload cni configuration after receiving fs change event(\"/etc/cni/net.d/05-cilium.conf\": REMOVE)" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Mar 03 06:30:49 rancher1 systemd-networkd[1496]: lxc_health: Link DOWN
Mar 03 06:30:49 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier

After that, the link stayed down, although ip addr reported otherwise. update_engine ran afterwards and failed to gain any connectivity, as did systemd-resolved, and systemd-networkd was basically silent after the message at 06:30 UTC.

The switch reboot took about 2-3 minutes.

It is also really weird that the link went up and down repeatedly within such a short window.
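
To see which interface was actually flapping, the carrier messages in a log like the one above can be grouped per interface. The following is only a minimal sketch: the function names `carrier_events` and `final_state` are made up for illustration, and the regular expression assumes the short journal format shown in the excerpt. In that excerpt, eno12399np0 regains carrier at 06:27:45 and does not lose it again, while lxc_health keeps cycling and ends on "Lost carrier".

```python
import re
from collections import defaultdict

# Matches systemd-networkd carrier messages in the short journal format, e.g.
# "Mar 03 06:26:13 rancher1 systemd-networkd[1496]: eno12399np0: Lost carrier"
CARRIER_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ systemd-networkd\[\d+\]: "
    r"(?P<iface>\S+): (?P<event>Lost carrier|Gained carrier)$"
)


def carrier_events(lines):
    """Return {interface: [(timestamp, event), ...]} in log order."""
    events = defaultdict(list)
    for line in lines:
        match = CARRIER_RE.match(line.strip())
        if match:
            events[match["iface"]].append((match["ts"], match["event"]))
    return dict(events)


def final_state(events):
    """Return {interface: last carrier event seen for that interface}."""
    return {iface: seq[-1][1] for iface, seq in events.items()}


# Three lines copied from the log excerpt above.
sample = [
    "Mar 03 06:26:13 rancher1 systemd-networkd[1496]: eno12399np0: Lost carrier",
    "Mar 03 06:27:45 rancher1 systemd-networkd[1496]: eno12399np0: Gained carrier",
    "Mar 03 06:30:49 rancher1 systemd-networkd[1496]: lxc_health: Lost carrier",
]

if __name__ == "__main__":
    for iface, state in final_state(carrier_events(sample)).items():
        print(f"{iface}: {state}")
```

The same functions could be fed the lines of `journalctl -u systemd-networkd --no-pager` instead of the hard-coded sample.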


I'm not sure whether I really hit a bug or just had bad luck, but maybe somebody else can give some insight into whether this can be easily debugged and help me find the culprit? pstore, by the way, is empty.

pothos commented 2 years ago

Hello, please see https://github.com/flatcar-linux/Flatcar/issues/620 for the workaround - the upcoming Flatcar releases will have this set by default.

schmitch commented 2 years ago

@pothos thank you! I rebooted the switch to see if the issue is reproducible (it is). I will reboot it once more tomorrow and report whether the fix worked.

pothos commented 2 years ago

Maybe you can also reproduce it with something like networkctl down eth0 ; networkctl up eth0 (at least that worked in a simple qemu VM). I expect it to be the same issue and will close now. Please reopen if the workaround doesn't help.
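
The down/up reproduction suggested above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `flap_and_check`, the pause length, and the follow-up ping check are illustrative and not part of the original suggestion, and the script needs root on a host where systemd-networkd manages the interface.

```python
import subprocess
import time


def flap_and_check(iface, target, pause=5.0, run=subprocess.run, sleep=time.sleep):
    """Take iface down and up again with networkctl, then ping target.

    Returns True if the ping succeeds after the link comes back.
    `run` and `sleep` are injectable only so the function can be
    exercised without touching a real interface.
    """
    run(["networkctl", "down", iface], check=True)
    sleep(pause)  # assumed pause to mimic a short outage
    run(["networkctl", "up", iface], check=True)
    sleep(pause)  # give the link a moment to settle
    # -c 3: send three echo requests, -W 2: wait up to two seconds per reply
    result = run(["ping", "-c", "3", "-W", "2", target], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # eth0 is the interface name from the comment above; 192.0.2.10 is a
    # placeholder address and should be replaced with the host's static IP.
    ok = flap_and_check("eth0", "192.0.2.10")
    print("connectivity restored" if ok else "still no connectivity")
```

On an affected host, the expectation from the report is that the ping keeps failing after the interface is back up.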

schmitch commented 2 years ago

By the way, the fix worked flawlessly! Thank you.