Closed (schmitch closed this issue 2 years ago)
Hello, please see https://github.com/flatcar-linux/Flatcar/issues/620 for the workaround - the upcoming Flatcar releases will have this set by default.
@pothos thank you! I rebooted the switch to see if the issue is reproducible (it is). I will reboot it once more tomorrow and report whether the fix worked.
Maybe you can also reproduce it with something like networkctl down eth0 ; networkctl up eth0 (at least that worked in a simple QEMU VM). I expect it to be the same issue and will close now. Please reopen if the workaround doesn't help.
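A slightly longer sketch of that reproduction, in case it is useful for testing: it takes the link down, waits, brings it back up, and then lists the IPv4 addresses on the interface. The variables IFACE, DOWN_SECS and RUN and the helper flap_plan are hypothetical names made up for this example, and eth0 is only a placeholder. By default the script just prints the commands; it assumes root and a systemd-networkd managed interface when run for real.

```shell
#!/bin/sh
# Sketch: flap a link roughly the way a switch reboot would, then look at
# the addresses that remain. IFACE, DOWN_SECS and RUN are hypothetical knobs
# introduced for this example; eth0 is only a placeholder interface name.
IFACE="${IFACE:-eth0}"
DOWN_SECS="${DOWN_SECS:-5}"

# Print the commands that make up one flap cycle, one per line.
flap_plan() {
    echo "networkctl down $1"
    echo "sleep $2"
    echo "networkctl up $1"
    echo "ip -4 addr show dev $1"
}

# By default only print the plan; set RUN=1 (as root) to execute it.
if [ "${RUN:-0}" = "1" ]; then
    flap_plan "$IFACE" "$DOWN_SECS" | while read -r cmd; do
        echo "+ $cmd"
        sh -c "$cmd"
    done
else
    flap_plan "$IFACE" "$DOWN_SECS"
fi
```

For example, RUN=1 IFACE=eth0 DOWN_SECS=10 sh flap.sh would run one cycle with a 10 second outage (flap.sh being whatever name the script is saved under). Doing this over the network interface you are logged in through will cut your session, so a console such as iDRAC is the safer place to try it.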
btw. the fix worked flawlessly! thank you
Description
We currently run 3x Dell 7525 servers (AMD EPYC 7313) with Mellanox ConnectX-5 EN 25GbE dual-port SFP28 adapters, plus some Meraki switches. This is a test cluster that should go live in a month or so, inside a much more volatile network. We use Cilium with hardware offloading (DSR, native acceleration), kube-proxy disabled, and kube-vip on k3s 1.23.4. This is a bare-metal install with static IP addresses, and of course the latest stable version:
Flatcar Container Linux by Kinvolk stable 3033.2.2
Last night we had scheduled maintenance on our switch, and after that the network connectivity was down on all NICs (we run on a single port). The ports were online according to ip addr, however even ping on the same IP address that the host uses returned 100% packet loss (we could connect over iDRAC, Dell's management engine). We have no idea why that happened, because I rebooted the nodes before inspecting them, since I did not know about the switch reboot. I found it extremely weird that I could not ping the machine's own static IP address, which has basically never happened to me before.
Impact
no connectivity
Expected behavior
get connectivity back after a short while
Additional information
we will retry the switch reboot experiment today
I have a systemd log but in the log it only says:
After that, the link stays down, although ip addr reported otherwise. update_engine did run afterwards and failed to gain any connectivity, the same with systemd-resolved, and systemd-networkd was basically silent after the message at 06:30 UTC. The switch reboot took about 2-3 minutes, and it's really weird that the link went up and down in such a short window.
I'm not sure if I really hit a bug or just had bad luck, but maybe somebody else can give some insight into whether this can be easily debugged and help me find the culprit?
btw. pstore is empty.
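For anyone debugging a similar state, one thing that might be worth ruling out before rebooting is whether the interface still carries its IPv4 address while the link shows as up. The following is only a sketch of such a snapshot, not a confirmed diagnosis of this issue. IFACE, RUN, has_ipv4 and snapshot are hypothetical names for this example, and eth0 is a placeholder.

```shell
#!/bin/sh
# Sketch: after a link flap, record whether the interface still carries an
# IPv4 address and what systemd-networkd logged. IFACE and RUN are
# hypothetical knobs for this example; eth0 is a placeholder name.
IFACE="${IFACE:-eth0}"

# Succeeds if the `ip -4 addr show` output on stdin contains an "inet" line.
has_ipv4() {
    grep -q '^[[:space:]]*inet '
}

snapshot() {
    # Brief link state (UP/DOWN and flags) for the interface.
    ip -br link show dev "$1"
    if ip -4 addr show dev "$1" | has_ipv4; then
        echo "$1: IPv4 address present"
    else
        echo "$1: no IPv4 address assigned"
    fi
    # networkd's view of the link, then its most recent log lines this boot.
    networkctl status "$1" --no-pager
    journalctl -u systemd-networkd -b --no-pager | tail -n 50
}

# Nothing runs unless RUN=1 is set, so the file can be sourced safely.
if [ "${RUN:-0}" = "1" ]; then
    snapshot "$IFACE"
fi
```

Saving the output from a management console (iDRAC in this setup) before rebooting the node would keep the evidence that is otherwise lost with the reboot.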