Open rsavage-nozominetworks opened 6 months ago
So, an interesting discovery on my end. I use hubble with its relay and ui. If I disable hubble prior to the upgrade and perform the upgrade with hubble disabled, the upgrade goes smoothly. If I leave hubble enabled, the upgrade hits the errors above.
Update to this issue. I performed the 1.14.7->1.15.1 upgrade on another 1.29 cluster (which is a little larger and more active), and immediately hit the "unable to find endpoint link by name: Link not found" issue, even with hubble disabled. I am wondering how others are able to upgrade without noticing this problem?
Steps I take:
1. Perform pre-flight check (per upgrade guide)
2. Upgrade cilium from 1.14.7 -> 1.15.1 with hubble disabled
3. Re-enable hubble
I still receive link errors. (I hadn't re-enabled hubble in this test because I was already hitting those errors.)
Update: I have tried the latest 1.15.2 release and am still getting the same result.
Usually, these errors start a few minutes after the upgrade is complete. Then the number of "unable to find endpoint link by name: Link not found" errors increases, and more and more cilium pods start erroring out.
Thanks for this report, would you be able to grab a sysdump file and upload it please so we can better debug the issue? (Using https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-log-state-collection)
A sysdump would be very useful for digging deeper into this, as it would contain all potential information that may be present that is related to the failure. Better still would be one sysdump before upgrade and another sysdump afterwards with the symptoms.
Alternatively if someone can provide instructions on how to set up an environment with an earlier Cilium that then produces this behaviour when upgraded, then this could be very helpful. That way, we wouldn't need a sysdump as we could just run the commands to reproduce the behaviour.
Short of that, even if someone is able to provide a sample of the same log symptoms (with msg "Unable to assert ...") as well as the "cilium endpoint list" output from the same cilium-agent instance, then this will help to narrow down which endpoints are impacted - for instance, is it just the host or cilium-health endpoints, or is it an active workload?
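To make that comparison easier, here is a minimal Python sketch that tallies the symptom per endpoint from saved agent log lines, so the result can be matched against the "cilium endpoint list" output. It assumes the log lines are key=value formatted and that affected lines carry an `endpointID` field; both are assumptions, as are the helper names and the sample lines, which are hypothetical placeholders rather than real agent output.

```python
import re
from collections import Counter

# Error text quoted in this issue; matched case-insensitively.
SYMPTOM = "unable to find endpoint link by name"

# key=value pairs where the value is either a quoted string or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line):
    """Split one key=value formatted log line into a dict of fields."""
    return {key: value.strip('"') for key, value in _PAIR.findall(line)}


def tally_symptom(lines, field="endpointID"):
    """Count symptom lines per value of `field` (assumed field name)."""
    counts = Counter()
    for line in lines:
        if SYMPTOM in line.lower():
            counts[parse_fields(line).get(field, "<missing>")] += 1
    return counts


# Hypothetical sample lines, for illustration only (not real agent output).
sample = [
    'level=warning msg="(elided)" error="unable to find endpoint link by name: Link not found" endpointID=101',
    'level=info msg="unrelated line" endpointID=101',
    'level=warning msg="(elided)" error="unable to find endpoint link by name: Link not found" endpointID=205',
    'level=warning msg="(elided)" error="unable to find endpoint link by name: Link not found" endpointID=101',
]

if __name__ == "__main__":
    for endpoint, count in tally_symptom(sample).most_common():
        print(endpoint, count)
```

In practice the `sample` list would be replaced by lines read from a saved log of a single cilium-agent pod, and the resulting IDs looked up in that same agent's endpoint list to see whether they belong to the host, cilium-health, or a workload.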
@rsavage-nozominetworks are you able to provide us with a sysdump? Thank you
Is there an existing issue for this?
What happened?
Network connectivity issues when upgrading from 1.14.7 to 1.15.1. I followed the upgrade guide for 1.15.1 precisely. Two tests so far, same result. What's odd is that these errors eventually clear out and all connectivity is restored, but it takes a while (the first test took 2-3 minutes to recover, the second almost 10 minutes). These are small test clusters, and I am now very hesitant to attempt this on a larger live cluster.
Set: upgradeCompatibility: "1.14"
Set: routingMode: native
Set: tunnelProtocol: ""
P.S. I am also running cilium in chaining mode with the awsvpc-cni as well.
Cilium Version
Version: 1.14.7 -> 1.15.1 (upgrade attempt)
Kernel Version
Kernel: 5.10.205-195.807.amzn2.x86_64
Kubernetes Version
1.29
Regression
No response
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
Code of Conduct