Endpoint errors after Cilium upgrade from 1.14.7 to 1.15.X

rsavage-nozominetworks commented 6 months ago

Is there an existing issue for this?

[X] I have searched the existing issues

What happened?

Network connectivity issues when upgrading from 1.14.7 to 1.15.1. I followed the upgrade guide for 1.15.1 precisely.

Two tests so far, same result. Oddly, these errors eventually clear and all connectivity is restored, but it takes a while: the first test took 2-3 minutes to recover and the second took almost 10 minutes. These are small test clusters, and I am very hesitant to attempt this on a larger live cluster now.

Set: upgradeCompatibility: "1.14"
Set: routingMode: native
Set: tunnelProtocol: ""

P.S. I am also running cilium in chaining mode with the aws vpc-cni.

Cilium Version

Version: 1.14.7 -> 1.15.1 (upgrade attempt)

Kernel Version

Kernel: 5.10.205-195.807.amzn2.x86_64

Kubernetes Version

1.29

Regression

No response

Sysdump

No response

Relevant log output

~ ᐅ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         4 errors
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         OK
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Deployment        hubble-relay       Desired: 2, Ready: 2/2, Available: 2/2
Deployment        cilium-operator    Desired: 3, Ready: 3/3, Available: 3/3
Deployment        hubble-ui          Desired: 2, Ready: 2/2, Available: 2/2
DaemonSet         cilium             Desired: 9, Ready: 9/9, Available: 9/9
Containers:       cilium             Running: 9
                  hubble-relay       Running: 2
                  cilium-operator    Running: 3
                  hubble-ui          Running: 2
Cluster Pods:     153/155 managed by Cilium
Image versions    hubble-relay       quay.io/cilium/hubble-relay:v1.15.1@sha256:3254aaf85064bc1567e8ce01ad634b6dd269e91858c83be99e47e685d4bb8012: 2
                  cilium-operator    quay.io/cilium/operator-generic:v1.15.1@sha256:819c7281f5a4f25ee1ce2ec4c76b6fbc69a660c68b7825e9580b1813833fa743: 3
                  hubble-ui          quay.io/cilium/hubble-ui:v0.13.0@sha256:7d663dc16538dd6e29061abd1047013a645e6e69c115e008bee9ea9fef9a6666: 2
                  hubble-ui          quay.io/cilium/hubble-ui-backend:v0.13.0@sha256:1e7657d997c5a48253bb8dc91ecee75b63018d16ff5e5797e5af367336bc8803: 2
                  cilium             quay.io/cilium/cilium:v1.15.1@sha256:351d6685dc6f6ffbcd5451043167cfa8842c6decf80d8c8e426a417c73fb56d4: 9
Errors:           cilium             cilium-bf8p2    controller ep-bpf-prog-watchdog is failing since 6s (4x): unable to find endpoint link by name: Link not found
                  cilium             cilium-vhp87    controller ep-bpf-prog-watchdog is failing since 25s (28x): unable to find endpoint link by name: Link not found
                  cilium             cilium-vhp87    controller endpoint-29-regeneration-recovery is failing since 11s (21x): regeneration recovery failed
                  cilium             cilium-wvmps    controller ep-bpf-prog-watchdog is failing since 4s (7x): unable to find endpoint link by name: Link not found

--
level=error msg="Unable to assert if endpoint BPF programs need to be reloaded" endpoint=enid0436d4012a endpointID=1702 error="unable to find endpoint link by name: Link not found" subsys=daemon

Anything else?

No response

Cilium Users Document

[X] Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

[X] I agree to follow this project's Code of Conduct

rsavage-nozominetworks commented 6 months ago

An interesting discovery on my end: I use hubble with its relay and ui. If I disable hubble prior to the upgrade and perform the upgrade with hubble disabled, the upgrade goes smoothly. If I leave hubble enabled, the upgrade hits the errors above.

rsavage-nozominetworks commented 6 months ago

Update to this issue. I performed the 1.14.7->1.15.1 upgrade on another 1.29 cluster (which is a little larger and more active), and immediately hit the "unable to find endpoint link by name: Link not found" issue, even with hubble disabled. I am wondering how others are able to upgrade without noticing this problem.

Steps I take

1. Perform pre-flight check (per upgrade guide)
2. Upgrade cilium from 1.14.7 -> 1.15.1 with hubble disabled
3. Re-enable hubble

I still receive the link errors. (I had not re-enabled hubble in this test because I was already hitting those errors.)


rsavage-nozominetworks commented 6 months ago

Update: I have tried the latest 1.15.2 release and am still getting the same result.

rsavage-nozominetworks commented 6 months ago

Usually, these errors start a few minutes after the upgrade completes. Then the number of "unable to find endpoint link by name: Link not found" errors starts to increase, and more and more cilium pods start erroring out.
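
To track how the failures spread across agent pods over time, one option is to tally the Errors section of repeated `cilium status` runs. The following is only a rough sketch: the regular expression is inferred from the error lines pasted in this issue, not from any documented output format, and `failing_controllers` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern inferred from the "Errors:" lines pasted in this issue (an assumption,
# not a documented format), e.g.
#   cilium  cilium-bf8p2  controller ep-bpf-prog-watchdog is failing since 6s (4x): <reason>
FAILING = re.compile(
    r"(?P<pod>\S+)\s+controller\s+(?P<controller>\S+) is failing since "
    r"(?P<since>\S+) \((?P<count>\d+)x\): (?P<reason>.+)"
)


def failing_controllers(status_text):
    """Group failing controllers by agent pod.

    Returns {pod_name: [(controller_name, failure_count, reason), ...]}.
    Lines that do not look like a controller failure are ignored.
    """
    by_pod = defaultdict(list)
    for line in status_text.splitlines():
        match = FAILING.search(line)
        if match:
            by_pod[match["pod"]].append(
                (match["controller"], int(match["count"]), match["reason"].strip())
            )
    return dict(by_pod)
```

Feeding it the text of successive `cilium status` runs and comparing the number of keys in the result would show whether more pods are entering the failing state.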

dylandreimerink commented 6 months ago

Thanks for this report, would you be able to grab a sysdump file and upload it please so we can better debug the issue? (Using https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-log-state-collection)

joestringer commented 5 months ago

A sysdump would be very useful for digging deeper into this, as it would contain all potential information that may be present that is related to the failure. Better still would be one sysdump before upgrade and another sysdump afterwards with the symptoms.

Alternatively if someone can provide instructions on how to set up an environment with an earlier Cilium that then produces this behaviour when upgraded, then this could be very helpful. That way, we wouldn't need a sysdump as we could just run the commands to reproduce the behaviour.

Short of that, even if someone is able to provide a sample of the same log symptoms (with msg Unable to assert ...) as well as the cilium endpoint list output in the same cilium-agent instance, then this will help to narrow down which endpoints are impacted - for instance, is it just the host or cilium-health endpoints or is it for an active workload?
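
As a possible starting point for that narrowing-down, here is a minimal sketch that counts the link errors per endpoint in saved cilium-agent log lines, so the resulting IDs can be compared by hand against the `cilium endpoint list` output from the same agent. It assumes the logs are in the key=value form pasted earlier in this issue; `parse_logfmt` and `count_link_errors` are hypothetical helper names, not part of any Cilium tooling.

```python
import shlex
from collections import Counter

# Error text taken from the log lines pasted in this issue.
LINK_ERROR = "unable to find endpoint link by name"


def parse_logfmt(line):
    """Split one key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # keep only tokens that actually contain '='
            fields[key] = value
    return fields


def count_link_errors(lines):
    """Count link-not-found errors per (endpointID, interface name) pair."""
    counts = Counter()
    for line in lines:
        try:
            fields = parse_logfmt(line)
        except ValueError:
            continue  # unbalanced quotes: not a line this sketch can parse
        if LINK_ERROR in fields.get("error", ""):
            counts[(fields.get("endpointID"), fields.get("endpoint"))] += 1
    return counts
```

Passing the lines of a saved agent log to `count_link_errors` yields a counter keyed by endpoint ID and interface name. Whether those IDs belong to the host, cilium-health, or workload endpoints still has to be read from the endpoint list itself.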

aanm commented 2 months ago

@rsavage-nozominetworks are you able to provide us with a sysdump? Thank you
