cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

CI: [v1.16] Cilium IPsec upgrade (ci-ipsec-upgrade) - conn-disrupt-test-check after downgrading #36086

Open rastislavs opened 4 days ago

rastislavs commented 4 days ago

CI failure

This job seems to have been failing consistently on v1.16 recently. The following test fails after downgrading Cilium:

[=] [cilium-test-1] Test [outside-to-ingress-service] [87/104]
.
[-] Scenario [outside-to-ingress-service/outside-to-ingress-service]
  [.] Action [outside-to-ingress-service/outside-to-ingress-service/curl-ingress-service-0: cilium-test-1/host-netns-non-cilium-bbgjk (172.18.0.5) -> cilium-test-1/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test-1:80)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://172.18.0.3:31985/" failed: error with exec request (pod=cilium-test-1/host-netns-non-cilium-bbgjk, container=host-netns-non-cilium): command terminated with exit code 28
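
For reference when reading the failure above: curl exit code 28 means a timeout was reached (here either `--connect-timeout 2` or `--max-time 10`), so the request did not complete, as opposed to being refused or answered with an HTTP error. A minimal sketch of a lookup helper follows; the function name `describe_curl_exit` is hypothetical and the table covers only a few codes from the curl man page.

```python
# Small, non-exhaustive subset of curl exit codes (see the curl man page).
CURL_EXIT_CODES = {
    0: "success",
    6: "could not resolve host",
    7: "failed to connect to host",
    22: "HTTP error returned (with --fail, status >= 400)",
    28: "operation timed out",
}


def describe_curl_exit(code: int) -> str:
    """Return a short description for a curl exit code (hypothetical helper)."""
    return CURL_EXIT_CODES.get(code, f"unrecognized exit code {code}")


if __name__ == "__main__":
    # The failing action in the log above terminated with exit code 28.
    print(describe_curl_exit(28))
```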

CI job: https://github.com/cilium/cilium/actions/runs/11941673197/job/33287096362

cilium-junits.zip cilium-sysdumps.zip

asauber commented 4 days ago

x-ref discussion here: https://github.com/cilium/cilium/pull/36047#issuecomment-2487039121

joestringer commented 4 days ago

Given that the test validates Ingress connectivity from a node "outside" the Cilium cluster into the cluster, encryption is not in play here, even though this is the "IPsec upgrade" test. Commit 9197100c6d25 introduced this testing for two specific combinations in the test, and it has failed reliably on both configurations since v1.16.4 was released. While we investigate, it may make sense to revert that commit on the v1.16 branch to remove the reliable failure.

As for investigating further, we could add equivalent testing to the regular upgrade tests, validating it by pushing the PR to cilium/cilium and enabling a pull_request: {} trigger in the workflow so that the workflow changes are actually run. I am curious why the individual test passes when the agent is upgraded from v1.16.4 to the tip of v1.16 but fails only after the downgrade from the v1.16 tip back to v1.16.4. I would have expected these to be equivalent operations, but the test seems to fail reliably for this particular transition. It would also be interesting to know whether the individual test fails consistently after that downgrade, or whether it fails only for a short time until Cilium reconciles some state and connectivity recovers.
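
One way to answer the last question (consistent failure versus a short outage that heals) would be to repeat the same curl probe for a while after the downgrade and look at the sequence of results. The following is only a sketch under assumptions: the names `curl_probe` and `classify_connectivity`, the attempt count, and the interval are all hypothetical, and the curl flags are copied from the failing action in the log.

```python
import subprocess
import time
from typing import Callable


def curl_probe(url: str) -> bool:
    """Return True if curl succeeds, using the flags from the failing test."""
    result = subprocess.run(
        ["curl", "--silent", "--fail", "--show-error", "--output", "/dev/null",
         "--connect-timeout", "2", "--max-time", "10", url],
        check=False,
    )
    return result.returncode == 0


def classify_connectivity(
    probe: Callable[[], bool],
    attempts: int = 30,          # assumed value, tune as needed
    interval: float = 10.0,      # seconds between probes, assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Probe repeatedly and summarize the outcome.

    Returns "healthy" if every probe succeeded, "recovered" if at least one
    probe failed but the final probe succeeded, and "failing" otherwise.
    """
    results = []
    for i in range(attempts):
        results.append(probe())
        if i < attempts - 1:
            sleep(interval)
    if all(results):
        return "healthy"
    if results[-1]:
        return "recovered"
    return "failing"


if __name__ == "__main__":
    # URL taken from the failing log line; it is specific to that CI run.
    print(classify_connectivity(lambda: curl_probe("http://172.18.0.3:31985/")))
```

Run from the same vantage point as the test (the host-netns-non-cilium pod in the log), a "recovered" result would point at state that Cilium eventually reconciles, while "failing" would point at a persistent problem after the downgrade.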

harsimran-pabla commented 3 days ago

@joestringer I have created a revert for the commit (cc @pchaigno), so that other patches going into 1.16 are unblocked while we investigate why we are seeing this failure.