cilium / cilium-cli

CLI to install, manage & troubleshoot Kubernetes clusters running Cilium
https://cilium.io
Apache License 2.0
390 stars 196 forks source link

connectivity tests [check-log-errors] failing on transient error #2577

Open camrynl opened 4 weeks ago

camrynl commented 4 weeks ago

Seeing connectivity test consistently fail on the check-log-errors test case with AKS cluster. The log error message is transient as it recovers shortly after.

Connectivity Test Output:

ℹ️  Monitor aggregation detected, will skip some flow validation steps
✨ [ciliumnightly-2ab9cfe8-030632] Creating namespace cilium-test for connectivity check...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying echo-same-node service...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying DNS test server configmap...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying same-node deployment...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying client deployment...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying client2 deployment...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying client3 deployment...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying echo-other-node service...
✨ [ciliumnightly-2ab9cfe8-030632] Deploying other-node deployment...
✨ [host-netns] Deploying ciliumnightly-2ab9cfe8-030632 daemonset...
✨ [host-netns-non-cilium] Deploying ciliumnightly-2ab9cfe8-030632 daemonset...
ℹ️  Skipping tests that require a node Without Cilium
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for deployment cilium-test/client to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for deployment cilium-test/client2 to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for deployment cilium-test/echo-same-node to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for deployment cilium-test/client3 to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for deployment cilium-test/echo-other-node to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client-69748f45d8-n45rh to reach DNS server on cilium-test/echo-same-node-c74f867d5-dwxws pod...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client2-56896f78f8-5ck7n to reach DNS server on cilium-test/echo-same-node-c74f867d5-dwxws pod...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client3-5f56bc568c-qmhbk to reach DNS server on cilium-test/echo-same-node-c74f867d5-dwxws pod...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client-69748f45d8-n45rh to reach DNS server on cilium-test/echo-other-node-6bb6657c95-ljhzd pod...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client2-56896f78f8-5ck7n to reach DNS server on cilium-test/echo-other-node-6bb6657c95-ljhzd pod...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client3-5f56bc568c-qmhbk to reach DNS server on cilium-test/echo-other-node-6bb6657c95-ljhzd pod...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client-69748f45d8-n45rh to reach default/kubernetes service...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client2-56896f78f8-5ck7n to reach default/kubernetes service...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for pod cilium-test/client3-5f56bc568c-qmhbk to reach default/kubernetes service...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for Service cilium-test/echo-other-node to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for Service cilium-test/echo-other-node to be synchronized by Cilium pod kube-system/cilium-hkg8n
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for Service cilium-test/echo-other-node to be synchronized by Cilium pod kube-system/cilium-n5dsz
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for Service cilium-test/echo-same-node to be synchronized by Cilium pod kube-system/cilium-hkg8n
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for Service cilium-test/echo-same-node to be synchronized by Cilium pod kube-system/cilium-n5dsz
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for NodePort 10.10.0.4:31158 (cilium-test/echo-other-node) to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for NodePort 10.10.0.4:30613 (cilium-test/echo-same-node) to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for NodePort 10.10.0.5:31158 (cilium-test/echo-other-node) to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for NodePort 10.10.0.5:30613 (cilium-test/echo-same-node) to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for DaemonSet cilium-test/host-netns-non-cilium to become ready...
⌛ [ciliumnightly-2ab9cfe8-030632] Waiting for DaemonSet cilium-test/host-netns to become ready...
ℹ️  Skipping IPCache check
🔭 Enabling Hubble telescope...
⚠️  Unable to contact Hubble Relay, disabling Hubble telescope and flow validation: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"

[=] [cilium-test] Test [check-log-errors] [78/78]
.
  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (install-cni-binaries)]
  [-] Scenario [check-log-errors/no-errors-in-logs]
  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (cilium-agent)]
  ❌ Found 4 logs in ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (cilium-agent) matching list of errors that must be investigated:
time="2024-06-03T07:11:25Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader (1 occurrences)
time="2024-06-03T07:12:02Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader (1 occurrences)
time="2024-06-03T07:12:49Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader (1 occurrences)
time="2024-06-03T07:14:07Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader (1 occurrences)
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (mount-cgroup)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (apply-sysctl-overwrites)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (mount-bpf-fs)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (clean-cilium-state)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (systemd-networkd-overrides)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (block-wireserver)]
  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (cilium-agent)]
..  ❌ Found 3 logs in ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (cilium-agent) matching list of errors that must be investigated:
time="2024-06-03T07:14:08Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader (1 occurrences)
time="2024-06-03T07:15:35Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader (2 occurrences)
time="2024-06-03T07:12:03Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader (1 occurrences)
  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (install-cni-binaries)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (mount-cgroup)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (apply-sysctl-overwrites)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (mount-bpf-fs)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (clean-cilium-state)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (systemd-networkd-overrides)]
.  [.] Action [check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (block-wireserver)]
.
📋 Test Report [cilium-test]
❌ 1/37 tests failed (2/353 actions), 41 tests skipped, 0 scenarios skipped:
Test [check-log-errors]:
  ❌ check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-n5dsz (cilium-agent)
  ❌ check-log-errors/no-errors-in-logs/ciliumnightly-2ab9cfe8-030632/kube-system/cilium-hkg8n (cilium-agent)
[cilium-test] 1 tests failed

Cilium logs showing error and recovery:

time=2024-05-30T22:07:56Z level=info msg="Devices changed" module=agent.datapath.devices-controller devices=[]
time="2024-05-30T22:07:57Z" level=info msg="Node addresses updated" device=eth0 node-addresses= subsys=node-address
time="2024-05-30T22:07:57Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=datapath-loader
time="2024-05-30T22:07:57Z" level=warning msg="Failed to reload datapath" error="unable to recompile base programs from devices changed: failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: IPv4 direct routing device not found" subsys=daemon
time=2024-05-30T22:07:57Z level=info msg="Devices changed" module=agent.datapath.devices-controller devices=[eth0]
time="2024-05-30T22:07:57Z" level=info msg="Node addresses updated" device=eth0 node-addresses="10.10.0.5 (primary, nodeport)" subsys=node-address
time="2024-05-30T22:07:58Z" level=info msg="Updated link /sys/fs/bpf/cilium/socketlb/links/cgroup/cil_sock6_post_bind for program cil_sock6_post_bind" subsys=socketlb
time="2024-05-30T22:07:58Z" level=info msg="Updated link /sys/fs/bpf/cilium/socketlb/links/cgroup/cil_sock6_connect for program cil_sock6_connect" subsys=socketlb
time="2024-05-30T22:07:58Z" level=info msg="Updated link /sys/fs/bpf/cilium/socketlb/links/cgroup/cil_sock6_sendmsg for program cil_sock6_sendmsg" subsys=socketlb
time="2024-05-30T22:07:58Z" level=info msg="Updated link /sys/fs/bpf/cilium/socketlb/links/cgroup/cil_sock6_recvmsg for program cil_sock6_recvmsg" subsys=socketlb
time="2024-05-30T22:07:58Z" level=info msg="Updated link /sys/fs/bpf/cilium/socketlb/links/cgroup/cil_sock4_getpeername for program cil_sock4_getpeername" subsys=socketlb
time="2024-05-30T22:07:58Z" level=info msg="Updated link /sys/fs/bpf/cilium/socketlb/links/cgroup/cil_sock4_post_bind for program cil_sock4_post_bind" subsys=socketlb
time="2024-05-30T22:07:58Z" level=info msg="Updated link /sys/fs/bpf/cilium/socketlb/links/cgroup/cil_sock6_getpeername for program cil_sock6_getpeername" subsys=socketlb

Cilium version: 1.16.0 Cilium-cli version: v0.16.8

Is there a way for us to avoid these transient errors? I see that there are some exceptions for this test, but seeing as this is for last resort cases, how are these minor error messages dealt with? https://github.com/cilium/cilium-cli/blob/101777f9d29e075277e2522a07df2d14d60a9ab4/connectivity/tests/errors.go#L48