Closed christarazi closed 1 month ago
catched another one and detected the following in the sysdump
https://github.com/cilium/cilium/actions/runs/9107971706/job/25037869655
(inner sysdump cilium-sysdump.zip)
kube-system cilium-9n6jt 0/1 CrashLoopBackOff 4 (81s ago) 4m1s 172.18.0.2 kind-worker <none> <none>
containerStatuses:
- containerID: containerd://67dd5a15659f8ca69aed97b688564f8f8ac2bda6d4a235738c47055737125ac8
image: quay.io/cilium/cilium-ci:4aa6a7fa004b3147bb0cf6f368bebb8f4df450df
imageID: quay.io/cilium/cilium-ci@sha256:39c4823838cae593227de56881164efd72b15534cf82395999c83d2746c43d97
lastState:
terminated:
containerID: containerd://67dd5a15659f8ca69aed97b688564f8f8ac2bda6d4a235738c47055737125ac8
exitCode: 1
finishedAt: "2024-05-16T07:32:49Z"
message: |-
time="2024-05-16T07:32:48Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 118.75% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/baf7f5ca798fcaa63467381be04513cbeb161c9041d734680d12cf82541b2db0 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 125 Data: 20 Stack: 0 Locked: 0 Swap: 0" endpointID=2645 subsys=endpoint
time="2024-05-16T07:32:48Z" level=debug msg="Load 1-min: 1.48 5-min: 3.11 15min: 2.79" subsys=loadinfo type=background
time="2024-05-16T07:32:48Z" level=debug msg="Memory: Total: 11969 Used: 2138 (17.86%) Free: 2335 Buffers: 145 Cached: 7155" subsys=loadinfo type=background
time="2024-05-16T07:32:48Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" subsys=loadinfo type=background
time="2024-05-16T07:32:48Z" level=debug msg="NAME cilium-agent STATUS S PID 1 CPU: 30.13% MEM: 1.17% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 140 VMS: 1308 Data: 108 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
time="2024-05-16T07:32:48Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 128.75% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/baf7f5ca
reason: Error
startedAt: "2024-05-16T07:32:36Z"
name: cilium-agent
ready: false
restartCount: 4
started: false
state:
waiting:
message: back-off 1m20s restarting failed container=cilium-agent pod=cilium-9n6jt_kube-system(3726000e-e008-4b5c-b82d-55b295cf8cd6)
reason: CrashLoopBackOff
another one (https://github.com/cilium/cilium/actions/runs/9101307271) with the same ending loglines in the prev
logfile of the agents. (inner sysdump: cilium-sysdump.zip)
2024-05-15T19:36:27.076881948Z time="2024-05-15T19:36:27Z" level=debug msg="Controller run succeeded; waiting for next controller update or stop" name=ipcache-inject-labels subsys=controller uuid=9208713a-1b4f-4c51-977e-977c62191690
2024-05-15T19:36:27.076884613Z time="2024-05-15T19:36:27Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076887719Z time="2024-05-15T19:36:27Z" level=debug msg="Memory: Total: 11969 Used: 2116 (17.68%) Free: 2361 Buffers: 143 Cached: 7150" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076890233Z time="2024-05-15T19:36:27Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076892908Z time="2024-05-15T19:36:27Z" level=debug msg="NAME cilium-agent STATUS R PID 1 CPU: 31.71% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 138 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.080792604Z time="2024-05-15T19:36:27Z" level=debug msg="Controller func execution time: 5.009658018s" name="sync-to-k8s-ciliumendpoint (347)" subsys=controller uuid=0ef00285-88af-4b92-8cd4-296fbf43fde3
2024-05-15T19:36:27.080978030Z time="2024-05-15T19:36:27Z" level=debug msg="upserting health status" reporter-id="agent.controlplane.endpoint-manager.cilium-endpoint-347 (kube-system/coredns-7db6d8ff4d-hjnjp).cep-k8s-sync" status="agent.controlplane.endpoint-manager.cilium-endpoint-347 (kube-system/coredns-7db6d8ff4d-hjnjp).cep-k8s-sync: [OK] sync-to-k8s-ciliumendpoint (347)" subsys=healthv2
2024-05-15T19:36:27.084355080Z time="2024-05-15T19:36:27Z" level=debug msg="Compiling datapath" clang="clang version 17.0.6 (https://github.com/llvm/llvm-project.git 6009708b4367171ccdbf4b5905cb6a803753fe18)\nTarget: unknown\nThread model: posix\nInstalledDir: /usr/local/bin\n" debug=true subsys=datapath-loader
2024-05-15T19:36:27.084462357Z time="2024-05-15T19:36:27Z" level=debug msg="Launching compiler" args="[-I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -E -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -]" subsys=datapath-loader target=clang
2024-05-15T19:36:27.123704426Z time="2024-05-15T19:36:27Z" level=debug msg="Compilation had peak RSS of 141772 bytes" compiler-pid=397 output=/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043/bpf_lxc.c subsys=datapath-loader
2024-05-15T19:36:27.123854806Z time="2024-05-15T19:36:27Z" level=debug msg="Launching compiler" args="[-I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -]" subsys=datapath-loader target=clang
2024-05-15T19:36:27.590204451Z time="2024-05-15T19:36:27Z" level=debug msg="CES deleted, calling endpointDeleted" CEPName=kube-system/coredns-7db6d8ff4d-hjnjp CESName=ces-2qrzhpbx4-vp65v subsys=k8s-watcher
2024-05-15T19:36:27.599922248Z time="2024-05-15T19:36:27Z" level=debug msg="CEP inserted, calling endpointUpdated" CEPName=kube-system/coredns-7db6d8ff4d-hjnjp CESName=ces-xdkmhv42f-4j2d2 subsys=k8s-watcher
2024-05-15T19:36:27.600134524Z time="2024-05-15T19:36:27Z" level=debug msg="Upserting IP into ipcache layer" identity="{112 custom-resource [] false false}" ipAddr=10.0.0.107 k8sNamespace=kube-system k8sPodName=coredns-7db6d8ff4d-hjnjp key=0 namedPorts="map[dns:{53 17} dns-tcp:{53 6} metrics:{9153 6}]" subsys=ipcache
2024-05-15T19:36:27.600380186Z time="2024-05-15T19:36:27Z" level=debug msg="Daemon notified of IP-Identity cache state change" identity="{112 custom-resource [] false false}" ipAddr="{10.0.0.107 ffffffff}" modification=Upsert subsys=datapath-ipcache
2024-05-15T19:36:27.721631463Z time="2024-05-15T19:36:27Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" subsys=loadinfo type=background
2024-05-15T19:36:27.721940229Z time="2024-05-15T19:36:27Z" level=debug msg="Memory: Total: 11969 Used: 2117 (17.69%) Free: 2359 Buffers: 143 Cached: 7150" subsys=loadinfo type=background
2024-05-15T19:36:27.723146119Z time="2024-05-15T19:36:27Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.723159083Z time="2024-05-15T19:36:27Z" level=debug msg="NAME cilium-agent STATUS R PID 1 CPU: 31.79% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 138 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.723163091Z time="2024-05-15T19:36:27Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 122.92% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 125 Data: 20 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.780576147Z time="2024-05-15T19:36:27Z" level=info msg="starting xDS stream processing" subsys=xds xdsStreamID=1
2024-05-15T19:36:27.782123083Z time="2024-05-15T19:36:27Z" level=debug msg="received request from xDS stream" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782134003Z time="2024-05-15T19:36:27Z" level=debug msg="xDS was restarted, setting nonce to 1" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782297037Z time="2024-05-15T19:36:27Z" level=debug msg="notifying observers of ACKs" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782306946Z time="2024-05-15T19:36:27Z" level=debug msg="starting watch on 0 resources" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782399248Z time="2024-05-15T19:36:27Z" level=debug msg="current resource version is 1, waiting for it to become > 1" subsys=xds xdsAckedVersion=1 xdsClientNode=127.0.0.1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:28.076216711Z time="2024-05-15T19:36:28Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.076522622Z time="2024-05-15T19:36:28Z" level=debug msg="Memory: Total: 11969 Used: 2117 (17.69%) Free: 2359 Buffers: 143 Cached: 7150" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.076528853Z time="2024-05-15T19:36:28Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.077147909Z time="2024-05-15T19:36:28Z" level=debug msg="NAME cilium-agent STATUS S PID 1 CPU: 29.48% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 139 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.077624929Z time="2024-05-15T19:36:28Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 62.84% MEM: 0.62% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 124 Data: 19 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint
=> clang issue? (maybe also red herring as cilium seems to print the status of long-running clang every 5 seconds)
maybe the timeout of 240s from creating the Cilium Agent DaemonSet to readiness of all its Pods is no longer enough? :face_with_diagonal_mouth:
We also have similar ci flake issues - e.g. https://github.com/cilium/cilium/issues/32431
it seems to take quite a long time until the health endpoint is up and ready - and this is used for the Pods readiness probe.
Another one: https://github.com/cilium/cilium/actions/runs/9379984240/job/25826829114
K8sDatapathConfig High-scale IPcache
Test ingress policy enforcement with GENEVE and endpoint routes
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515
...
08:22:17 STEP: WaitforPods(namespace="default", filter="-l type=client") => timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expire
...
FAIL: Client pods not ready after timeout
Unexpected error:
<*fmt.wrapError | 0xc000068080>:
timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired
{
msg: "timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired",
err: <*errors.errorString | 0xc000600000>{
s: "8m0s timeout expired",
},
}
occurred
Full logs: logs_24547426283.zip
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.
CI failure
Sysdump is too large to upload here.
https://github.com/cilium/cilium/actions/runs/8980253691