cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0
20.16k stars 2.96k forks source link

CI: Conformance Ginkgo: Client pods not ready after timeout: `timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired` #32404

Closed christarazi closed 1 month ago

christarazi commented 6 months ago

CI failure


• Failure [1086.350 seconds]
K8sDatapathConfig
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  High-scale IPcache
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    Test ingress policy enforcement with GENEVE and endpoint routes [It]
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

    Client pods not ready after timeout
    Unexpected error:
        <*fmt.wrapError | 0xc003562e20>: 
        timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired
        {
            msg: "timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired",
            err: <*errors.errorString | 0xc003561e90>{
                s: "8m0s timeout expired",
            },
        }
    occurred

    /home/runner/work/cilium/cilium/test/k8s/datapath_configuration.go:695

Sysdump is too large to upload here.

https://github.com/cilium/cilium/actions/runs/8980253691

mhofstetter commented 5 months ago

catched another one and detected the following in the sysdump

https://github.com/cilium/cilium/actions/runs/9107971706/job/25037869655

(inner sysdump cilium-sysdump.zip)

kube-system         cilium-9n6jt                                 0/1     CrashLoopBackOff   4 (81s ago)     4m1s    172.18.0.2   kind-worker          <none>           <none>
    containerStatuses:
    - containerID: containerd://67dd5a15659f8ca69aed97b688564f8f8ac2bda6d4a235738c47055737125ac8
      image: quay.io/cilium/cilium-ci:4aa6a7fa004b3147bb0cf6f368bebb8f4df450df
      imageID: quay.io/cilium/cilium-ci@sha256:39c4823838cae593227de56881164efd72b15534cf82395999c83d2746c43d97
      lastState:
        terminated:
          containerID: containerd://67dd5a15659f8ca69aed97b688564f8f8ac2bda6d4a235738c47055737125ac8
          exitCode: 1
          finishedAt: "2024-05-16T07:32:49Z"
          message: |-
            time="2024-05-16T07:32:48Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 118.75% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/baf7f5ca798fcaa63467381be04513cbeb161c9041d734680d12cf82541b2db0 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 125 Data: 20 Stack: 0 Locked: 0 Swap: 0" endpointID=2645 subsys=endpoint
            time="2024-05-16T07:32:48Z" level=debug msg="Load 1-min: 1.48 5-min: 3.11 15min: 2.79" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="Memory: Total: 11969 Used: 2138 (17.86%) Free: 2335 Buffers: 145 Cached: 7155" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="NAME cilium-agent STATUS S PID 1 CPU: 30.13% MEM: 1.17% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 140 VMS: 1308 Data: 108 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 128.75% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/baf7f5ca
          reason: Error
          startedAt: "2024-05-16T07:32:36Z"
      name: cilium-agent
      ready: false
      restartCount: 4
      started: false
      state:
        waiting:
          message: back-off 1m20s restarting failed container=cilium-agent pod=cilium-9n6jt_kube-system(3726000e-e008-4b5c-b82d-55b295cf8cd6)
          reason: CrashLoopBackOff

another one (https://github.com/cilium/cilium/actions/runs/9101307271) with the same ending loglines in the prev logfile of the agents. (inner sysdump: cilium-sysdump.zip)

2024-05-15T19:36:27.076881948Z time="2024-05-15T19:36:27Z" level=debug msg="Controller run succeeded; waiting for next controller update or stop" name=ipcache-inject-labels subsys=controller uuid=9208713a-1b4f-4c51-977e-977c62191690
2024-05-15T19:36:27.076884613Z time="2024-05-15T19:36:27Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076887719Z time="2024-05-15T19:36:27Z" level=debug msg="Memory: Total: 11969 Used: 2116 (17.68%) Free: 2361 Buffers: 143 Cached: 7150" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076890233Z time="2024-05-15T19:36:27Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076892908Z time="2024-05-15T19:36:27Z" level=debug msg="NAME cilium-agent STATUS R PID 1 CPU: 31.71% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 138 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.080792604Z time="2024-05-15T19:36:27Z" level=debug msg="Controller func execution time: 5.009658018s" name="sync-to-k8s-ciliumendpoint (347)" subsys=controller uuid=0ef00285-88af-4b92-8cd4-296fbf43fde3
2024-05-15T19:36:27.080978030Z time="2024-05-15T19:36:27Z" level=debug msg="upserting health status" reporter-id="agent.controlplane.endpoint-manager.cilium-endpoint-347 (kube-system/coredns-7db6d8ff4d-hjnjp).cep-k8s-sync" status="agent.controlplane.endpoint-manager.cilium-endpoint-347 (kube-system/coredns-7db6d8ff4d-hjnjp).cep-k8s-sync: [OK] sync-to-k8s-ciliumendpoint (347)" subsys=healthv2
2024-05-15T19:36:27.084355080Z time="2024-05-15T19:36:27Z" level=debug msg="Compiling datapath" clang="clang version 17.0.6 (https://github.com/llvm/llvm-project.git 6009708b4367171ccdbf4b5905cb6a803753fe18)\nTarget: unknown\nThread model: posix\nInstalledDir: /usr/local/bin\n" debug=true subsys=datapath-loader
2024-05-15T19:36:27.084462357Z time="2024-05-15T19:36:27Z" level=debug msg="Launching compiler" args="[-I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -E -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -]" subsys=datapath-loader target=clang
2024-05-15T19:36:27.123704426Z time="2024-05-15T19:36:27Z" level=debug msg="Compilation had peak RSS of 141772 bytes" compiler-pid=397 output=/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043/bpf_lxc.c subsys=datapath-loader
2024-05-15T19:36:27.123854806Z time="2024-05-15T19:36:27Z" level=debug msg="Launching compiler" args="[-I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -]" subsys=datapath-loader target=clang
2024-05-15T19:36:27.590204451Z time="2024-05-15T19:36:27Z" level=debug msg="CES deleted, calling endpointDeleted" CEPName=kube-system/coredns-7db6d8ff4d-hjnjp CESName=ces-2qrzhpbx4-vp65v subsys=k8s-watcher
2024-05-15T19:36:27.599922248Z time="2024-05-15T19:36:27Z" level=debug msg="CEP inserted, calling endpointUpdated" CEPName=kube-system/coredns-7db6d8ff4d-hjnjp CESName=ces-xdkmhv42f-4j2d2 subsys=k8s-watcher
2024-05-15T19:36:27.600134524Z time="2024-05-15T19:36:27Z" level=debug msg="Upserting IP into ipcache layer" identity="{112 custom-resource [] false false}" ipAddr=10.0.0.107 k8sNamespace=kube-system k8sPodName=coredns-7db6d8ff4d-hjnjp key=0 namedPorts="map[dns:{53 17} dns-tcp:{53 6} metrics:{9153 6}]" subsys=ipcache
2024-05-15T19:36:27.600380186Z time="2024-05-15T19:36:27Z" level=debug msg="Daemon notified of IP-Identity cache state change" identity="{112 custom-resource [] false false}" ipAddr="{10.0.0.107 ffffffff}" modification=Upsert subsys=datapath-ipcache
2024-05-15T19:36:27.721631463Z time="2024-05-15T19:36:27Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" subsys=loadinfo type=background
2024-05-15T19:36:27.721940229Z time="2024-05-15T19:36:27Z" level=debug msg="Memory: Total: 11969 Used: 2117 (17.69%) Free: 2359 Buffers: 143 Cached: 7150" subsys=loadinfo type=background
2024-05-15T19:36:27.723146119Z time="2024-05-15T19:36:27Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.723159083Z time="2024-05-15T19:36:27Z" level=debug msg="NAME cilium-agent STATUS R PID 1 CPU: 31.79% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 138 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.723163091Z time="2024-05-15T19:36:27Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 122.92% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 125 Data: 20 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.780576147Z time="2024-05-15T19:36:27Z" level=info msg="starting xDS stream processing" subsys=xds xdsStreamID=1
2024-05-15T19:36:27.782123083Z time="2024-05-15T19:36:27Z" level=debug msg="received request from xDS stream" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782134003Z time="2024-05-15T19:36:27Z" level=debug msg="xDS was restarted, setting nonce to 1" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782297037Z time="2024-05-15T19:36:27Z" level=debug msg="notifying observers of ACKs" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782306946Z time="2024-05-15T19:36:27Z" level=debug msg="starting watch on 0 resources" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782399248Z time="2024-05-15T19:36:27Z" level=debug msg="current resource version is 1, waiting for it to become > 1" subsys=xds xdsAckedVersion=1 xdsClientNode=127.0.0.1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:28.076216711Z time="2024-05-15T19:36:28Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.076522622Z time="2024-05-15T19:36:28Z" level=debug msg="Memory: Total: 11969 Used: 2117 (17.69%) Free: 2359 Buffers: 143 Cached: 7150" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.076528853Z time="2024-05-15T19:36:28Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.077147909Z time="2024-05-15T19:36:28Z" level=debug msg="NAME cilium-agent STATUS S PID 1 CPU: 29.48% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 139 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.077624929Z time="2024-05-15T19:36:28Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 62.84% MEM: 0.62% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 124 Data: 19 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint

=> clang issue? (maybe also red herring as cilium seems to print the status of long-running clang every 5 seconds)

mhofstetter commented 5 months ago

maybe the timeout of 240s from creating the Cilium Agent DaemonSet to readiness of all its Pods is no longer enough? :face_with_diagonal_mouth:

We also have similar ci flake issues - e.g. https://github.com/cilium/cilium/issues/32431

it seems to take quite a long time until the health endpoint is up and ready - and this is used for the Pods readiness probe.

pippolo84 commented 5 months ago

Another one: https://github.com/cilium/cilium/actions/runs/9379984240/job/25826829114

K8sDatapathConfig High-scale IPcache 
  Test ingress policy enforcement with GENEVE and endpoint routes
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

...

08:22:17 STEP: WaitforPods(namespace="default", filter="-l type=client") => timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expire

...

FAIL: Client pods not ready after timeout
Unexpected error:
    <*fmt.wrapError | 0xc000068080>: 
    timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired
    {
        msg: "timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired",
        err: <*errors.errorString | 0xc000600000>{
            s: "8m0s timeout expired",
        },
    }
occurred

Full logs: logs_24547426283.zip

christarazi commented 3 months ago

Seen in https://github.com/cilium/cilium/actions/runs/9979634838/job/27579354579 from https://github.com/cilium/cilium/pull/33860.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 1 month ago

This issue has not seen any activity since it was marked stale. Closing.