cnti-testcatalog / testsuite

📞📱☎️📡🌐 Cloud Native Telecom Initiative (CNTI) Test Catalog is a tool to check for and provide feedback on the use of K8s + cloud native best practices in networking applications and platforms
https://wiki.lfnetworking.org/display/LN/Test+Catalog
Apache License 2.0

[BUG] cni_compatible test failing due to outdated Cilium #2015

Closed svteb closed 1 month ago

svteb commented 1 month ago

Describe the bug

On my machine the Cilium cluster cannot be created as described in the setup_cilium_cluster function, that is, with --version 1.10.5. The outputs below show the problem. Using a newer Cilium version (such as 1.15.4) makes the cluster deploy successfully. I am not sure whether changing the version could break something else, but since I have already tested it, it likely won't.

Pods are forever stuck in ContainerCreating:

kubectl get pods -A --kubeconfig /home/ubuntu/.cnf-testsuite/tools/kind/cilium-test_admin.conf
NAMESPACE            NAME                                                READY   STATUS              RESTARTS   AGE
cnfspace             coredns-coredns-6fc69fdfd7-gp72g                    0/1     ContainerCreating   0          5m51s
kube-system          cilium-g8nn5                                        1/1     Running             0          10m
kube-system          cilium-operator-5cd47845bf-h6g5d                    1/1     Running             1          10m
kube-system          coredns-558bd4d5db-hk6g8                            0/1     ContainerCreating   0          10m
kube-system          coredns-558bd4d5db-xz4bd                            0/1     ContainerCreating   0          10m
kube-system          etcd-cilium-test-control-plane                      1/1     Running             0          10m
kube-system          kube-apiserver-cilium-test-control-plane            1/1     Running             0          10m
kube-system          kube-controller-manager-cilium-test-control-plane   1/1     Running             0          10m
kube-system          kube-proxy-t8s5g                                    1/1     Running             0          10m
kube-system          kube-scheduler-cilium-test-control-plane            1/1     Running             1          10m
local-path-storage   local-path-provisioner-85494db59d-bjtms             0/1     ContainerCreating   0          10m
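The listing above can be filtered programmatically when checking many clusters. The following is a minimal Python sketch, not part of the testsuite: the function name `stuck_pods` and the choice of "healthy" statuses are assumptions made for illustration, and the sample rows are copied from the output above.

```python
def stuck_pods(kubectl_output, healthy=("Running", "Completed")):
    """Return (namespace, name, status) for every pod whose STATUS is not healthy.

    Expects the text printed by `kubectl get pods -A`; column positions are
    taken from the header row, so extra trailing columns are ignored.
    """
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    ns_i = header.index("NAMESPACE")
    name_i = header.index("NAME")
    status_i = header.index("STATUS")
    result = []
    for ln in lines[1:]:
        cols = ln.split()
        if cols[status_i] not in healthy:
            result.append((cols[ns_i], cols[name_i], cols[status_i]))
    return result


# Sample rows copied from the listing in this report.
SAMPLE_PODS = """\
NAMESPACE            NAME                                                READY   STATUS              RESTARTS   AGE
cnfspace             coredns-coredns-6fc69fdfd7-gp72g                    0/1     ContainerCreating   0          5m51s
kube-system          cilium-g8nn5                                        1/1     Running             0          10m
kube-system          coredns-558bd4d5db-hk6g8                            0/1     ContainerCreating   0          10m
local-path-storage   local-path-provisioner-85494db59d-bjtms             0/1     ContainerCreating   0          10m
"""

for namespace, name, status in stuck_pods(SAMPLE_PODS):
    print(f"{namespace}/{name}: {status}")
```

In this report the Cilium agent and operator pods are Running, while every pod that needs a network sandbox is stuck, which points at the CNI rather than at the workloads.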

Failing events:

kubectl describe pod coredns-558bd4d5db-hk6g8 -n kube-system --kubeconfig /home/ubuntu/.cnf-testsuite/tools/kind/cilium-test_admin.conf
Name:                 coredns-558bd4d5db-hk6g8
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 cilium-test-control-plane/172.18.0.3
Start Time:           Mon, 06 May 2024 06:12:20 +0000
Labels:               k8s-app=kube-dns
                      pod-template-hash=558bd4d5db
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/coredns-558bd4d5db
Containers:
  coredns:
    Container ID:  
    Image:         k8s.gcr.io/coredns/coredns:v1.8.0
    Image ID:      
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7x6f2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  kube-api-access-7x6f2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               10m                default-scheduler  Successfully assigned kube-system/coredns-558bd4d5db-hk6g8 to cilium-test-control-plane
  Warning  FailedScheduling        12m (x2 over 12m)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Warning  FailedCreatePodSandBox  8m59s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "fd8413af31a7da4b9be76534009b5a2227a992b08f6f1ff865d633375a45e7fd": Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  7m15s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a243887b52479c5c420449424ba67ace45b040c879576ea206dbcaea72455f3d": Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  5m34s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cb56ad22c261ac9254a6dce3bc271379b4ace1d23633079e3d3c122d23513bdc": Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  3m52s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9ece139b4a71c544f69fecfd1e1194e53ea3c1b300d5f9917bbb8cc07f9cf5eb": Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  2m9s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "963f1076eb73eed13ed9f4db65f0a45872f0928ef4598891bb3565bb738b0d7e": Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  23s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8d0905c8cc9b19c57f22a5eba492db149285fe9910d270f3d0ae5e02e976b832": Unable to create endpoint: Cilium API client timeout exceeded
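To see at a glance which event dominates, the Events table of `kubectl describe pod` can be tallied by type and reason. This is a rough Python sketch under stated assumptions: the helper name `count_event_reasons` is hypothetical, the sample is abridged from the output above (messages shortened), and each table row is counted once even when the Age column reports repeats such as "x2 over 12m".

```python
import re
from collections import Counter

# An event row starts with the event type, followed by the reason.
EVENT_ROW = re.compile(r"^\s*(Normal|Warning)\s+(\S+)\s")


def count_event_reasons(describe_output):
    """Count rows in the Events section, keyed by (type, reason)."""
    counts = Counter()
    in_events = False
    for line in describe_output.splitlines():
        if line.startswith("Events:"):
            in_events = True
            continue
        if in_events:
            match = EVENT_ROW.match(line)
            if match:
                counts[(match.group(1), match.group(2))] += 1
    return counts


# Abridged from the describe output in this report.
SAMPLE_EVENTS = """\
Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               10m                default-scheduler  Successfully assigned
  Warning  FailedScheduling        12m (x2 over 12m)  default-scheduler  0/1 nodes are available
  Warning  FailedCreatePodSandBox  8m59s              kubelet            Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  7m15s              kubelet            Unable to create endpoint: Cilium API client timeout exceeded
"""

for (event_type, reason), n in count_event_reasons(SAMPLE_EVENTS).most_common():
    print(f"{event_type:8} {reason:24} {n}")
```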

Strangely high CPU usage:

docker stats
CONTAINER ID   NAME                        CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
f1f8a01d94ec   cilium-test-control-plane   526.35%   868.4MiB / 31.34GiB   2.71%     172MB / 35.4MB    950kB / 1.32GB    284
a7dff7d9c8ae   calico-test-control-plane   24.21%    985MiB / 31.34GiB     3.07%     198MB / 6.86MB    25.8MB / 1.71GB   473
top - 06:27:32 up 6 days, 22:46,  0 users,  load average: 9.13, 7.90, 5.87
Tasks:  37 total,   6 running,  31 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.4 us, 66.1 sy,  0.0 ni, 22.5 id,  5.9 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  32092.5 total,   2628.5 free,   3065.2 used,  26398.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  28533.5 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                       
   6950 root      20   0    7672   3596   2508 R  99.3   0.0   0:11.03 tc                                                                                            
   6957 root      20   0    7416   3520   2560 R  99.3   0.0   0:07.21 tc                                                                                            
   6953 root      20   0    7416   3576   2616 R  98.7   0.0   0:09.26 tc                                                                                            
   6973 root      20   0    7288   3264   2464 R  98.7   0.0   0:04.91 tc                                                                                            
   6959 root      20   0    7288   3424   2628 R  98.3   0.0   0:06.68 tc
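As a quick sanity check on the figures above, the per-process %CPU values in the top output can be summed per command. The sketch below is illustrative only: the function name `cpu_by_command` is an assumption, and it expects the default top column layout shown above with dot decimal separators.

```python
from collections import defaultdict


def cpu_by_command(top_output):
    """Sum the %CPU column per COMMAND in batch-style `top` output."""
    totals = defaultdict(float)
    cpu_i = cmd_i = None
    for line in top_output.splitlines():
        cols = line.split()
        if cpu_i is None:
            # Skip the summary lines until the process table header appears.
            if "%CPU" in cols and "COMMAND" in cols:
                cpu_i, cmd_i = cols.index("%CPU"), cols.index("COMMAND")
            continue
        if len(cols) > cmd_i:
            totals[cols[cmd_i]] += float(cols[cpu_i])
    return dict(totals)


# Process rows copied from the top output in this report.
SAMPLE_TOP = """\
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   6950 root      20   0    7672   3596   2508 R  99.3   0.0   0:11.03 tc
   6957 root      20   0    7416   3520   2560 R  99.3   0.0   0:07.21 tc
   6953 root      20   0    7416   3576   2616 R  98.7   0.0   0:09.26 tc
   6973 root      20   0    7288   3264   2464 R  98.7   0.0   0:04.91 tc
   6959 root      20   0    7288   3424   2628 R  98.3   0.0   0:06.68 tc
"""

for command, cpu in cpu_by_command(SAMPLE_TOP).items():
    print(f"{command}: {cpu:.1f}% CPU")
```

For the five rows shown, the tc processes add up to roughly 494% CPU, which accounts for most of the 526.35% that docker stats reports for cilium-test-control-plane.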

To Reproduce

Steps to reproduce the behavior:

  1. ./cnf-testsuite cnf_setup cnf-config=sample-cnfs/sample-coredns-cnf/cnf-testsuite.yml
  2. ./cnf-testsuite cni_compatible
  3. kubectl get pods -A --kubeconfig ~/.cnf-testsuite/tools/kind/cilium-test_admin.conf
  4. Calico seems to pass for me, but Cilium fails once the 180-attempt timeout is reached.
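The "180 attempts" in step 4 suggests a bounded polling loop that gives up after a fixed number of checks. The following is a generic Python sketch of that pattern, not the testsuite's actual implementation: the function name `wait_until`, the one-second delay, and the injectable `sleep` parameter are all assumptions made for illustration.

```python
import time


def wait_until(check, attempts=180, delay_seconds=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times, pausing between tries.

    Returns the 1-based attempt number on which check() first returned a
    truthy value, or None if every attempt failed (the timeout case).
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    return None


# Usage with a stand-in readiness check that succeeds on the third call.
responses = iter([False, False, True])
result = wait_until(lambda: next(responses), attempts=5, sleep=lambda _: None)
print("ready after attempt", result)
```

With a loop of this shape, pods that never leave ContainerCreating exhaust every attempt, so the failure only surfaces after the full timeout rather than when the first sandbox error occurs.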

Expected behavior

The Cilium cluster should be deployed without issues.

Device:

Linux, Ubuntu Server 22.04, x86
kind version: v0.22.0
minikube version: v1.32.0
kubectl version: v1.23.13

Once this issue is addressed, how will the fix be verified?

Hopefully it will not break the GitHub Actions :).