
CI: multiple tests, `level=error msg="Failed to get possibly stale ciliumendpoints from apiserver, skipping."` #22601

Closed: nbusseneau closed this issue 1 year ago

nbusseneau commented 1 year ago

Test Name

K8sAgentHubbleTest Hubble Observe Test L3/L4 Flow
K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks service on same node
K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications
K8sAgentIstioTest Istio Bookinfo Demo Tests bookinfo inter-service connectivity
K8sAgentPolicyTest Basic Test TLS policy
K8sKafkaPolicyTest Kafka Policy Tests KafkaPolicies
K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output

level=error msg="Failed to get possibly stale ciliumendpoints from apiserver, skipping."

Stack Trace

/home/jenkins/workspace/cilium-master-k8s-1.22-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:415
Found 1 k8s-app=cilium logs matching list of errors that must be investigated:
level=error
/home/jenkins/workspace/cilium-master-k8s-1.22-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:413
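
For context on why a single line fails the run: the harness referenced in the stack trace (test/ginkgo-ext/scopes.go) scans the cilium pod logs after each test for substrings that must be investigated, and one match is enough to fail. A rough, simplified sketch of that kind of check (not the actual harness code; the substring list is illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mustInvestigate lists substrings that fail a test run when they show up in
// the cilium pod logs; "level=error" is the one matched in this issue.
// These values are illustrative, not the harness's actual list.
var mustInvestigate = []string{
	"level=error",
	"Cilium API handler panicked",
	"Goroutine took lock for more than",
}

// scanLogs returns the log lines that match any blocklisted substring,
// mimicking the post-test check that produced the failure above.
func scanLogs(logs string) []string {
	var matches []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		for _, bad := range mustInvestigate {
			if strings.Contains(line, bad) {
				matches = append(matches, line)
				break
			}
		}
	}
	return matches
}

func main() {
	logs := "level=info msg=\"all good\"\n" +
		"level=error msg=\"Failed to get possibly stale ciliumendpoints from apiserver, skipping.\""
	for _, m := range scanLogs(logs) {
		fmt.Println("must be investigated:", m)
	}
}
```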

Standard Output

Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
⚠️  Found "level=error" in logs 3 times
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 3
⚠️  Number of "level=warning" in logs: 7
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
Top 5 errors/warnings:
Failed to get possibly stale ciliumendpoints from apiserver, skipping.
CONFIG_CGROUP_BPF optional kernel parameter is not in kernel (needed for: Host Reachable Services and Sockmap optimization)
CONFIG_LWTUNNEL_BPF optional kernel parameter is not in kernel (needed for: Lightweight Tunnel hook for IP-in-IP encapsulation)
Key allocation attempt failed
Unable to restore endpoint, ignoring
Cilium pods: [cilium-57t87 cilium-kqnt2]
Netpols loaded: 
CiliumNetworkPolicies loaded: 
Endpoint Policy Enforcement:
Pod                             Ingress   Egress
app1-6bf9bf9bd5-jmz4l           false     false
app2-58757b7dd5-nrs8z           false     false
app3-5d69599cdd-nlcvw           false     false
coredns-69b675786c-dv6xk        false     false
hubble-relay-6f646854b5-hfnvn   false     false
app1-6bf9bf9bd5-9v26g           false     false
Cilium agent 'cilium-57t87': Status: Ok  Health: Ok Nodes "" ContainerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 23 Failed 0
Cilium agent 'cilium-kqnt2': Status: Ok  Health: Ok Nodes "" ContainerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 40 Failed 0

Standard Error

20:10:50 STEP: Running BeforeAll block for EntireTestsuite K8sAgentHubbleTest Hubble Observe
20:10:50 STEP: Ensuring the namespace kube-system exists
20:10:50 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
20:10:50 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
20:10:51 STEP: Deleting pods [echo-55fdf5787d-2jr79,echo-55fdf5787d-7t4kp] in namespace default
20:10:51 STEP: Waiting for 2 deletes to return (echo-55fdf5787d-2jr79,echo-55fdf5787d-7t4kp)
20:11:01 STEP: Unable to delete pods echo-55fdf5787d-7t4kp with 'kubectl -n default delete pods echo-55fdf5787d-7t4kp': Exitcode: -1 
Err: signal: killed
Stdout:
     pod "echo-55fdf5787d-7t4kp" deleted

Stderr:

20:11:01 STEP: Unable to delete pods echo-55fdf5787d-2jr79 with 'kubectl -n default delete pods echo-55fdf5787d-2jr79': Exitcode: -1 
Err: signal: killed
Stdout:
     pod "echo-55fdf5787d-2jr79" deleted

Stderr:

20:11:01 STEP: Installing Cilium
20:11:02 STEP: Waiting for Cilium to become ready
20:11:14 STEP: Restarting unmanaged pods hubble-relay-6f646854b5-4c6bt in namespace kube-system
20:11:15 STEP: Validating if Kubernetes DNS is deployed
20:11:15 STEP: Checking if deployment is ready
20:11:15 STEP: Checking if kube-dns service is plumbed correctly
20:11:15 STEP: Checking if DNS can resolve
20:11:15 STEP: Checking if pods have identity
20:11:17 STEP: Kubernetes DNS is up and operational
20:11:17 STEP: Validating Cilium Installation
20:11:17 STEP: Performing Cilium controllers preflight check
20:11:17 STEP: Performing Cilium health check
20:11:17 STEP: Checking whether host EP regenerated
20:11:17 STEP: Performing Cilium status preflight check
20:11:17 STEP: Performing Cilium service preflight check
20:11:17 STEP: Performing K8s service preflight check
20:11:18 STEP: Cilium is not ready yet: controllers are failing: cilium-agent 'cilium-57t87': controller ipcache-inject-labels is failing: Exitcode: 0 
Stdout:
     KVStore:                Ok   Disabled
     Kubernetes:             Ok   1.22 (v1.22.13) [linux/amd64]
     Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumEndpointSlice", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
     KubeProxyReplacement:   Disabled   
     Host firewall:          Disabled
     CNI Chaining:           none
     CNI Config file:        CNI configuration file management disabled
     Cilium:                 Ok   1.12.90 (v1.12.90-0f0b167e)
     NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory
     IPAM:                   IPv4: 3/254 allocated from 10.0.0.0/24, IPv6: 3/254 allocated from fd02::/120
     IPv6 BIG TCP:           Disabled
     BandwidthManager:       Disabled
     Host Routing:           Legacy
     Masquerading:           IPTables [IPv4: Enabled, IPv6: Enabled]
     Controller Status:      20/21 healthy
       Name                                  Last success   Last error   Count   Message
       bpf-map-sync-cilium_lxc               6s ago         never        0       no error                     
       cilium-health-ep                      5s ago         never        0       no error                     
       dns-garbage-collector-job             9s ago         never        0       no error                     
       endpoint-2565-regeneration-recovery   never          never        0       no error                     
       endpoint-3100-regeneration-recovery   never          never        0       no error                     
       endpoint-3437-regeneration-recovery   never          never        0       no error                     
       endpoint-gc                           9s ago         never        0       no error                     
       ipcache-inject-labels                 never          7s ago       8       k8s cache not fully synced   
       k8s-heartbeat                         9s ago         never        0       no error                     
       link-cache                            6s ago         never        0       no error                     
       metricsmap-bpf-prom-sync              4s ago         never        0       no error                     
       resolve-identity-2565                 2s ago         never        0       no error                     
       resolve-identity-3437                 5s ago         never        0       no error                     
       restoring-ep-identity (3100)          6s ago         never        0       no error                     
       sync-endpoints-and-host-ips           6s ago         never        0       no error                     
       sync-lb-maps-with-k8s-services        6s ago         never        0       no error                     
       sync-policymap-3100                   2s ago         never        0       no error                     
       sync-to-k8s-ciliumendpoint (2565)     1s ago         never        0       no error                     
       sync-to-k8s-ciliumendpoint (3100)     6s ago         never        0       no error                     
       sync-to-k8s-ciliumendpoint (3437)     5s ago         never        0       no error                     
       template-dir-watcher                  never          never        0       no error                     
     Proxy Status:            OK, ip 10.0.0.47, 0 redirects active on ports 10000-20000
     Global Identity Range:   min 256, max 65535
     Hubble:                  Ok   Current/Max Flows: 795/65535 (1.21%), Flows/s: 160.14   Metrics: Ok
     Encryption:              Disabled
     Cluster health:          0/2 reachable   (2022-12-01T20:11:12Z)
       Name                   IP              Node        Endpoints
       k8s2 (localhost)       192.168.56.12   reachable   unreachable
       k8s1                   192.168.56.11   reachable   unreachable

Stderr:
     Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)

20:11:18 STEP: Performing Cilium controllers preflight check
20:11:18 STEP: Performing Cilium status preflight check
20:11:18 STEP: Performing Cilium health check
20:11:18 STEP: Checking whether host EP regenerated
20:11:19 STEP: Performing Cilium service preflight check
20:11:19 STEP: Performing K8s service preflight check
20:11:20 STEP: Cilium is not ready yet: controllers are failing: cilium-agent 'cilium-57t87': controller ipcache-inject-labels is failing: Exitcode: 0 
Stdout:
     KVStore:                Ok   Disabled
     Kubernetes:             Ok   1.22 (v1.22.13) [linux/amd64]
     Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumEndpointSlice", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
     KubeProxyReplacement:   Disabled   
     Host firewall:          Disabled
     CNI Chaining:           none
     CNI Config file:        CNI configuration file management disabled
     Cilium:                 Ok   1.12.90 (v1.12.90-0f0b167e)
     NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory
     IPAM:                   IPv4: 3/254 allocated from 10.0.0.0/24, IPv6: 3/254 allocated from fd02::/120
     IPv6 BIG TCP:           Disabled
     BandwidthManager:       Disabled
     Host Routing:           Legacy
     Masquerading:           IPTables [IPv4: Enabled, IPv6: Enabled]
     Controller Status:      20/21 healthy
       Name                                  Last success   Last error   Count   Message
       bpf-map-sync-cilium_lxc               8s ago         never        0       no error                     
       cilium-health-ep                      7s ago         never        0       no error                     
       dns-garbage-collector-job             10s ago        never        0       no error                     
       endpoint-2565-regeneration-recovery   never          never        0       no error                     
       endpoint-3100-regeneration-recovery   never          never        0       no error                     
       endpoint-3437-regeneration-recovery   never          never        0       no error                     
       endpoint-gc                           10s ago        never        0       no error                     
       ipcache-inject-labels                 never          9s ago       8       k8s cache not fully synced   
       k8s-heartbeat                         10s ago        never        0       no error                     
       link-cache                            8s ago         never        0       no error                     
       metricsmap-bpf-prom-sync              5s ago         never        0       no error                     
       resolve-identity-2565                 3s ago         never        0       no error                     
       resolve-identity-3437                 7s ago         never        0       no error                     
       restoring-ep-identity (3100)          8s ago         never        0       no error                     
       sync-endpoints-and-host-ips           8s ago         never        0       no error                     
       sync-lb-maps-with-k8s-services        8s ago         never        0       no error                     
       sync-policymap-3100                   3s ago         never        0       no error                     
       sync-to-k8s-ciliumendpoint (2565)     3s ago         never        0       no error                     
       sync-to-k8s-ciliumendpoint (3100)     8s ago         never        0       no error                     
       sync-to-k8s-ciliumendpoint (3437)     7s ago         never        0       no error                     
       template-dir-watcher                  never          never        0       no error                     
     Proxy Status:            OK, ip 10.0.0.47, 0 redirects active on ports 10000-20000
     Global Identity Range:   min 256, max 65535
     Hubble:                  Ok   Current/Max Flows: 795/65535 (1.21%), Flows/s: 160.14   Metrics: Ok
     Encryption:              Disabled
     Cluster health:          2/2 reachable   (2022-12-01T20:11:18Z)

Stderr:
     Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)

20:11:20 STEP: Performing Cilium controllers preflight check
20:11:20 STEP: Performing Cilium status preflight check
20:11:20 STEP: Performing Cilium health check
20:11:20 STEP: Checking whether host EP regenerated
20:11:21 STEP: Performing Cilium service preflight check
20:11:21 STEP: Performing K8s service preflight check
20:11:23 STEP: Waiting for cilium-operator to be ready
20:11:23 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator")
20:11:23 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator") => <nil>
20:11:23 STEP: Waiting for hubble-relay to be ready
20:11:23 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=hubble-relay")
20:11:23 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=hubble-relay") => <nil>
20:11:23 STEP: Deleting namespace 202212012011k8sagenthubbletesthubbleobservetestl3l4flow
20:11:23 STEP: Creating namespace 202212012011k8sagenthubbletesthubbleobservetestl3l4flow
20:11:24 STEP: WaitforPods(namespace="202212012011k8sagenthubbletesthubbleobservetestl3l4flow", filter="-l zgroup=testapp")
20:11:29 STEP: WaitforPods(namespace="202212012011k8sagenthubbletesthubbleobservetestl3l4flow", filter="-l zgroup=testapp") => <nil>
=== Test Finished at 2022-12-01T20:11:29Z====
20:11:29 STEP: Running JustAfterEach block for EntireTestsuite K8sAgentHubbleTest Hubble Observe
FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:
level=error
===================== TEST FAILED =====================
20:11:29 STEP: Running AfterFailed block for EntireTestsuite K8sAgentHubbleTest Hubble Observe
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
     NAMESPACE                                                 NAME                              READY   STATUS    RESTARTS      AGE   IP              NODE   NOMINATED NODE   READINESS GATES
     202212012011k8sagenthubbletesthubbleobservetestl3l4flow   app1-6bf9bf9bd5-9v26g             2/2     Running   0             7s    10.0.1.146      k8s1   <none>           <none>
     202212012011k8sagenthubbletesthubbleobservetestl3l4flow   app1-6bf9bf9bd5-jmz4l             2/2     Running   0             7s    10.0.1.115      k8s1   <none>           <none>
     202212012011k8sagenthubbletesthubbleobservetestl3l4flow   app2-58757b7dd5-nrs8z             1/1     Running   0             7s    10.0.1.173      k8s1   <none>           <none>
     202212012011k8sagenthubbletesthubbleobservetestl3l4flow   app3-5d69599cdd-nlcvw             1/1     Running   0             7s    10.0.1.137      k8s1   <none>           <none>
     cilium-monitoring                                         grafana-5747bcc8f9-vvbvr          0/1     Running   0             45m   10.0.0.232      k8s1   <none>           <none>
     cilium-monitoring                                         prometheus-655fb888d7-8nvcj       1/1     Running   0             45m   10.0.0.41       k8s1   <none>           <none>
     kube-system                                               cilium-57t87                      1/1     Running   0             29s   192.168.56.12   k8s2   <none>           <none>
     kube-system                                               cilium-kqnt2                      1/1     Running   0             29s   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               cilium-operator-c4977bdc6-d46pm   1/1     Running   0             29s   192.168.56.12   k8s2   <none>           <none>
     kube-system                                               cilium-operator-c4977bdc6-tvjrf   1/1     Running   0             29s   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               coredns-69b675786c-dv6xk          1/1     Running   0             18m   10.0.1.87       k8s1   <none>           <none>
     kube-system                                               etcd-k8s1                         1/1     Running   0             50m   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               hubble-relay-6f646854b5-hfnvn     1/1     Running   0             16s   10.0.0.159      k8s2   <none>           <none>
     kube-system                                               kube-apiserver-k8s1               1/1     Running   0             50m   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               kube-controller-manager-k8s1      1/1     Running   5 (21m ago)   50m   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               kube-proxy-79kqw                  1/1     Running   0             47m   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               kube-proxy-kcn6c                  1/1     Running   0             46m   192.168.56.12   k8s2   <none>           <none>
     kube-system                                               kube-scheduler-k8s1               1/1     Running   4 (21m ago)   50m   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               log-gatherer-hpccv                1/1     Running   0             45m   192.168.56.12   k8s2   <none>           <none>
     kube-system                                               log-gatherer-kmlfz                1/1     Running   0             45m   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               registry-adder-4g8kg              1/1     Running   0             46m   192.168.56.11   k8s1   <none>           <none>
     kube-system                                               registry-adder-lj5g7              1/1     Running   0             46m   192.168.56.12   k8s2   <none>           <none>

Stderr:

Fetching command output from pods [cilium-57t87 cilium-kqnt2]
cmd: kubectl exec -n kube-system cilium-57t87 -c cilium-agent -- cilium endpoint list
Exitcode: 0 
Stdout:
     ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                  IPv6       IPv4         STATUS   
                ENFORCEMENT        ENFORCEMENT                                                                                                                       
     2565       Disabled           Disabled          9391       k8s:app.kubernetes.io/name=hubble-relay                                      fd02::61   10.0.0.159   ready   
                                                                k8s:app.kubernetes.io/part-of=cilium                                                                         
                                                                k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system                                   
                                                                k8s:io.cilium.k8s.policy.cluster=default                                                                     
                                                                k8s:io.cilium.k8s.policy.serviceaccount=hubble-relay                                                         
                                                                k8s:io.kubernetes.pod.namespace=kube-system                                                                  
                                                                k8s:k8s-app=hubble-relay                                                                                     
     3100       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s2                                                                           ready   
                                                                reserved:host                                                                                                
     3437       Disabled           Disabled          4          reserved:health                                                              fd02::ac   10.0.0.18    ready   

Stderr:

cmd: kubectl exec -n kube-system cilium-kqnt2 -c cilium-agent -- cilium endpoint list
Exitcode: 0 
Stdout:
     ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                                                              IPv6        IPv4         STATUS   
                ENFORCEMENT        ENFORCEMENT                                                                                                                                                                    
     15         Disabled           Disabled          4          reserved:health                                                                                                          fd02::149   10.0.1.214   ready   
     58         Disabled           Disabled          53184      k8s:appSecond=true                                                                                                       fd02::184   10.0.1.173   ready   
                                                                k8s:id=app2                                                                                                                                               
                                                                k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                    
                                                                k8s:io.cilium.k8s.policy.cluster=default                                                                                                                  
                                                                k8s:io.cilium.k8s.policy.serviceaccount=app2-account                                                                                                      
                                                                k8s:io.kubernetes.pod.namespace=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                                                   
                                                                k8s:zgroup=testapp                                                                                                                                        
     210        Disabled           Disabled          27555      k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system                                               fd02::1f2   10.0.1.87    ready   
                                                                k8s:io.cilium.k8s.policy.cluster=default                                                                                                                  
                                                                k8s:io.cilium.k8s.policy.serviceaccount=coredns                                                                                                           
                                                                k8s:io.kubernetes.pod.namespace=kube-system                                                                                                               
                                                                k8s:k8s-app=kube-dns                                                                                                                                      
     440        Disabled           Disabled          26319      k8s:id=app1                                                                                                              fd02::1d9   10.0.1.115   ready   
                                                                k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                    
                                                                k8s:io.cilium.k8s.policy.cluster=default                                                                                                                  
                                                                k8s:io.cilium.k8s.policy.serviceaccount=app1-account                                                                                                      
                                                                k8s:io.kubernetes.pod.namespace=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                                                   
                                                                k8s:zgroup=testapp                                                                                                                                        
     1271       Disabled           Disabled          26319      k8s:id=app1                                                                                                              fd02::15f   10.0.1.146   ready   
                                                                k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                    
                                                                k8s:io.cilium.k8s.policy.cluster=default                                                                                                                  
                                                                k8s:io.cilium.k8s.policy.serviceaccount=app1-account                                                                                                      
                                                                k8s:io.kubernetes.pod.namespace=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                                                   
                                                                k8s:zgroup=testapp                                                                                                                                        
     1364       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s1                                                                                                                        ready   
                                                                k8s:node-role.kubernetes.io/control-plane                                                                                                                 
                                                                k8s:node-role.kubernetes.io/master                                                                                                                        
                                                                k8s:node.kubernetes.io/exclude-from-external-load-balancers                                                                                               
                                                                reserved:host                                                                                                                                             
     1721       Disabled           Disabled          33745      k8s:id=app3                                                                                                              fd02::1fa   10.0.1.137   ready   
                                                                k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                    
                                                                k8s:io.cilium.k8s.policy.cluster=default                                                                                                                  
                                                                k8s:io.cilium.k8s.policy.serviceaccount=default                                                                                                           
                                                                k8s:io.kubernetes.pod.namespace=202212012011k8sagenthubbletesthubbleobservetestl3l4flow                                                                   
                                                                k8s:zgroup=testapp                                                                                                                                        

Stderr:

===================== Exiting AfterFailed =====================
20:11:39 STEP: Running AfterEach for block EntireTestsuite

Resources

Anything else?

This looks very similar to #21175, the difference being that #21175's error log is `Cannot create CEP`. It even confused MLH (see https://github.com/cilium/cilium/issues/21175#issuecomment-1341041949).

pchaigno commented 1 year ago

Workaround (fix?) being worked on at https://github.com/cilium/cilium/pull/22600.

pchaigno commented 1 year ago

Assigning @tommyp1ckles but that's just to reflect who is looking into it at the moment.

tommyp1ckles commented 1 year ago

I've traced down the logs of some of the cases of this happening. It appears that each one is the result of a Pod being deleted just prior to Cilium restarting, seemingly related to test cleanup.

Example: pod/test-k8s2-7f96d84c65-9s5cb (in the last example: a4563fb1_K8sAgentHubbleTest_Hubble_Observe_Test_L3-L4_Flow.zip)

The only missing piece is where these CEPs are getting deleted: if the EndpointSynchronizer is failing to delete due to an empty UID, and the CNI delete is failing, what's removing the CEP? Probably the operator?

In other cases, the origin of the delete is clear (it is logged as being part of the test). Importantly, they all share the same failure to restore due to a missing Pod.

So, however they're getting into that state, I think this looks OK as far as CEP cleanup doing its job. It is notable that the CEP is still in the cache during the CEP cleanup, and because restore fails, the cleanup attempts to delete CEPs that no longer exist.

tl;dr: the unusual logs seem to be related to the close proximity of cleaning up old test pods and restarting Cilium for a new test suite. In this case, we should make changes similar to Jarno's, where we avoid logging a warning in that situation (info is fine).
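
To illustrate the direction being discussed (a minimal sketch, not the actual change in https://github.com/cilium/cilium/pull/22600; the helper and its signature are hypothetical): during the startup cleanup, a NotFound from the apiserver simply means the CEP is already gone, so it can be handled as a no-op and logged below error level.

```go
package main

import (
	"context"
	"fmt"

	"github.com/sirupsen/logrus"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// getCEP stands in for a clientset call that fetches a CiliumEndpoint from
// the apiserver; it is abstracted here to keep the sketch self-contained.
type getCEP func(ctx context.Context, namespace, name string) error

// cleanupStaleCEP shows the error handling under discussion: a NotFound
// response during the startup cleanup means the CEP is already gone (e.g. the
// Pod was deleted right before the agent restarted), so it is treated as a
// no-op and logged below error level instead of tripping the CI check.
func cleanupStaleCEP(ctx context.Context, get getCEP, namespace, name string) error {
	err := get(ctx, namespace, name)
	switch {
	case err == nil:
		// The CEP still exists; the real code would compare ownership/UID
		// and delete it if it no longer belongs to a live endpoint.
		return nil
	case k8serrors.IsNotFound(err):
		logrus.WithFields(logrus.Fields{
			"ciliumEndpointName": name,
			"k8sNamespace":       namespace,
		}).Info("CiliumEndpoint already deleted, nothing to clean up")
		return nil
	default:
		return fmt.Errorf("failed to get possibly stale CiliumEndpoint %s/%s: %w", namespace, name, err)
	}
}

func main() {
	// Simulate the apiserver returning NotFound for a CEP whose Pod was
	// deleted during test teardown.
	notFound := func(ctx context.Context, namespace, name string) error {
		return k8serrors.NewNotFound(
			schema.GroupResource{Group: "cilium.io", Resource: "ciliumendpoints"}, name)
	}
	fmt.Println(cleanupStaleCEP(context.Background(), notFound, "kube-system", "coredns-69b675786c-ws4nv"))
}
```

The point is only that IsNotFound is handled explicitly instead of surfacing as level=error; how the real code structures this may differ.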

pchaigno commented 1 year ago

The only missing piece is where these CEPs are getting deleted: if the EndpointSynchronizer is failing to delete due to an empty UID, and the CNI delete is failing, what's removing the CEP? Probably the operator?

Could we add debug logs in the operator to confirm that? I.e., not in a PR but actually merged in master as it's useful longer term as well, no?

So, however they're getting into that state, I think this looks OK as far as CEP cleanup doing its job. It is notable that the CEP is still in the cache during the CEP cleanup, and because restore fails, the cleanup attempts to delete CEPs that no longer exist.

I'm a bit confused. Are you saying the CEP doesn't exist because the endpoint couldn't be restored or it doesn't exist because it was deleted by the operator?

tl;dr: the unusual logs seem to be related to the close proximity of cleaning up old test pods and restarting Cilium for a new test suite. In this case, we should make changes similar to Jarno's, where we avoid logging a warning in that situation (info is fine).

On the contrary, I'd go for a warning here. Warnings don't currently fail our CI and even if they did we could add an exception. The reason I'd go for a warning is because it sounds like you're saying it's an artifact of how our CI runs, but shouldn't typically happen in user environments.

tommyp1ckles commented 1 year ago

Could we add debug logs in the operator to confirm that? I.e., not in a PR but actually merged in master as it's useful longer term as well, no?

:+1:

I'm a bit confused. Are you saying the CEP doesn't exist because the endpoint couldn't be restored or it doesn't exist because it was deleted by the operator?

By the time the restore happens, I know the Pod is missing (and I assume the CEP is too); the endpoint is still in the restore state, so the agent tries to restore it but fails because the Pod is missing.

At the same time, the CEP is still in the k8s cache but is not actually being managed, so it is cleaned up, but that fails because the CEP is not in the API server.
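
As a rough sketch of that restore-time state (the restoredEndpoint type and helper below are hypothetical, not Cilium's actual restore code): the backing Pod is looked up in the apiserver, and if it is already gone the endpoint cannot be restored and its CEP is left for the stale-CEP cleanup.

```go
package main

import (
	"context"
	"fmt"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// restoredEndpoint is a hypothetical stand-in for an endpoint read back from
// the agent's on-disk state after a restart.
type restoredEndpoint struct {
	ID        uint16
	Namespace string
	PodName   string
}

// checkRestorable returns an error when the endpoint's Pod no longer exists,
// mirroring the "Unable to restore endpoint, ignoring" case: the endpoint is
// dropped from restore and its CEP is left for the stale-CEP cleanup.
func checkRestorable(ctx context.Context, client kubernetes.Interface, ep restoredEndpoint) error {
	_, err := client.CoreV1().Pods(ep.Namespace).Get(ctx, ep.PodName, metav1.GetOptions{})
	if k8serrors.IsNotFound(err) {
		return fmt.Errorf("unable to restore endpoint %d: pod %s/%s no longer exists",
			ep.ID, ep.Namespace, ep.PodName)
	}
	return err
}

func main() {
	// A fake clientset with no Pods simulates the Pod having been deleted
	// during test teardown while the agent was down.
	client := fake.NewSimpleClientset()
	ep := restoredEndpoint{ID: 1764, Namespace: "default", PodName: "echo-55fdf5787d-2jr79"}
	fmt.Println(checkRestorable(context.Background(), client, ep))
}
```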

On the contrary, I'd go for a warning here. Warnings don't currently fail our CI and even if they did we could add an exception. The reason I'd go for a warning is because it sounds like you're saying it's an artifact of how our CI runs, but shouldn't typically happen in user environments.

Makes sense, maybe info in that case, since we're capturing the NotFound scenario separately?

pchaigno commented 1 year ago

On the contrary, I'd go for a warning here. Warnings don't currently fail our CI and even if they did we could add an exception. The reason I'd go for a warning is because it sounds like you're saying it's an artifact of how our CI runs, but shouldn't typically happen in user environments.

Makes sense, maybe info in that case, since we're capturing the NotFound scenario separately?

I think you're better able to judge than me in this case.

In general, I'd consider warning cases only for when we know something is wrong but don't expect it to affect normal operations. If it affected normal operations, it should be an error. If we are unsure something is wrong, it should be info, mostly because a warning for normal operations is very annoying to users as they can't really avoid it (and thus it tends to normalize warnings in logs).

tommyp1ckles commented 1 year ago

In general, I'd consider warning cases only for when we know something is wrong but don't expect it to affect normal operations. If it affected normal operations, it should be an error. If we are unsure something is wrong, it should be info, mostly because a warning for normal operations is very annoying to users as they can't really avoid it (and thus it tends to normalize warnings in logs).

Yeah I tend to prefer info over warning, unless the impact is really unclear.

In this case, I would say that a no-op cleanup is expected behavior (in the case of the CEP being missing already) and an info is sufficient.

tommyp1ckles commented 1 year ago

Updated PR here: https://github.com/cilium/cilium/pull/22600

pchaigno commented 1 year ago

But we still don't really know who deleted the CEP, right?

tommyp1ckles commented 1 year ago

If the Pod is missing at startup restore, then it must have been deleted (it's not like the CEP is just missing with the Pod sticking around). I'm just confused since I see PodSandboxErrors for missing CNI events with no successful delete.

Somehow, it seems like a legitimate sandbox delete went through...

The reason the CEP is still in the CES at startup is seemingly that the operator is stuck blocking on acquiring a lease, so it doesn't actually start running until a few minutes after this. So that explains why the cache has a CES with missing CEPs at least.
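
For reference, that blocking behavior matches how Kubernetes lease-based leader election generally works; below is a minimal, generic client-go sketch (not the actual cilium-operator code; names and timings are illustrative). Nothing in OnStartedLeading, which is where an operator would start its controllers, runs until the Lease is acquired.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease-based lock; name, namespace and identity are illustrative.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-operator", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "operator-replica-1"},
	}

	// RunOrDie blocks in the election loop: if another replica (or a stale
	// holder) owns the Lease, OnStartedLeading is not invoked until the
	// Lease is acquired, so everything started there is delayed too.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				fmt.Println("lease acquired, starting controllers")
				// Start controllers (e.g. CEP/CES garbage collection) here.
			},
			OnStoppedLeading: func() {
				fmt.Println("lease lost, stopping")
			},
		},
	})
}
```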

tommyp1ckles commented 1 year ago

I'll look at the method of teardown between these tests this afternoon; maybe there's something that's just not being captured in the logs.

tommyp1ckles commented 1 year ago

Could we add debug logs in the operator to confirm that? I.e., not in a PR but actually merged in master as it's useful longer term as well, no?

I don't think the operator is actually doing anything here, as it's not actually running at that time.

tommyp1ckles commented 1 year ago

Looking through a few more instances of the warning:

2022-12-01T20:11:11.009769887Z level=debug msg="Error regenerating endpoint: BPF template compilation failed: failed to compile template program /var/run/cilium/state/templates/03972ae32602b2e12bcd141f504f54c66386382f8301f74d20ea4fedb2ca379d: Failed to compile bpf_lxc.dbg.o: Command execution failed for [llc -march=bpf -mcpu=v1 -filetype=obj -o /var/run/cilium/state/templates/03972ae32602b2e12bcd141f504f54c66386382f8301f74d20ea4fedb2ca379d/bpf_lxc.dbg.o]: context canceled" code=Failure containerID=599f17c102 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1764 endpointState=waiting-to-regenerate identity=54977 ipv4=10.0.0.156 ipv6=10.0.0.156 k8sPodName=default/echo-55fdf5787d-2jr79 policyRevision=0 subsys=endpoint type=200
...

^ Not sure why this happened, but it seems to have resulted in EP restore failing out.

2022-12-03T19:03:21.168642187Z level=debug msg="CEP deleted, calling endpointDeleted" CEPName=kube-system/coredns-69b675786c-ws4nv CESName=ces-bcn6s7dw7-wwby7 subsys=k8s-watcher
...
2022-12-03T19:03:23.138284852Z level=error msg="Failed to get possibly stale ciliumendpoints from apiserver, skipping." ciliumEndpointName=coredns-69b675786c-ws4nv error="ciliumendpoints.cilium.io \"coredns-69b675786c-ws4nv\" not found" k8sNamespace=kube-system subsys=daemon
  1       testclient-2-s76t8.172d5d1d962971a2
default             115s        Normal    Killing                   pod/testclient-2-s76t8                      spec.containers{web}  
                         kubelet, k8s1                               Stopping container 
...
2022-12-03T18:53:04.082902496Z level=error msg="Failed to get possibly stale ciliumendpoints from apiserver, skipping." ciliumEndpointName=testclient-2-s76t8 error="ciliumendpoints.cilium.io \"testclient-2-s76t8\" not found" k8sNamespace=default subsys=daemon

Looking through a few more cases in the test dumps, most instances have some explanation for why the CEP is missing at the time of the cleanup. The previous case was the one where it was a bit unclear where exactly the delete happened.

The cleanup between tests is what's causing the situation to arise more often than you would expect under normal circumstances.

There are some interesting cases around the circumstances preceding the log, but I don't see much of a pattern.

tommyp1ckles commented 1 year ago

PR was merged, closing for now.