Open xunholy opened 2 years ago
Thanks for the report! Could you please share a Cilium sysdump?
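For reference, a sysdump can be collected with the Cilium CLI from any machine with access to the cluster (assuming a reasonably recent cilium-cli):

# Collect a sysdump archive into the current directory
cilium sysdump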
I recently bumped into this with a GKE cluster; sysdump attached.
I used the CLI to install.
Kernel is 5.10, using the GKE COS image.
I swapped my GKE cluster from COS to Ubuntu and have made progress.
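For anyone wanting to try the same swap, a new GKE node pool with the Ubuntu image type can be created roughly like this (the pool, cluster, and zone names here are placeholders, not the ones I used):

# Add an Ubuntu-based node pool alongside the existing COS one
gcloud container node-pools create ubuntu-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --image-type UBUNTU_CONTAINERD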
Just hit this issue on EKS:
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         4 errors
 \__/¯¯\__/    Operator:       1 errors
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Deployment        cilium-operator    Desired: 1, Unavailable: 1/1
Containers:       cilium             Running: 2
                  cilium-operator    Running: 1
Cluster Pods:     2/2 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium-service-mesh:v1.11.0-beta.1@sha256:4252b95ce4d02f5b772fd7756d240e3c036e6c9a19e3d77bae9c3fa31c837e50: 2
                  cilium-operator    quay.io/cilium/operator-generic-service-mesh:v1.11.0-beta.1@sha256:dcf364d807e26bc3a62fc8190e6ca40b40e9fceb71c7a934e34cbf24d5a9bfa8: 1
Errors:           cilium             cilium-glppv       controller cilium-health-ep is failing since 25s (41x): Get "http://192.168.73.203:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
                  cilium             cilium-mks64       controller endpoint-2857-regeneration-recovery is failing since 1m35s (46x): regeneration recovery failed
                  cilium             cilium-mks64       controller endpoint-260-regeneration-recovery is failing since 1m35s (46x): regeneration recovery failed
                  cilium             cilium-mks64       controller cilium-health-ep is failing since 24s (41x): Get "http://192.168.3.52:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
                  cilium-operator    cilium-operator    1 pods of Deployment cilium-operator are not ready
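For anyone debugging the same thing, the failing controllers and health endpoints can be inspected from inside an affected agent; these are standard agent commands, with the pod name taken from the output above:

# Controller status and health-endpoint details from an affected agent
kubectl -n kube-system exec -it cilium-mks64 -- cilium status --verbose
kubectl -n kube-system exec -it cilium-mks64 -- cilium-health status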
Community reports suggest that deleting the pinned BPF programs on every node with sudo rm -rf /sys/fs/bpf/tc//globals/* after uninstalling and before reinstalling helps.
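A minimal sketch of that sequence, assuming cilium-cli is used for the uninstall/reinstall and that you have shell access to every node (the path is the one from the report above; the double slash is harmless to the shell):

# 1. Remove the existing installation
cilium uninstall
# 2. On every node, delete the leftover pinned BPF programs
sudo rm -rf /sys/fs/bpf/tc//globals/*
# 3. Reinstall, via the CLI or Helm, whichever was used originally
cilium install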
I forgot to provide my values; I'm installing from the Helm chart on the service mesh branch. @pchaigno, could you vet that this looks accurate? I had some questions about using autoDirectNodeRoutes together with tunnel: disabled, but I wasn't sure whether that's the right combination for bare-metal clusters (see the sanity-check sketch after the values below).
image:
  repository: quay.io/cilium/cilium-service-mesh
  tag: v1.11.0-beta.1
  useDigest: false
extraConfig:
  enable-envoy-config: "true"
# -- Enable installation of PodCIDR routes between worker
# nodes if worker nodes share a common L2 network segment.
autoDirectNodeRoutes: true
# Cilium leverages MetalLB's simplified BGP announcement system for service type: LoadBalancer
bgp:
  enabled: false
  announce:
    loadbalancerIP: true
nodePort:
  # -- Enable the Cilium NodePort service implementation.
  enabled: true
  # -- Port range to use for NodePort services.
  range: "30000,32767"
containerRuntime:
  integration: containerd
endpointRoutes:
  # -- Enable use of per endpoint routes instead of routing via
  # the cilium_host interface.
  enabled: false
# -- Enables masquerading of IPv4 traffic leaving the node from endpoints.
enableIPv4Masquerade: true
# -- Enables masquerading of IPv6 traffic leaving the node from endpoints.
enableIPv6Masquerade: true
# masquerade enables masquerading of traffic leaving the node for
# destinations outside of the cluster.
masquerade: true
hubble:
  # -- Enable Hubble (true by default).
  enabled: true
  # Enables the provided list of Hubble metrics.
  metrics:
    enabled:
      - dns:query;ignoreAAAA
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - http
  listenAddress: ':4244'
  relay:
    # -- Enable Hubble Relay (requires hubble.enabled=true)
    enabled: true
    image:
      repository: quay.io/cilium/hubble-relay-service-mesh
      tag: v1.11.0-beta.1
      useDigest: false
    # -- Roll out Hubble Relay pods automatically when configmap is updated.
    rollOutPods: true
  ui:
    # -- Whether to enable the Hubble UI.
    enabled: true
    # -- Roll out Hubble-ui pods automatically when configmap is updated.
    rollOutPods: true
ipam:
  # -- Configure IP Address Management mode.
  # ref: https://docs.cilium.io/en/stable/concepts/networking/ipam/
  mode: "kubernetes"
  operator:
    # -- Deprecated in favor of ipam.operator.clusterPoolIPv4PodCIDRList.
    # IPv4 CIDR range to delegate to individual nodes for IPAM.
    clusterPoolIPv4PodCIDR: "10.0.0.0/8"
    # -- IPv4 CIDR list range to delegate to individual nodes for IPAM.
    clusterPoolIPv4PodCIDRList: ["10.0.0.0/8"]
    # -- IPv4 CIDR mask size to delegate to individual nodes for IPAM.
    clusterPoolIPv4MaskSize: 24
    # -- Deprecated in favor of ipam.operator.clusterPoolIPv6PodCIDRList.
    # IPv6 CIDR range to delegate to individual nodes for IPAM.
    clusterPoolIPv6PodCIDR: "fd00::/104"
    # -- IPv6 CIDR list range to delegate to individual nodes for IPAM.
    clusterPoolIPv6PodCIDRList: ["fd00::/104"]
    # -- IPv6 CIDR mask size to delegate to individual nodes for IPAM.
    clusterPoolIPv6MaskSize: 120
ipv6:
  # -- Enable IPv6 support.
  enabled: false
# kubeProxyReplacement enables kube-proxy replacement in Cilium BPF datapath
# Disabled due to RockPi kernel <= 4.4
# Valid options are "disabled", "probe", "partial", "strict".
# ref: https://docs.cilium.io/en/stable/gettingstarted/kubeproxy-free/
kubeProxyReplacement: strict
# kubeProxyReplacement healthz server bind address
# To enable set the value to '0.0.0.0:10256' for all ipv4
# addresses and this '[::]:10256' for all ipv6 addresses.
# By default it is disabled.
# Can't be used as RockPi Kernel is <=4.4
kubeProxyReplacementHealthzBindAddr: '0.0.0.0:10256'
# prometheus enables serving metrics on the configured port at /metrics
# Enables metrics for cilium-agent.
prometheus:
  enabled: true
  port: 9090
  # This requires the prometheus CRDs to be available (see https://github.com/prometheus-operator/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml)
  serviceMonitor:
    enabled: false
operator:
  image:
    repository: quay.io/cilium/operator
    tag: v1.11.0-beta.1
    useDigest: false
    suffix: "-service-mesh"
  # -- Roll out cilium-operator pods automatically when configmap is updated.
  rollOutPods: true
  # Enables metrics for cilium-operator.
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: false
# kubeConfigPath: ~/.kube/config
k8sServiceHost: 192.168.1.205
k8sServicePort: 6443
# -- Specify the IPv4 CIDR for native routing (ie to avoid IP masquerade for).
# This value corresponds to the configured cluster-cidr.
# Deprecated in favor of ipv4NativeRoutingCIDR, will be removed in 1.12.
nativeRoutingCIDR: 10.0.0.0/8
# -- Specify the IPv4 CIDR for native routing (ie to avoid IP masquerade for).
# This value corresponds to the configured cluster-cidr.
ipv4NativeRoutingCIDR: 10.0.0.0/8
# tunnel is the encapsulation configuration for communication between nodes
tunnel: disabled
# loadBalancer is the general configuration for service load balancing
loadBalancer:
  # algorithm is the name of the load balancing algorithm for backend
  # selection e.g. random or maglev
  algorithm: maglev
  # mode is the operation mode of load balancing for remote backends
  # e.g. snat, dsr, hybrid
  # https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/#hybrid-dsr-and-snat-mode
  # Fixes UDP Client Source IP Preservation for Local traffic
  mode: hybrid
# disableEnvoyVersionCheck removes the check for Envoy, which can be useful on
# AArch64 as the images do not currently ship a version of Envoy.
disableEnvoyVersionCheck: false
cluster:
  # -- Name of the cluster. Only required for Cluster Mesh.
  name: default
  # -- (int) Unique ID of the cluster. Must be unique across all connected
  # clusters and in the range of 1 to 255. Only required for Cluster Mesh.
  id:
clustermesh:
  # -- Deploy clustermesh-apiserver for clustermesh
  useAPIServer: true
  apiserver:
    # -- Clustermesh API server image.
    image:
      repository: quay.io/cilium/clustermesh-apiserver
      tag: v1.11.2
    etcd:
      # -- Clustermesh API server etcd image.
      image:
        repository: quay.io/coreos/etcd
        tag: v3.5.2
        pullPolicy: IfNotPresent
# -- Roll out cilium agent pods automatically when configmap is updated.
rollOutCiliumPods: true
externalIPs:
  # -- Enable ExternalIPs service support.
  enabled: true
hostPort:
  # -- Enable hostPort service support.
  enabled: false
# -- Configure ClusterIP service handling in the host namespace (the node).
hostServices:
  # -- Enable host reachable services.
  enabled: true
  # -- Supported list of protocols to apply ClusterIP translation to.
  protocols: tcp,udp
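Regarding my autoDirectNodeRoutes question above: not an authoritative answer, just how I'd sanity-check the combination. With tunnel: disabled and autoDirectNodeRoutes: true, each node should end up with a direct route to every other node's PodCIDR (all inside the 10.0.0.0/8 range configured above), and no tunnel device should exist:

# Each node's routing table should include routes to the peer nodes' PodCIDRs
ip route show | grep -E '^10\.'
# cilium_vxlan should not exist when tunneling is disabled
ip link show cilium_vxlan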
Is there an existing issue for this?
What happened?
I upgraded an existing Cilium instance in a cluster to use the new service mesh images and settings. To do this, I did the following:
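(The exact commands aren't captured here; the upgrade was roughly of this shape, with the values shared above passed in. This is a hypothetical sketch, not a verbatim record, and the chart path depends on whether the cilium/cilium repo chart or a local checkout of the service-mesh branch is used.)

# Upgrade the existing release to the service-mesh images/values
helm upgrade cilium <path-to-chart-or-cilium/cilium> \
  --namespace kube-system \
  -f values.yaml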
This produced a successful Helm install output. I then ran cilium status and got the following errors (this agent output continues for all agents, of course):
Checking the agent logs, I can see they are flooded with the following. These logs are somewhat repetitive, so I've only grabbed the last few entries.
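(For completeness, the entries were pulled with something along these lines; the exact flags may have differed.)

# Tail the agent logs from the DaemonSet
kubectl -n kube-system logs ds/cilium --tail=50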
Cilium Version
v1.11.0
Kernel Version
Linux k8s-controlplane-01 5.11.0-1007-raspi #7-Ubuntu SMP PREEMPT Wed Apr 14 22:08:05 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
It's worth mentioning that this is the oldest kernel version in use across the fleet of nodes.
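A quick way to compare kernel versions across the fleet (these are standard Node status fields; output formatting may vary slightly):

# List each node's kernel version as reported by the kubelet
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion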
Kubernetes Version
v1.22.2
Sysdump
ip
Relevant log output
No response
Anything else?
I started by doing a
helm upgrade ...
then performed an uninstall using both Helm and the CLI to ensure everything was removed, and attempted a fresh install; however, it ended with the same results. I also attempted a rollback; however, it now continually appears to have this issue of not fully starting. I'm not sure what the actual impact of these errors is, as services continue to start and remain routable through the previous methods.
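Between the uninstall and the reinstall, this is roughly what I'd check to confirm nothing was left behind (a rough checklist, not exhaustive):

# No Cilium pods or CRDs should remain
kubectl -n kube-system get pods -l k8s-app=cilium
kubectl get crd | grep -i cilium
# And, per the comments above, no leftover pinned BPF programs on each node
ls /sys/fs/bpf/tc/globals/ 2>/dev/null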
Code of Conduct