I also have rp_filter explicitly disabled at boot on all interfaces:
lex@node-1 ⇣⇡ ❯ sysctl net.ipv4.conf.all.rp_filter
net.ipv4.conf.all.rp_filter = 0
lex@node-1 ⇣⇡ ❯ sysctl net.ipv4.conf.lxc42edfebcb30f.rp_filter
net.ipv4.conf.lxc42edfebcb30f.rp_filter = 0
lex@node-1 ⇣⇡ ❯ sysctl net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.eth0.rp_filter = 0
lex@node-1 ⇣⇡ ❯ sysctl net.ipv4.conf.ens10.rp_filter
net.ipv4.conf.ens10.rp_filter = 0
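For completeness, every interface can be checked in one pass with plain sysctl (nothing Cilium-specific):

sysctl -a 2>/dev/null | grep '\.rp_filter'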
There is a somewhat similar issue in the k3s repo (https://github.com/k3s-io/k3s/issues/5188), but I have had network policy disabled for quite some time, and that issue affects outside connections, not the agent as a whole.
If you could provide a sysdump or clear steps to reproduce the issue, that would be helpful. Does the problem occur with a fresh Ubuntu 22.04 and a fresh k3s install?
It was happening on a fresh install too. I'll try to create a single-node k3s cluster on the same provider with the same settings.
Okay, fresh node with 1.23.7, and it's crashing now. Where should I send the sysdump?
When you add a new comment, there is an area at the bottom where you can click and attach files. Are you able to reproduce this on a single-node k3s install? Which provider hosts your Ubuntu VM?
My VM provider is Hetzner Cloud. It's pretty much a bare install, except for the typical /etc/hostname, mailer, etc. tweaks.
k3s config (goes to /etc/rancher/k3s/config.yaml):
# master-only stuff
cluster-init: true
disable:
- metrics-server
- traefik
- servicelb
- coredns
flannel-backend: 'none'
cluster-cidr: 10.251.0.0/16
disable-cloud-controller: true
disable-kube-proxy: true
disable-network-policy: true
etcd-expose-metrics: true
kube-controller-manager-arg:
- bind-address=0.0.0.0
kube-proxy-arg:
- metrics-bind-address=0.0.0.0
kube-scheduler-arg:
- bind-address=0.0.0.0
kubelet-arg:
- cloud-provider=external
# generic stuff
node-external-ip: YOUR_EXTERNAL_IP
node-ip: YOUR_INTERNAL_IP
You can probably skip
disable-cloud-controller: true
kubelet-arg:
- cloud-provider=external
since it's only for provisioning via an external cloud provider, and it probably doesn't mean anything here since Cilium has a toleration for node.cloudprovider.kubernetes.io/uninitialized: true.
No arguments are passed to the k3s binary; everything goes through the config.
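For reference, with this approach the installer itself needs no flags; k3s reads /etc/rancher/k3s/config.yaml automatically as long as it exists before the install runs. A minimal sketch:

# the path below is the k3s default config location
sudo mkdir -p /etc/rancher/k3s
sudo cp config.yaml /etc/rancher/k3s/config.yaml   # the config shown above; the local filename is illustrative
curl -sfL https://get.k3s.io | sh -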
Helm values were already provided in the ticket.
Nothing is deployed except Cilium and CoreDNS (which fails to start because of the cloud-provider taint).
The sysdump is attached too.
I also checked that rp_filter applies correctly:
root@master-1:~# sysctl net.ipv4.conf.cilium_geneve.rp_filter
net.ipv4.conf.cilium_geneve.rp_filter = 0
root@master-1:~# sysctl net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.eth0.rp_filter = 0
root@master-1:~# sysctl net.ipv4.conf.enp7s0.rp_filter
net.ipv4.conf.enp7s0.rp_filter = 0
root@master-1:~# sysctl net.ipv4.conf.lxcdd738a58e10a.rp_filter
net.ipv4.conf.lxcdd738a58e10a.rp_filter = 0
The sysdump did not complete, I think because the cilium-agent is not working:
⚠️ cniconflist-cilium-44zfq: unable to upgrade connection: container not found ("cilium-agent")
⚠️ gops-cilium-44zfq-memstats: failed to list processes "cilium-44zfq" ("cilium-agent") in namespace "kube-system": unable to upgrade connection: container not found ("cilium-agent")
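If the agent container keeps dying like this, its last output can usually still be pulled from the previous container instance with plain kubectl (pod name taken from the warnings above):

kubectl -n kube-system logs cilium-44zfq -c cilium-agent --previous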
I have this sysctl rp_filter override on 22.04; can you try applying it before the k3s install?
cat /etc/sysctl.d/99-override_cilium_rp_filter.conf
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.*.rp_filter = 0
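To apply the drop-in without a reboot, reloading all sysctl configuration files should be enough:

sudo sysctl --system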
rp_filter is already disabled via my k3s Ansible role, but I applied your config manually just in case, and nothing changed. It looks like 1.12.0-rc3 already handles this:
root@master-1:~# cat /etc/sysctl.d/99-zzz-override_cilium.conf
# Disable rp_filter on Cilium interfaces since it may cause mangled packets to be dropped
net.ipv4.conf.lxc*.rp_filter = 0
net.ipv4.conf.cilium_*.rp_filter = 0
# The kernel uses max(conf.all, conf.{dev}) as its value, so we need to set .all. to 0 as well.
# Otherwise it will overrule the device specific settings.
net.ipv4.conf.all.rp_filter = 0
Interestingly, the operator is crashing too.
By the way, this is how I install k3s on an Ubuntu 22.04 VM with two network interfaces: one behind a company proxy for internet access (10.3.72.9), the other on the internal network (10.169.72.9). It works fine, but I do need to have the rp_filter override configured before the k3s install, otherwise it won't work.
curl -sfL https://get.k3s.io | INSTALL_K3S_SYMLINK=force INSTALL_K3S_VERSION='v1.24.1+k3s1' INSTALL_K3S_EXEC='--flannel-backend=none --node-ip=10.169.72.9 --node-external-ip=10.3.72.9 --disable=traefik --disable-kube-proxy --disable-network-policy --kube-apiserver-arg=kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname' sh -
and then Cilium is installed with:
cilium install --version=v1.12.0-rc3 --kube-proxy-replacement strict --helm-set-string=k8sServiceHost=10.3.72.9,k8sServicePort=6443
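Afterwards, the standard cilium-cli check can confirm the agent comes up before any further testing:

cilium status --wait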
I noticed that I was lacking
net.ipv4.conf.default.rp_filter = 0
However, all interfaces still had rp_filter disabled. Just to make sure, I added it to 99-sysctl.conf and rebooted the node, with no luck.
These are the last messages in the logs before cilium-agent gets restarted:
level=error msg="Command execution failed" cmd="[/var/lib/cilium/bpf/init.sh /var/lib/cilium/bpf /var/run/cilium/state /host/proc/sys/net /sys/class/net 10.251.0.78 <nil> tunnel geneve 6081 enp7s0;eth0 cilium_host cilium_net 1500 true true true /run/cilium/cgroupv2 /sys/fs/bpf true true v3 3 true true 1]" error="signal: killed" subsys=datapath-loader
level=fatal msg="Error while creating daemon" error="error while initializing daemon: failed while reinitializing datapath: Command execution failed for [/var/lib/cilium/bpf/init.sh /var/lib/cilium/bpf /var/run/cilium/state /host/proc/sys/net /sys/class/net 10.251.0.78 <nil> tunnel geneve 6081 enp7s0;eth0 cilium_host cilium_net 1500 true true true /run/cilium/cgroupv2 /sys/fs/bpf true true v3 3 true true 1]: context canceled" subsys=daemon
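The "signal: killed" during init.sh suggests something external terminated the process, for example the kernel OOM killer or a probe timeout. A quick, generic host-side check (not from the original report) would be:

dmesg -T | grep -iE 'out of memory|killed process'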
And another run:
level=debug msg="Skipping CiliumEndpoint update because it has no k8s pod name" containerID= controller="sync-to-k8s-ciliumendpoint (3605)" datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=3605 identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpointsynchronizer
level=debug msg="Controller func execution time: 112.803µs" name="sync-to-k8s-ciliumendpoint (3605)" subsys=controller uuid=35c229f3-e3fa-43b2-9821-f9a45d635ed8
level=debug msg="Controller func execution time: 181.832µs" name=metricsmap-bpf-prom-sync subsys=controller uuid=16644322-5264-426b-a5a0-a910880228b2
level=debug msg="Handling request for /healthz" subsys=health-server
level=debug msg="Controller func execution time: 2.705µs" name=bpf-map-sync-cilium_lxc subsys=controller uuid=09615ada-f6f2-4735-9a13-1b851dccef9c
level=debug msg="Controller func execution time: 3.106µs" name=bpf-map-sync-cilium_throttle subsys=controller uuid=0681e6d2-7353-4f88-b38a-d0f96e318edb
level=debug msg="Controller func execution time: 198.484µs" name=metricsmap-bpf-prom-sync subsys=controller uuid=16644322-5264-426b-a5a0-a910880228b2
level=debug msg="Handling request for /healthz" subsys=health-server
level=debug msg="Skip pod event using host networking" k8sNamespace=kube-system k8sPodName=cilium-operator-6694d646b8-6vqh9 new-hostIP=10.31.0.2 new-podIP=10.31.0.2 new-podIPs="[{10.31.0.2}]" old-hostIP=10.31.0.2 old-podIP=10.31.0.2 old-podIPs="[{10.31.0.2}]" subsys=k8s-watcher
level=debug msg="Kubernetes service definition changed" action=service-updated endpoints="10.31.0.2:6942/TCP" k8sNamespace=kube-system k8sSvcName=cilium-operator old-service=nil service="frontends:[]/ports=[metrics]/selector=map[io.cilium/app:operator name:cilium-operator]" subsys=k8s-watcher
level=debug msg="Upserting IP into ipcache layer" identity="{host kube-apiserver false}" ipAddr=95.217.22.103 key=0 subsys=ipcache
level=debug msg="Daemon notified of IP-Identity cache state change" identity="{host kube-apiserver false}" ipAddr="{95.217.22.103 ffffffff}" modification=Upsert subsys=datapath-ipcache
level=debug msg="Upserting IP into ipcache layer" identity="{host local false}" ipAddr=10.31.0.2 key=0 subsys=ipcache
level=debug msg="Daemon notified of IP-Identity cache state change" identity="{host local false}" ipAddr="{10.31.0.2 ffffffff}" modification=Upsert subsys=datapath-ipcache
level=debug msg="Upserting IP into ipcache layer" identity="{host local false}" ipAddr=10.251.0.119 key=0 subsys=ipcache
level=debug msg="Daemon notified of IP-Identity cache state change" identity="{host local false}" ipAddr="{10.251.0.119 ffffffff}" modification=Upsert subsys=datapath-ipcache
level=debug msg="Upserting IP into ipcache layer" identity="{world local false}" ipAddr=0.0.0.0/0 key=0 subsys=ipcache
level=debug msg="Daemon notified of IP-Identity cache state change" identity="{world local false}" ipAddr="{0.0.0.0 00000000}" modification=Upsert subsys=datapath-ipcache
level=debug msg="Controller func execution time: 1.352153ms" name=sync-endpoints-and-host-ips subsys=controller uuid=9a459bd7-d181-43a0-b398-3803a12e64fd
level=debug msg="Skipping CiliumEndpoint update because it has no k8s pod name" containerID= controller="sync-to-k8s-ciliumendpoint (2983)" datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=2983 identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpointsynchronizer
level=debug msg="Controller func execution time: 87.995µs" name="sync-to-k8s-ciliumendpoint (2983)" subsys=controller uuid=2bb957f8-4912-4733-addf-959638d8675e
level=debug msg="Skipping CiliumEndpoint update because it has not changed" containerID=d23d499354 controller="sync-to-k8s-ciliumendpoint (1825)" datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=1825 identity=40815 ipv4=10.251.0.233 ipv6= k8sPodName=kube-system/coredns-655b9bc459-fpmp9 subsys=endpointsynchronizer
level=debug msg="Controller func execution time: 117.291µs" name="sync-to-k8s-ciliumendpoint (1825)" subsys=controller uuid=d1f92af6-964d-4c22-a169-7b477de0280b
level=debug msg="Controller func execution time: 1.813µs" name=bpf-map-sync-cilium_lxc subsys=controller uuid=09615ada-f6f2-4735-9a13-1b851dccef9c
level=debug msg="Controller func execution time: 388.791µs" name=link-cache subsys=controller uuid=83be8b7c-d5f4-4882-970e-eea8d973114a
level=debug msg="Controller func execution time: 2.445µs" name=bpf-map-sync-cilium_throttle subsys=controller uuid=0681e6d2-7353-4f88-b38a-d0f96e318edb
level=debug msg="Controller func execution time: 1.341742ms" name=cilium-health-ep subsys=controller uuid=3fe75ff7-ba22-4f8a-b7d8-3c04a8ec3dfb
level=debug msg="Skipping CiliumEndpoint update because it has no k8s pod name" containerID= controller="sync-to-k8s-ciliumendpoint (3605)" datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=3605 identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpointsynchronizer
level=debug msg="Controller func execution time: 174.97µs" name="sync-to-k8s-ciliumendpoint (3605)" subsys=controller uuid=35c229f3-e3fa-43b2-9821-f9a45d635ed8
level=debug msg="Controller func execution time: 270.538µs" name=metricsmap-bpf-prom-sync subsys=controller uuid=16644322-5264-426b-a5a0-a910880228b2
level=info msg="Exiting due to signal" signal=terminated subsys=daemon
level=debug msg="canceling context in signal handler" subsys=daemon
level=info msg="Shutting down... " subsys=health-server
level=info msg="HTTP server Shutdown: context deadline exceeded" subsys=health-server
level=debug msg="Killing old health endpoint process" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
level=info msg="Stopped serving cilium health API at unix:///var/run/cilium/health.sock" subsys=health-server
level=debug msg="Killed endpoint process" pid=522 pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
level=info msg="Shutting down... " subsys=daemon
level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=cilium_health
level=info msg="HTTP server Shutdown: context deadline exceeded" subsys=daemon
level=info msg="Stopped serving cilium API at unix:///var/run/cilium/cilium.sock" subsys=daemon
level=debug msg="exiting retrying regeneration goroutine due to endpoint being deleted" containerID= datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=2983 identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=debug msg="Controller func execution time: 1m1.51729468s" name=endpoint-2983-regeneration-recovery subsys=controller uuid=f35246b8-5ddd-4c46-b1ca-dd804fcc6987
level=debug msg="exiting retrying regeneration goroutine due to endpoint being deleted" containerID=d23d499354 datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=1825 identity=40815 ipv4=10.251.0.233 ipv6= k8sPodName=kube-system/coredns-655b9bc459-fpmp9 subsys=endpoint
level=debug msg="Controller run succeeded; waiting for next controller update or stop" name=endpoint-2983-regeneration-recovery subsys=controller uuid=f35246b8-5ddd-4c46-b1ca-dd804fcc6987
level=debug msg="Controller func execution time: 1m1.516447469s" name=endpoint-1825-regeneration-recovery subsys=controller uuid=a292e931-75be-43fb-888e-d69372ea10b9
level=debug msg="Controller run succeeded; waiting for next controller update or stop" name=endpoint-1825-regeneration-recovery subsys=controller uuid=a292e931-75be-43fb-888e-d69372ea10b9
level=debug msg="Process exited" cmd="ip [netns exec cilium-health cilium-health-responder --listen 4240 --pidfile /var/run/cilium/state/health-endpoint.pid]" exitCode="signal: killed" subsys=launcher
level=info msg="Waiting for all endpoints' go routines to be stopped." subsys=daemon
level=debug msg="stopping EventQueue" name=endpoint-2983 subsys=eventqueue
level=debug msg="stopping EventQueue" name=endpoint-1825 subsys=eventqueue
level=debug msg="stopping EventQueue" name=endpoint-3605 subsys=eventqueue
level=info msg="All endpoints' goroutines stopped." subsys=daemon
Could some of the custom parameters cause the crashloop? BPF masquerade? DSR? Geneve tunneling?
Could be; I did notice you have quite a few custom settings. Can you install k3s with curl and Cilium with cilium-cli the way I did, just for testing, to see if it crashes?
Yep, went green. No crashes.
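One way to narrow it down further (a sketch only, assuming the Helm-based install from the ticket; the value names are from the 1.12 chart) would be to re-add the custom settings one at a time until the crash reappears:

# hypothetical bisection over the suspected settings
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set tunnel=geneve
# redeploy, watch the agent, then try the next one:
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set bpf.masquerade=true
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set loadBalancer.mode=dsr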
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.
What happened?
Cilium 1.12.0-rc3 (at the moment), Ubuntu 22.04, k3s v1.23.6+k3s1.
As soon as I update k3s to 1.23.7 or 1.24, cilium-agent starts to crashloop.
Here is the relevant Slack thread, just in case: https://cilium.slack.com/archives/C1MATJ5U5/p1655243973200179
I can easily reproduce it by upgrading one of the nodes to an affected version (anything newer than 1.23.6): cilium-agent fails to start. Downgrading fixes it instantly.
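For reference, upgrading or downgrading is just a re-run of the install script with a pinned version; this is standard k3s installer behavior, not anything specific to this issue:

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.23.7+k3s1' sh -   # affected version, agent crashes
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.23.6+k3s1' sh -   # downgrade, agent recovers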
My Helm values for Cilium:
One of the interesting parts of the logs:
Might be related to https://github.com/cilium/cilium/issues/8595? But that's a pretty old one.
Cilium Version
cilium-cli: 0.11.10 compiled with go1.18.3 on darwin/arm64
cilium image (default): v1.11.6
cilium image (stable): v1.11.6
cilium image (running): v1.12.0-rc3
Also tried with stable.
Kernel Version
Kubernetes Version
Client Version: v1.24.1
Kustomize Version: v4.5.4
Server Version: v1.23.6+k3s1