Hmm... That's pretty odd. I agree that you should be seeing some kind of error.
The health of the pod is reported by the health check controller; it is what updates the status of the `/healthz` endpoint that the kubelet checks to see whether the pod is healthy. If it is not healthy, I would expect an error in the logs like you can see here: https://github.com/cloudnativelabs/kube-router/blob/master/pkg/healthcheck/health_controller.go#L139-L181
I would try increasing all of the health check timings (especially `initialDelaySeconds`) and see if you can get an error out of the health check controller about what is not healthy. You could also try increasing the verbosity of the logging with the `-v` flag and see if that gets you anything.
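In case it helps, here is a rough sketch of where those knobs live in the kube-router DaemonSet container spec. The timing values below are just illustrative assumptions to tune for your environment, not recommendations:

```yaml
# Sketch only: illustrative excerpt of a kube-router container spec.
# The probe timings and -v level are assumptions to adjust while debugging.
containers:
  - name: kube-router
    args:
      - --run-router=true
      - --run-firewall=true
      - --run-service-proxy=true
      - -v=3                      # raise klog verbosity for more detail
    livenessProbe:
      httpGet:
        path: /healthz
        port: 20244               # kube-router's default health port
      initialDelaySeconds: 60     # give the controllers extra time before the first probe
      periodSeconds: 10
      timeoutSeconds: 5
```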
ngl I tried increasing the log level by feeding... things to `-v` and it was unhappy with any value I tried xD but I'm sure it was a pebcak type of thing. I'll try again while I attempt your suggestion.
thx!
[EDIT] re: log level - that was definitely a pebcak, I don't know what I was doing...
Again, I don't see anything obviously critical here. There are some messages about the queue being full, but I doubt that's the cause here?
[EDIT1] Even with the log level increased further, the end of the log looks like this:
```
I0513 00:46:44.186486 55210 health_controller.go:229] Health controller tick
time="2024-05-13T00:46:46Z" level=info msg="Peer Up" Key=10.19.90.1 State=BGP_FSM_OPENCONFIRM Topic=Peer
time="2024-05-13T00:46:46Z" level=info msg="sync finished" Key=10.19.90.1 Topic=Server
I0513 00:46:49.184889 55210 health_controller.go:229] Health controller tick
I0513 00:46:54.187385 55210 health_controller.go:229] Health controller tick
I0513 00:46:59.186868 55210 health_controller.go:229] Health controller tick
I0513 00:47:04.187371 55210 health_controller.go:229] Health controller tick
I0513 00:47:09.186820 55210 health_controller.go:229] Health controller tick
I0513 00:47:14.186786 55210 health_controller.go:229] Health controller tick
I0513 00:47:19.184962 55210 health_controller.go:229] Health controller tick
I0513 00:47:24.187327 55210 health_controller.go:229] Health controller tick
I0513 00:47:29.186696 55210 health_controller.go:229] Health controller tick
I0513 00:47:34.187140 55210 health_controller.go:229] Health controller tick
I0513 00:47:39.183448 55210 health_controller.go:229] Health controller tick
I0513 00:47:39.226792 55210 route_sync.go:41] Running local route table synchronization
I0513 00:47:44.182920 55210 health_controller.go:229] Health controller tick
I0513 00:47:49.185422 55210 health_controller.go:229] Health controller tick
I0513 00:47:54.186332 55210 health_controller.go:229] Health controller tick
I0513 00:47:59.186574 55210 health_controller.go:229] Health controller tick
I0513 00:48:04.187059 55210 health_controller.go:229] Health controller tick
I0513 00:48:09.183142 55210 health_controller.go:229] Health controller tick
I0513 00:48:14.186812 55210 health_controller.go:229] Health controller tick
I0513 00:48:19.185447 55210 health_controller.go:229] Health controller tick
I0513 00:48:24.186886 55210 health_controller.go:229] Health controller tick
I0513 00:48:29.187283 55210 health_controller.go:229] Health controller tick
I0513 00:48:34.185537 55210 health_controller.go:229] Health controller tick
I0513 00:48:39.183205 55210 health_controller.go:229] Health controller tick
I0513 00:48:39.226455 55210 route_sync.go:41] Running local route table synchronization
I0513 00:48:44.183011 55210 health_controller.go:229] Health controller tick
I0513 00:48:49.183513 55210 health_controller.go:229] Health controller tick
I0513 00:48:54.184193 55210 health_controller.go:229] Health controller tick
I0513 00:48:59.183682 55210 health_controller.go:229] Health controller tick
I0513 00:49:04.183513 55210 health_controller.go:229] Health controller tick
I0513 00:49:09.184137 55210 health_controller.go:229] Health controller tick
I0513 00:49:14.182913 55210 health_controller.go:229] Health controller tick
I0513 00:49:19.182591 55210 health_controller.go:229] Health controller tick
I0513 00:49:24.184197 55210 health_controller.go:229] Health controller tick
I0513 00:49:29.183680 55210 health_controller.go:229] Health controller tick
I0513 00:49:32.538177 55210 kube-router.go:266] Shutting down the controllers
I0513 00:49:32.538278 55210 health_controller.go:208] Shutting down health controller
I0513 00:49:32.538285 55210 nodeport_healthcheck.go:170] Stopping all NodePort health checks
I0513 00:49:32.538307 55210 nodeport_healthcheck.go:172] Waiting for all NodePort health checks to finish shutting down
I0513 00:49:32.538323 55210 nodeport_healthcheck.go:174] All NodePort health checks are completely shut down, all done!
I0513 00:49:32.538357 55210 network_services_controller.go:338] Shutting down network services controller
I0513 00:49:32.538398 55210 network_policy_controller.go:189] Shutting down network policies full sync goroutine
E0513 00:49:32.538387 55210 health_controller.go:199] Health controller error: http: Server closed
I0513 00:49:32.538419 55210 health_controller.go:224] Shutting down HealthController RunCheck
I0513 00:49:32.538456 55210 network_routes_controller.go:407] Shutting down network routes controller
I0513 00:49:32.538460 55210 route_sync.go:64] Shutting down local route synchronization
I0513 00:49:32.538458 55210 network_policy_controller.go:204] Shutting down network policies controller
```
[EDIT2] `curl localhost:20244/healthz` returns OK until the very moment the pod cycles, which I did not expect.
Hmm... That is pretty odd. Looking back up at your event log, I can see now that it actually says:
Warning Unhealthy 11m kubelet Liveness probe failed: Get "http://10.19.91.100:20244/healthz": dial tcp 10.19.91.100:20244: connect: connection refused
So essentially kubelet is saying that it's unable to reach that endpoint. I'm assuming that 10.19.91.100 is the node's IP, since kube-router should be running within the host network. I would next try increasing the number of failed liveness checks that Kubernetes allows before it kills the service (`failureThreshold`), then get the exact endpoint that is failing via `kubectl describe pod -n kube-system <kube-router-pod-here>`, and then curl that (from the host where kube-router is running, i.e. the same path that kubelet is using) to see if you get a different response.
I'm going to assume that you'll get connection refused or something similar, so then it is going to be more about checking the location of the kube-router IP address and trying to understand if there is something about the system setup that is preventing that from working. Maybe additional system firewall rules or something?
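For example, bumping `failureThreshold` on the liveness probe might look roughly like the sketch below. The numbers are placeholder assumptions; the point is just to give yourself more failed probes before kubelet restarts the pod:

```yaml
# Sketch only: liveness probe with a higher failureThreshold while debugging.
# All values here are illustrative assumptions, not recommendations.
livenessProbe:
  httpGet:
    path: /healthz
    port: 20244
  initialDelaySeconds: 10
  periodSeconds: 3
  failureThreshold: 30   # Kubernetes default is 3; this allows many more failures before a restart
```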
As a very last resort, you could of course run without any liveness checks. That gets rid of a core component of the Kubernetes stability guarantees, but it looks like kube-router is 100% healthy from all of your logs.
You are correct, that is the IP of that node.
The thing is, that event is not even consistently in there. The firewall is also disabled on Ubuntu (`ufw status` returns inactive, not that it should matter for local traffic). The curl I mentioned above was run from the host itself. I'm at a loss here...
I would consider disabling the liveness probe if kube-router was working as is. But as far as I can tell the DNS/BGP part is also not working properly, since I can't resolve things like kubernetes.default.svc.cluster.local or kube-dns.kube-system.svc.cluster.local unless I `dig ... @10.201.0.10` (and only from that node; from another machine it doesn't work since the routes don't seem to be propagated properly).
Note that kubelet is not curling localhost to get the health of kube-router; it is using the IP of the interface to the node, so that may be a difference worth looking into on your node. `ufw` showing inactive would seem to rule that out as an issue on a first pass, but there may be other things playing with firewall rules or the like.
One other thing that I noticed is that you don't seem to be setting `--service-cluster-ip-range` (see https://www.kube-router.io/docs/user-guide/#command-line-options for more details), and your ClusterIP, at least for DNS, appears to be outside the k8s default range of 10.96.0.0/12, so that may also be causing you problems. I would recommend setting that flag for your cluster as a parameter to kube-router.
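For example (going by the 10.201.0.10 ClusterIP of your DNS service, I'm guessing your Service CIDR is something like 10.201.0.0/16; substitute whatever range your apiserver is actually configured with):

```yaml
# Sketch only: add the Service CIDR that matches your cluster's apiserver
# --service-cluster-ip-range (10.201.0.0/16 is an assumption based on the DNS ClusterIP above).
args:
  - --run-router=true
  - --run-firewall=true
  - --run-service-proxy=true
  - --service-cluster-ip-range=10.201.0.0/16
```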
So with a cluster size of 1, I don't believe that BGP is going to play a part in the service availability. Within kube-router, BGP is only used to propagate routes between multiple nodes within a cluster, or to peer with external peers. I see that you have defined a peer via annotations, which should be working; from the logs it looks like a route was established:
```
kube-router-qx29r kube-router 2024-05-12T19:29:04.885885804-05:00 time="2024-05-13T00:29:04Z" level=info msg="Peer Up" Key=10.19.90.1 State=BGP_FSM_OPENCONFIRM Topic=Peer
kube-router-qx29r kube-router 2024-05-12T19:29:04.885926731-05:00 time="2024-05-13T00:29:04Z" level=info msg="sync finished" Key=10.19.90.1 Topic=Server
```
DNS is a whole different thing; that is handled by CoreDNS or kube-dns or some other application within the cluster. kube-router only provides the route to the service, so since a dig against that IP address is working, I would suggest that the kube-router part is working.
If you can provide more information on what specifically is not working with the BGP peering or what services are not available, we might be able to dig more into that, but otherwise, from everything I can see kube-router appears to be working and healthy within the cluster.
Hi, thanks for pointing out I was missing the service bit :-)
After another couple of installs from scratch (I had a few issues with Kubernetes absolutely freaking out as I was trying some things)...
Anyway, I have a new data point. kube-router only fails as soon as I add `--advertise-cluster-ip`. So there does seem to be something not clicking with the BGP part of my config :-/
Here is what I'm currently trying with:
```yaml
args:
  - --run-router=true
  - --run-firewall=true
  - --run-service-proxy=true
  - --bgp-graceful-restart=true
  - --kubeconfig=/var/lib/kube-router/kubeconfig
  - --service-cluster-ip-range=10.201.0.0/16
  - --peer-router-ips=10.19.90.1
  - --peer-router-asns=64513
  - --advertise-cluster-ip
  - --advertise-external-ip
  - --advertise-loadbalancer-ip
  - --advertise-pod-cidr
  - --nodes-full-mesh
  - -v=10
```
[EDIT] I'm not getting consistent behavior from kube-router. Now it seems it's willing to crash even with just the router IP/ASN set.
From what I can tell, kube-router isn't crashing, at least not in any of the logs that you've sent so far; it seems to be being killed.
What happened?
Hi. I am in the process of re-installing my homelab k8s cluster from scratch (it was outdated in many ways) and I had kube-router working, but I can't seem to figure out how to set it up again. It keeps crashing (failing its liveness probe) without any obvious cause I could find. I'm also trying to enable BGP (which I had set up previously) but that doesn't seem to be the root cause.
What did you expect to happen?
I expect kube-router to work, and BGP w/ my firewall to work.
How can we reproduce the behavior you experienced?
Steps to reproduce the behavior:
(I also tried to set this in the yaml manifest without any change, plus this seems to work)
System Information (please complete the following information):
Kube-Router Version (`kube-router --version`): Running kube-router version v2.1.1, built on 2024-04-27T21:51:12+0000, go1.21.7
Kubernetes Version (`kubectl version`): 1.30.0
Logs, other output, metrics
logs
```
kube-router-gcc7j kube-router 2024-05-12T13:28:03.947391739-05:00 I0512 18:28:03.947202 19577 version.go:66] Running /usr/local/bin/kube-router version v2.1.1, built on 2024-04-27T21:51:12+0000, go1.21.7
kube-router-gcc7j kube-router 2024-05-12T13:28:04.049544598-05:00 I0512 18:28:04.049344 19577 kube-router.go:137] Metrics port must be over 0 and under 65535 in order to be enabled, given port: 0
kube-router-gcc7j kube-router 2024-05-12T13:28:04.049569301-05:00 I0512 18:28:04.049408 19577 kube-router.go:139] Disabling metrics for kube-router, set --metrics-port properly in order to enable
kube-router-gcc7j kube-router 2024-05-12T13:28:04.060999054-05:00 I0512 18:28:04.060783 19577 network_routes_controller.go:1643] Could not find annotation `kube-router.io/bgp-local-addresses` on node object so BGP will listen on node IP: [10.19.91.100] addresses.
kube-router-gcc7j kube-router 2024-05-12T13:28:04.086103370-05:00 I0512 18:28:04.085927 19577 network_services_controller.go:241] Starting network services controller
kube-router-gcc7j kube-router 2024-05-12T13:28:04.090012726-05:00 I0512 18:28:04.089824 19577 network_routes_controller.go:267] Setting MTU of kube-bridge interface to: 1500
kube-router-gcc7j kube-router 2024-05-12T13:28:04.091856484-05:00 I0512 18:28:04.091690 19577 network_routes_controller.go:315] Starting network route controller
kube-router-gcc7j kube-router 2024-05-12T13:28:04.096425450-05:00 I0512 18:28:04.096280 19577 network_routes_controller.go:1297] Could not find BGP peer password info in the node's annotations. Assuming no passwords.
kube-router-gcc7j kube-router 2024-05-12T13:28:04.096439186-05:00 I0512 18:28:04.096324 19577 network_routes_controller.go:1314] Could not find BGP peer local ip info in the node's annotations. Assuming node IP.
kube-router-gcc7j kube-router 2024-05-12T13:28:04.096834679-05:00 time="2024-05-12T18:28:04Z" level=info msg="Add a peer configuration" Key=10.19.90.1 Topic=Peer
kube-router-gcc7j kube-router 2024-05-12T13:28:04.106684737-05:00 I0512 18:28:04.106513 19577 bgp_policies.go:772] Did not match any existing policies, starting import policy with default name: kube_router_export0
kube-router-gcc7j kube-router 2024-05-12T13:28:04.106698798-05:00 I0512 18:28:04.106538 19577 bgp_policies.go:781] Current policy does not appear to match new policy: kube_router_export - creating new policy
kube-router-gcc7j kube-router 2024-05-12T13:28:04.106701891-05:00 I0512 18:28:04.106607 19577 bgp_policies.go:787] Ensuring that policy kube_router_export1 is assigned
kube-router-gcc7j kube-router 2024-05-12T13:28:04.106953321-05:00 I0512 18:28:04.106877 19577 bgp_policies.go:929] Did not match any existing policies, starting import policy with default name: kube_router_import0
kube-router-gcc7j kube-router 2024-05-12T13:28:04.106959598-05:00 I0512 18:28:04.106913 19577 bgp_policies.go:938] Current policy does not appear to match new policy: kube_router_import - creating new policy
kube-router-gcc7j kube-router 2024-05-12T13:28:04.107037775-05:00 I0512 18:28:04.106976 19577 bgp_policies.go:944] Ensuring that policy kube_router_import1 is assigned
kube-router-gcc7j kube-router 2024-05-12T13:28:04.145833414-05:00 I0512 18:28:04.145614 19577 network_policy_controller.go:163] Starting network policy controller
kube-router-gcc7j kube-router 2024-05-12T13:28:04.200385691-05:00 I0512 18:28:04.200186 19577 network_policy_controller.go:175] Starting network policy controller full sync goroutine
kube-router-gcc7j kube-router 2024-05-12T13:28:11.100230681-05:00 time="2024-05-12T18:28:11Z" level=info msg="Peer Up" Key=10.19.90.1 State=BGP_FSM_OPENCONFIRM Topic=Peer
kube-router-gcc7j kube-router 2024-05-12T13:28:11.100276476-05:00 time="2024-05-12T18:28:11Z" level=info msg="sync finished" Key=10.19.90.1 Topic=Server
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538503329-05:00 I0512 18:29:25.538282 19577 kube-router.go:266] Shutting down the controllers
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538559204-05:00 I0512 18:29:25.538430 19577 nodeport_healthcheck.go:170] Stopping all NodePort health checks
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538571990-05:00 I0512 18:29:25.538461 19577 nodeport_healthcheck.go:172] Waiting for all NodePort health checks to finish shutting down
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538592493-05:00 I0512 18:29:25.538471 19577 nodeport_healthcheck.go:174] All NodePort health checks are completely shut down, all done!
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538600956-05:00 I0512 18:29:25.538492 19577 health_controller.go:208] Shutting down health controller
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538608832-05:00 I0512 18:29:25.538503 19577 network_services_controller.go:338] Shutting down network services controller
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538659884-05:00 I0512 18:29:25.538553 19577 network_policy_controller.go:189] Shutting down network policies full sync goroutine
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538677540-05:00 I0512 18:29:25.538573 19577 network_policy_controller.go:204] Shutting down network policies controller
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538684257-05:00 E0512 18:29:25.538570 19577 health_controller.go:199] Health controller error: http: Server closed
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538696533-05:00 I0512 18:29:25.538645 19577 network_routes_controller.go:407] Shutting down network routes controller
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538756684-05:00 I0512 18:29:25.538683 19577 route_sync.go:64] Shutting down local route synchronization
kube-router-gcc7j kube-router 2024-05-12T13:29:25.538773942-05:00 I0512 18:29:25.538703 19577 health_controller.go:224] Shutting down HealthController RunCheck
```
Events
```
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       12m                   default-scheduler  Successfully assigned kube-system/kube-router-gcc7j to k8s-master-000
  Normal   Pulled          12m                   kubelet            Successfully pulled image "docker.io/cloudnativelabs/kube-router" in 445ms (445ms including waiting). Image size: 102687278 bytes.
  Normal   Pulled          12m                   kubelet            Successfully pulled image "docker.io/cloudnativelabs/kube-router" in 384ms (384ms including waiting). Image size: 102687278 bytes.
  Warning  Unhealthy       11m                   kubelet            Liveness probe failed: Get "http://10.19.91.100:20244/healthz": dial tcp 10.19.91.100:20244: connect: connection refused
  Normal   Pulled          11m                   kubelet            Successfully pulled image "docker.io/cloudnativelabs/kube-router" in 440ms (440ms including waiting). Image size: 102687278 bytes.
  Normal   Pulling         11m (x2 over 12m)     kubelet            Pulling image "docker.io/cloudnativelabs/kube-router"
  Normal   Started         11m (x2 over 12m)     kubelet            Started container kube-router
  Normal   Created         11m (x2 over 12m)     kubelet            Created container kube-router
  Normal   Pulled          11m                   kubelet            Successfully pulled image "docker.io/cloudnativelabs/kube-router" in 413ms (413ms including waiting). Image size: 102687278 bytes.
  Normal   SandboxChanged  10m (x2 over 11m)     kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling         10m (x3 over 12m)     kubelet            Pulling image "docker.io/cloudnativelabs/kube-router"
  Normal   Started         10m (x3 over 12m)     kubelet            Started container install-cni
  Normal   Created         10m (x3 over 12m)     kubelet            Created container install-cni
  Normal   Pulled          10m                   kubelet            Successfully pulled image "docker.io/cloudnativelabs/kube-router" in 454ms (454ms including waiting). Image size: 102687278 bytes.
  Normal   Killing         7m21s (x3 over 11m)   kubelet            Stopping container kube-router
  Warning  BackOff         2m36s (x14 over 10m)  kubelet            Back-off restarting failed container kube-router in pod kube-router-gcc7j_kube-system(ddd069bf-e37e-4c9a-8c52-d09eab082d12)
```
Thanks in advance for the help. LMK if I missed anything. I would happily troubleshoot this myself, but with this amount of information I don't know where to start.