Failed to reconcile routes, retrying after error: network is unreachable

aojea / kindnet

minimalistic Kubernetes network plugin


Failed to reconcile routes, retrying after error: network is unreachable #46

Closed (dims closed this issue 1 month ago)

dims commented 2 months ago

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ec2-eks-al2023/1814933511295995904

2024-07-21T08:37:55.30079951Z stderr F I0721 08:37:55.300761       1 main.go:252] Failed to reconcile routes, retrying after error: network is unreachable
2024-07-21T08:37:59.30429058Z stderr F panic: Maximum retries reconciling node routes: network is unreachable
2024-07-21T08:37:59.304315797Z stderr F 
2024-07-21T08:37:59.304320194Z stderr F goroutine 1 [running]:
2024-07-21T08:37:59.304324062Z stderr F main.main()
2024-07-21T08:37:59.304327791Z stderr F     /src/cmd/kindnetd/main.go:256 +0x1414

See kindnet-cni logs from

dims commented 2 months ago

xref: https://github.com/kubernetes/kubernetes/issues/126255

aojea commented 2 months ago

Hmm, network unreachable :thinking:

aojea commented 2 months ago

Checking the cloud-init log: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ec2-eks-al2023/1814933511295995904/artifacts/logs/i-008394e69e833839a/cloud-init.log

2024-07-21 08:11:14,265 - url_helper.py[DEBUG]: Read from http://169.254.169.254/2021-03-23/dynamic/instance-identity/signature (200, 174b) after 1 attempts
2024-07-21 08:11:14,265 - util.py[DEBUG]: Crawl of metadata service took 0.180 seconds
2024-07-21 08:11:14,265 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', 'default', 'dev', 'ens5'] with allowed return codes [0] (shell=False, capture=True)
2024-07-21 08:11:14,267 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'del', '172.31.80.1', 'dev', 'ens5', 'src', '172.31.89.46'] with allowed return codes [0] (shell=False, capture=True)

Why are the default routes deleted?

dims commented 2 months ago

No idea! I'll watch out for this again.

aojea commented 1 month ago

For reference, we discussed this in Slack: the failing jobs use Nodes in different subnets. When kindnet tries to add the route to the pod subnet through the node IP, it fails because that IP is not reachable (the next hop has to be in the same subnet).
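
To illustrate the constraint, here is a minimal sketch of an on-link check. It is not kindnet's code; `nextHopOnLink` is a hypothetical helper, the `/20` prefix length is an assumption (the cloud-init log only shows the address `172.31.89.46` and the gateway `172.31.80.1`), and the two next-hop addresses are made up for the example.

```go
package main

import (
	"fmt"
	"net/netip"
)

// nextHopOnLink is a hypothetical helper (not part of kindnet). It reports
// whether nextHop falls inside the subnet of the local interface address,
// given in CIDR form such as "172.31.89.46/20".
func nextHopOnLink(localCIDR, nextHop string) (bool, error) {
	prefix, err := netip.ParsePrefix(localCIDR)
	if err != nil {
		return false, fmt.Errorf("parsing local address %q: %w", localCIDR, err)
	}
	hop, err := netip.ParseAddr(nextHop)
	if err != nil {
		return false, fmt.Errorf("parsing next hop %q: %w", nextHop, err)
	}
	// Masked() normalizes the prefix to its network address before the check.
	return prefix.Masked().Contains(hop), nil
}

func main() {
	// Address taken from the cloud-init log; the /20 length is assumed.
	local := "172.31.89.46/20"

	// Hypothetical node IPs: one in the same subnet, one in a different one.
	for _, hop := range []string{"172.31.80.7", "172.31.20.10"} {
		ok, err := nextHopOnLink(local, hop)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("next hop %s on-link: %t\n", hop, ok)
	}
}
```

A node whose IP fails this kind of check cannot be used as a directly connected next hop, which matches the "network is unreachable" error when the route is added.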

If we used the default gateway as the next hop instead, the VPC would need to know the node and pod subnets and route the traffic to the corresponding node. I don't know if this is possible, but I don't recommend this setup: it complicates the network and leaks details of the cluster to the VPC.

Another option is to create an overlay between nodes, but then you have a more complex setup that is harder to troubleshoot and has considerably worse performance.

My recommendation is for kubetest2 to always deploy the nodes in the same VPC subnet.

dims commented 1 month ago

/close

Done. thanks!

dims commented 1 month ago

> My recommendation is for kubetest2 to always deploy the nodes in the same VPC subnet

Thanks @aojea, I agree.