cloudnativelabs / kube-router

Kube-router, a turnkey solution for Kubernetes networking.
https://kube-router.io
Apache License 2.0

kube-router Holding on to Routes #1738

Open aauren opened 1 month ago

aauren commented 1 month ago

What happened?

Over time, kube-router appears to hit conditions in which it keeps BGP routes that have since been withdrawn. This likely happens because route_sync.go maintains its own cache of routes and, at some point, misses a BGP update for one reason or another.

Because of this, kube-router keeps re-installing stale routes to next hops that no longer host the service, which effectively blackholes traffic bound for that service.

What did you expect to happen?

kube-router to have an accurate route state at all times.

How can we reproduce the behavior you experienced?

This behavior is not easily reproduced, and the exact cause is not yet known. It appears to depend on state that builds up over time.

System Information (please complete the following information)

--advertise-external-ip=true
--bgp-graceful-restart=true
--bgp-graceful-restart-deferral-time=60s
--enable-ibgp=false
--enable-overlay=false
--hairpin-mode=true
--kubeconfig=/etc/kubernetes/kubectl-config.yaml
--metrics-port=9081
--nodes-full-mesh=false
--run-router=true
--run-firewall=true
--service-cluster-ip-range=172.28.0.0/16
--service-external-ip-range=192.168.1.0/24
--service-external-ip-range=192.168.2.0/24
--peer-router-ips=192.168.3.1,192.168.3.2,192.168.3.3
--peer-router-asns=4220000001,4220000001,4220000001
--peer-router-passwords-file=/etc/kube-router-bgp.conf
--cluster-asn=4220000001

Logs, other output, metrics

No logs are emitted when this issue occurs.

Additional context

kube-router probably needs a consistency check that runs periodically while the route_sync controller is running.

This would allow the controller to remain primarily event driven while also re-truing its state from time to time, so that it cannot drift into a state that is inconsistent with the desired state of the BGP subsystem.
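
As a rough illustration of the idea, here is a minimal sketch of an event-driven cache with a periodic reconcile pass. It is not kube-router's actual code: `routeCache`, `reconcile`, `runPeriodic`, and the `listDesired` callback are hypothetical names, and the example prefixes and next hops are made up. It assumes the desired route set can be listed as a simple prefix-to-next-hop map.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// routeCache is a hypothetical stand-in for a locally cached route table.
// It maps a destination prefix to its next hop.
type routeCache struct {
	mu     sync.Mutex
	routes map[string]string
}

func newRouteCache() *routeCache {
	return &routeCache{routes: make(map[string]string)}
}

// add is the event-driven path: it records a route when an update is seen.
func (c *routeCache) add(prefix, nextHop string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.routes[prefix] = nextHop
}

// reconcile re-trues the cache against the desired state. It drops cached
// prefixes that are no longer desired and adds or corrects the rest.
// It returns the removed and the added/corrected prefixes, sorted.
func (c *routeCache) reconcile(desired map[string]string) (removed, fixed []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for prefix := range c.routes {
		if _, ok := desired[prefix]; !ok {
			delete(c.routes, prefix) // stale: the withdrawal was missed
			removed = append(removed, prefix)
		}
	}
	for prefix, nextHop := range desired {
		if c.routes[prefix] != nextHop {
			c.routes[prefix] = nextHop
			fixed = append(fixed, prefix)
		}
	}
	sort.Strings(removed)
	sort.Strings(fixed)
	return removed, fixed
}

// runPeriodic calls reconcile on every tick until stop is closed.
// listDesired is a hypothetical callback returning the authoritative routes.
func (c *routeCache) runPeriodic(interval time.Duration, listDesired func() map[string]string, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.reconcile(listDesired())
		case <-stop:
			return
		}
	}
}

func main() {
	cache := newRouteCache()
	cache.add("192.168.1.10/32", "10.0.0.1")
	cache.add("192.168.1.11/32", "10.0.0.2")

	// Assume the withdrawal for 192.168.1.11/32 never reached the cache,
	// so the authoritative state only contains the first route.
	desired := map[string]string{"192.168.1.10/32": "10.0.0.1"}

	removed, fixed := cache.reconcile(desired)
	fmt.Println("removed:", removed, "fixed:", fixed)
	// prints "removed: [192.168.1.11/32] fixed: []"
}
```

In this sketch the event path stays cheap and the periodic pass bounds how long a missed withdrawal can survive to one reconcile interval. The interval would be a trade-off between load on the BGP subsystem and how long traffic could be blackholed.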

aauren commented 13 hours ago

This was accidentally closed by the merge of #1739; the work is not yet complete.