Too many BGP routing entries and neighbors between kube-router server and connected network devices

cloudnativer commented 4 years ago

Using Kube router in large-scale kubernetes cluster will lead to too many BGP neighbors and BGP routing entries of Kube router server and connected network devices by default, which will seriously affect the network performance of the cluster. Is there any good way to reduce the routing entries of both sides and the performance loss, so as to support the larger cluster network?

large-networks03

cloudnativer commented 4 years ago

You can try the following two methods: (1) Set the parameter "--enable-ibgp=false", do not let kubernetes node directly establish BGP neighbors with each other. Let your kubernetes node only build BGP neighbors with the on-line router device. (2) You'd better turn on the BGP ECMP function of the kubernetes node on-line router device. The effect of this method is that when the user's access traffic enters the router device, it is first balanced to the kubernetes node of the back end through ECMP load balancing, and then to the final pod through IPVS load balancing. When devices, links and nodes in the network are down, traffic can be automatically switched to other healthy devices, links and nodes.

cloudnativer commented 4 years ago

You can try the following two methods: (1) Set the parameter "--enable-ibgp=false", do not let kubernetes node directly establish BGP neighbors with each other. Let your kubernetes node only build BGP neighbors with the on-line router device. (2) You'd better turn on the BGP ECMP function of the kubernetes node on-line router device. The effect of this method is that when the user's access traffic enters the router device, it is first balanced to the kubernetes node of the back end through ECMP load balancing, and then to the final pod through IPVS load balancing. When devices, links and nodes in the network are down, traffic can be automatically switched to other healthy devices, links and nodes.

After our test, we found that the BGP neighbors and routes between kube-router and the uplink switch did reduce a lot. before: large-networks03

after: large-networks04

cloudnativer commented 4 years ago

You can try the following two methods: (1) Set the parameter "--enable-ibgp=false", do not let kubernetes node directly establish BGP neighbors with each other. Let your kubernetes node only build BGP neighbors with the on-line router device. (2) You'd better turn on the BGP ECMP function of the kubernetes node on-line router device. The effect of this method is that when the user's access traffic enters the router device, it is first balanced to the kubernetes node of the back end through ECMP load balancing, and then to the final pod through IPVS load balancing. When devices, links and nodes in the network are down, traffic can be automatically switched to other healthy devices, links and nodes.

After our test, we found that the BGP neighbors and routes between kube-router and the uplink switch did reduce a lot. before:

after:

According to this practice, some problems have been solved. But in the case of large-scale kubernetes cluster, when we enable ECMP routing load balancing, BGP routing on the switch has changed a lot. There are tens of thousands of kubernetes cluster service routes on each switch. large-networks09 However, our switch equipment only supports 200000 route forwarding. With the growth of kubernetes cluster size, more and more routes will be routed on the switch, which will eventually lead to the exhaustion of switch equipment performance and failure to work properly.

cloudnativer commented 4 years ago

You can try the following two methods: (1) Set the parameter "--enable-ibgp=false", do not let kubernetes node directly establish BGP neighbors with each other. Let your kubernetes node only build BGP neighbors with the on-line router device. (2) You'd better turn on the BGP ECMP function of the kubernetes node on-line router device. The effect of this method is that when the user's access traffic enters the router device, it is first balanced to the kubernetes node of the back end through ECMP load balancing, and then to the final pod through IPVS load balancing. When devices, links and nodes in the network are down, traffic can be automatically switched to other healthy devices, links and nodes.

After our test, we found that the BGP neighbors and routes between kube-router and the uplink switch did reduce a lot. before: after:

According to this practice, some problems have been solved. But in the case of large-scale kubernetes cluster, when we enable ECMP routing load balancing, BGP routing on the switch has changed a lot. There are tens of thousands of kubernetes cluster service routes on each switch. However, our switch equipment only supports 200000 route forwarding. With the growth of kubernetes cluster size, more and more routes will be routed on the switch, which will eventually lead to the exhaustion of switch equipment performance and failure to work properly.

We modified part of the source code of kube-router and added parameters such as "advertisement-cluster-subnet" to solve this problem.

cloudnativer commented 4 years ago

Each kubernetes cluster in our production environment has 4000 nodes, and the whole network is interconnected by BGP, which has been running stably for more than one year. There are many problems with kube-router in the large kubernetes cluster, and we have done a lot of optimization, so I want to contribute some information to the community.I have contributed an enhanced function in the large kubernetes cluster network to Kube router, as well as several practical documents about the large kubernetes cluster network. Please see https://github.com/cloudnativelabs/kube-router/pull/920.

rearden-steel commented 4 years ago

I think your changes are reasonable, we have the same network topology and also will suffer from the same problem.

murali-reddy commented 4 years ago

Just to clarify there is nothing implicit about the kube-router design that one would see these challenges with routing pod network CIDR. Users have to carefully choose the knobs provided kube-router that suits them. You could use iBGP or peer with just external routers or use route felectors etc. These are standard BGP configuration network engineers deal with. In this e.g. https://github.com/cloudnativelabs/kube-router/issues/923#issuecomment-638599383 these are the type of choices (e.g. --enable-ibgp=false) one has to choose at network design stage.

But in the case of large-scale kubernetes cluster, when we enable ECMP routing load balancing, BGP routing on the switch has changed a lot. There are tens of thousands of kubernetes cluster service routes on each switch.

Again I would not design large scale network where all service VIP for all the services are advertised. You should use kube-router.io/service.advertise.clusterip annotation and set --advertise-cluster-ip=false to choose which service cluster IP's are advertised. Not all services need to receive north-south traffic, only services which are expected to receive north south traffic should ideally use this annotation. Yes if you set --advertise-cluster-ip=true all the service cluster IP are advertised which is not desirable for large deployments.

A prescribed operations guide to design network topology with kube-router would be good. Hopefully documentation in #920 will evolve in this direction.

cloudnativer commented 4 years ago

I think your changes are reasonable, we have the same network topology and also will suffer from the same problem.

Yes, when I communicate with many R & D personnel of other companies, I find that they have the same problem. When the scale of kubernets cluster network becomes larger, the problem becomes more serious.

cloudnativer commented 4 years ago

Just to clarify there is nothing implicit about the kube-router design that one would see these challenges with routing pod network CIDR. Users have to carefully choose the knobs provided kube-router that suits them. You could use iBGP or peer with just external routers or use route felectors etc. These are standard BGP configuration network engineers deal with. In this e.g. #923 (comment) these are the type of choices (e.g. --enable-ibgp=false) one has to choose at network design stage.

But in the case of large-scale kubernetes cluster, when we enable ECMP routing load balancing, BGP routing on the switch has changed a lot. There are tens of thousands of kubernetes cluster service routes on each switch.

Again I would not design large scale network where all service VIP for all the services are advertised. You should use kube-router.io/service.advertise.clusterip annotation and set --advertise-cluster-ip=false to choose which service cluster IP's are advertised. Not all services need to receive north-south traffic, only services which are expected to receive north south traffic should ideally use this annotation. Yes if you set --advertise-cluster-ip=true all the service cluster IP are advertised which is not desirable for large deployments.

A prescribed operations guide to design network topology with kube-router would be good. Hopefully documentation in #920 will evolve in this direction.

If you set "--advertide-cluster-ip=false", our kubernetes service will not be able to route out.

However, in the large-scale kubernetes cluster network, if we have the following requirements at the same time: (1) We need to route the kubernetes service to the outside for direct access to the service; (2) At the same time, ECMP load balancing is enabled to enhance the availability of North-South network links; (3) We also need to reduce the number of BGP neighbors and the number of routing entries of the connected network devices.

We set the "--enable-ibgp=false", "--advertise-cluster-IP=true" and "--advertise-cluster-subnet=" parameters at the same time. Please see the solution documentation（ https://github.com/cloudnativer/kube-router-cnlabs/blob/advertise-cluster-subnet/docs/large-networks01.md )

A prescribed operations guide to design network topology with kube-router would be good. Hopefully documentation in #920 will evolve in this direction.

If you set "--advertide-cluster-ip=false", our kubernetes service will not be able to route out.

However, in the large-scale kubernetes cluster network, if we have the following requirements at the same time: (1) We need to route the kubernetes service to the outside for direct access to the service; (2) At the same time, ECMP load balancing is enabled to enhance the availability of North-South network links; (3) We also need to reduce the number of BGP neighbors and the number of routing entries of the connected network devices.

We set the "--enable-ibgp=false", "--advertise-cluster-IP=true" and "--advertise-cluster-subnet=" parameters at the same time. Please see the solution documentation（ https://github.com/cloudnativer/kube-router-cnlabs/blob/advertise-cluster-subnet/docs/large-networks01.md )

cloudnativer commented 4 years ago

Just to clarify there is nothing implicit about the kube-router design that one would see these challenges with routing pod network CIDR. Users have to carefully choose the knobs provided kube-router that suits them. You could use iBGP or peer with just external routers or use route felectors etc. These are standard BGP configuration network engineers deal with. In this e.g. #923 (comment) these are the type of choices (e.g. --enable-ibgp=false) one has to choose at network design stage.

But in the case of large-scale kubernetes cluster, when we enable ECMP routing load balancing, BGP routing on the switch has changed a lot. There are tens of thousands of kubernetes cluster service routes on each switch.

Again I would not design large scale network where all service VIP for all the services are advertised. You should use kube-router.io/service.advertise.clusterip annotation and set --advertise-cluster-ip=false to choose which service cluster IP's are advertised. Not all services need to receive north-south traffic, only services which are expected to receive north south traffic should ideally use this annotation. Yes if you set --advertise-cluster-ip=true all the service cluster IP are advertised which is not desirable for large deployments.

A prescribed operations guide to design network topology with kube-router would be good. Hopefully documentation in #920 will evolve in this direction.

Let me add that I will further improve the document according to what you said in the near future.

murali-reddy commented 4 years ago

If you set "--advertide-cluster-ip=false", our kubernetes service will not be able to route out.

@cloudnativer Have you tried kube-router.io/service.advertise.clusterip?

cloudnativer commented 4 years ago

If you set "--advertide-cluster-ip=false", our kubernetes service will not be able to route out.

@cloudnativer Have you tried kube-router.io/service.advertise.clusterip?

[requirements and test instructions]

Suppose we have a kubernetes service network segment with a range of 172.30.0.0/16. There are 100 running services in the cluster. Our node server has a 172.32.0.128/25 pod CIDR network segment with 20 running pods. We need to announce the two network segments of kubernetes service and kubernetes pod to the connected network device, so that we can directly access the service and pod from the outside. We did the following tests according to your method.

[Test 1]

kube-router image version:

image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)

Annotations is set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16

Args is set to:

args:
- --run-router=true
- --run-firewall=true
- --run-service-proxy=true
- --enable-overlay=false
- --enable-pod-egress=false
- --advertise-cluster-ip=false
- --advertise-pod-cidr=true
- --masquerade-all=false
- --bgp-graceful-restart=true
- --enable-ibgp=false
- --nodes-full-mesh=true
- --cluster-asn=64558
- --peer-router-ips=192.168.140.1
- --peer-router-asns=64558
- --kubeconfig=/etc/kubernetes/ssl/kubeconfig

The test results are as follows:

Routing table description on the uplink network device:

One kubernetes service route of 172.30.0.0/16 is specified. [not learned]
100 32-bit kubernetes service host routes in the cluster. [not learned]
A 172.32.0.128/25 pod CIDR summary network segment route of the current node. [learned]

[Test 2]

kube-router image version:

image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)

Annotations is set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16

Args is set to:

args:
- --run-router=true
- --run-firewall=true
- --run-service-proxy=true
- --enable-overlay=false
- --enable-pod-egress=false
- --advertise-cluster-ip=false
- --advertise-pod-cidr=false
- --masquerade-all=false
- --bgp-graceful-restart=true
- --enable-ibgp=false
- --nodes-full-mesh=true
- --cluster-asn=64558
- --peer-router-ips=192.168.140.1
- --peer-router-asns=64558
- --kubeconfig=/etc/kubernetes/ssl/kubeconfig

The test results are as follows:

Routing table description on the uplink network device:

One kubernetes service route of 172.30.0.0/16 is specified. [not learned]
100 32-bit kubernetes service host routes in the cluster. [not learned]
A 172.32.0.128/25 pod CIDR summary network segment route of the current node. [not learned]

[Test 3]

kube-router image version:

image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)

Annotations is set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16

Args is set to:

args:
- --run-router=true
- --run-firewall=true
- --run-service-proxy=true
- --enable-overlay=false
- --enable-pod-egress=false
- --advertise-cluster-ip=true
- --advertise-pod-cidr=false
- --masquerade-all=false
- --bgp-graceful-restart=true
- --enable-ibgp=false
- --nodes-full-mesh=true
- --cluster-asn=64558
- --peer-router-ips=192.168.140.1
- --peer-router-asns=64558
- --kubeconfig=/etc/kubernetes/ssl/kubeconfig

The test results are as follows:

Routing table description on the uplink network device:

One kubernetes service route of 172.30.0.0/16 is specified. [not learned]
100 32-bit kubernetes service host routes in the cluster. [learned]
A 172.32.0.128/25 pod CIDR summary network segment route of the current node. [not learned]

[Test 4]

kube-router image version:

image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)

Annotations is set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16

3 Args is set to:

    args:
    - --run-router=true
    - --run-firewall=true
    - --run-service-proxy=true
    - --enable-overlay=false
    - --enable-pod-egress=false
    - --advertise-cluster-ip=true
    - --advertise-pod-cidr=true
    - --masquerade-all=false
    - --bgp-graceful-restart=true
    - --enable-ibgp=false
    - --nodes-full-mesh=true
    - --cluster-asn=64558
    - --peer-router-ips=192.168.140.1
    - --peer-router-asns=64558
    - --kubeconfig=/etc/kubernetes/ssl/kubeconfig

The test results are as follows:

Routing table description on the uplink network device:

One kubernetes service route of 172.30.0.0/16 is specified. [not learned]
100 32-bit kubernetes service host routes in the cluster. [learned]
A 172.32.0.128/25 pod CIDR summary network segment route of the current node. [learned]

[Test 5]

kube-router image version:

image: My branch version (https://github.com/cloudnativer/kube-router-cnlabs/tree/advertise-cluster-subnet)

Annotations is not set.

Args is set to:

args:
- --run-router=true
- --run-firewall=true
- --run-service-proxy=true
- --enable-overlay=false
- --enable-pod-egress=false
- --advertise-cluster-ip=true
- --advertise-cluster-subnet=172.30.0.0/16
- --advertise-pod-cidr=true
- --masquerade-all=false
- --bgp-graceful-restart=true
- --enable-ibgp=false
- --nodes-full-mesh=true
- --cluster-asn=64558
- --peer-router-ips=192.168.140.1
- --peer-router-asns=64558
- --kubeconfig=/etc/kubernetes/ssl/kubeconfig

The test results are as follows:

Routing table description on the uplink network device:

One kubernetes service route of 172.30.0.0/16 is specified. [learned]
100 32-bit kubernetes service host routes in the cluster. [not learned]
A 172.32.0.128/25 pod CIDR summary network segment route of the current node. [learned]

@murali-reddy

Attach my yaml template file for testing:

test.yaml.txt

I didn't use "kube-router.io/service.advertise.clusterip" to test the effect you said. Did I test it wrong? Or this "kube-router.io/service.advertise.clusterip" can't realize my previous requirements? But we did use the "advertise-cluster-subnet" parameter to implement the previous requirements.

cloudnativer commented 4 years ago

Please note that I've changed "advertise-cluster-subnet" to "advertise-service-cluster-ip-range". Keep the same parameter names as kube-api-server, kubeadm etc. Please see https://github.com/cloudnativelabs/kube-router/pull/920.

murali-reddy commented 4 years ago

@cloudnativer Apologies for delay in reverting back. I am focussing on getting 1.0 release out so hence the delay. Will leave comment in the PR

cloudnativer commented 4 years ago

@cloudnativer Apologies for delay in reverting back. I am focussing on getting 1.0 release out so hence the delay. Will leave comment in the PR

OK。

murali-reddy commented 4 years ago

Adding some context to the problem. Kube-router's implementation of network load balancer is based on Ananta and Maglev. In both the models there are set of dedicated load balancer nodes (Mux in ananta and Maglev in Maglev) which are BGP speakers and advertise service VIP's. In case of Kubernetes each nodes is a load balancer/service proxy as well. So essentially each node in the cluster is part of distributed load balancer. So if each of them is BGP speaker then advertising /32 routes for service VIP's can bloat the routing table as desribed above.

But perhaps this is something that can be addressed at leaf routers by advertising service IP range. Neverthless its good weigh in pros and cons and presribe when to use what.

cloudnativer commented 4 years ago

Adding some context to the problem. Kube-router's implementation of network load balancer is based on Ananta and Maglev. In both the models there are set of dedicate load balancer nodes (Mux in ananta and Maglev in Maglev) which are BGP speakers and advertise service VIP's. In case of Kubernetes each nodes is a load balancer/service proxy as well. So essentiall each node in the cluster is part of distributed load balancer. So if each of them is BGP speaker then advertising /32 routes for service VIP's can bloat the routing table as desribed above.

Yes, I agree with that.

But perhaps this is something that can be addressed at leaf routers by advertising service IP range. Neverthless its good weigh in pros and cons and presribe when to use what.

Yes, we can advertise the service IP range on the leaf router to reduce the number of spine routers.But in a large-scale kubernetes cluster network, if all Kube-routers advertise 32-bit host routing, the number of routes on the leaf router will also multiply. If only advertising the service IP range on the leaf router, it can't solve the problem of increasing the number of routes on the leaf router itself.Therefore, we need to be able to achieve the IP range of advertising service on the Kube-router of the server, which is used to reduce the number of leaf routers and uplink routers.

cloudnativer commented 4 years ago

According to the requirements of murali-Reddy, we split the document and code:

The documentation for solving this problem is here: https://github.com/cloudnativelabs/kube-router/pull/944 .
The code to solve this problem is here: https://github.com/cloudnativelabs/kube-router/pull/920 .

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stale for 5 days with no activity.

cloudnativelabs / kube-router

Too many BGP routing entries and neighbors between kube-router server and connected network devices #923