flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes
Apache License 2.0

bad network performance with large cluster #1823

Closed chris93111 closed 7 months ago

chris93111 commented 8 months ago

Hi, I have a k3s 1.24 cluster with 8 nodes and 400 pods.

I have an issue with a Python API that uses Consul KV as its database: when I call it I get high latency / poor performance. Moving the API to another small cluster (same k3s 1.24) while leaving Consul in place, the performance is 10x better. I also tried moving Consul and leaving the API, and the performance is still degraded, which makes me think the problem is with this cluster, probably the network.

The cluster/VMs show no CPU, memory, or network saturation.

What I have checked so far:

- In tcpdump I don't see bad UDP checksums or any sign of a flannel bug.
- iptables is the legacy version (`iptables v1.4.21: no command specified`).
- `ethtool -K eth0 tx off sg off tso off` doesn't change performance.
- iperf shows the same performance on the small cluster and on the big one.

The k3s check shows: `(RHEL7/CentOS7: User namespaces disabled; add 'user_namespace.enable=1' to boot command line)`

In this cluster I have MetalLB / Knative / Istio, which generate many iptables rules. Could this cause bad performance?
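
For reference, a quick way to count how many iptables rules those components have created on a node (a sketch; run as root on any node):

```sh
# Total number of iptables rules currently loaded
iptables-save | wc -l

# Per-table breakdown (most service-related rules end up in nat and filter)
for t in filter nat mangle; do
  printf '%s: ' "$t"
  iptables-save -t "$t" | grep -c -- '^-A'
done
```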

Environmental Info: K3s Version: k3s version v1.24.7+k3s1 (https://github.com/k3s-io/k3s/commit/7af5b16788afe9ce1718d7b75b35eafac7454705) go version go1.18.7

Node(s) CPU architecture, OS, and Version: Linux vlpsocfg04-node 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Aug 31 20:15:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux, Red Hat 7.9

Cluster Configuration: 3 masters, 2 nodes with MetalLB and Contour, 8 nodes with Longhorn

Describe the bug: slow network performance

rbrtbnfgl commented 8 months ago

Have you tried checking the network latency/throughput on your cluster?
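
For example, something like this measures raw pod-to-pod throughput independently of the application (a sketch; the pod names are placeholders and `networkstatic/iperf3` is just one public image whose entrypoint is iperf3):

```sh
# Start an iperf3 server in one pod
kubectl run iperf-server --image=networkstatic/iperf3 --restart=Never -- -s

# Wait for it to be ready and grab its pod IP
kubectl wait --for=condition=Ready pod/iperf-server
SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')

# Run a client from a second pod against the server pod IP
kubectl run iperf-client --image=networkstatic/iperf3 --restart=Never --rm -it -- -c "$SERVER_IP"
```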

chris93111 commented 8 months ago

Hi @rbrtbnfgl, thanks for your response.

I see nothing special in Grafana. I have tried stopping the components with heavy network usage (Longhorn, Knative, etc.), but the result is the same. Here is a diagram to help you understand my problem: clusters 1 and 2 have the same components, cluster 3 has only Longhorn and few pods.

(image attachments: cluster diagram and Grafana screenshots)
rbrtbnfgl commented 8 months ago

One check you can do is to verify the iptables version. Some distros shipped a version that performs badly when there are a lot of rules. You can try running k3s with the --prefer-bundled-bin flag to use the bundled version and verify whether you still have the issue.
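
For example (a sketch, assuming k3s is installed as a systemd service; the config-file form mirrors the CLI flag and may depend on your k3s version):

```sh
# Quick test on one server node: stop the service and run k3s by hand with the flag
sudo systemctl stop k3s
sudo k3s server --prefer-bundled-bin

# To make it permanent, the flag can also go in the k3s config file
# (/etc/rancher/k3s/config.yaml):
#   prefer-bundled-bin: true
# then: sudo systemctl restart k3s
```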

chris93111 commented 8 months ago

Hi @rbrtbnfgl, I have now tried with --prefer-bundled-bin and cleaned the iptables rules, but this changes nothing. Can I check whether iptables 1.8.8 is actually being used?

If I run the Python app directly on the host, not on the k8s network, I don't have the problem.
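
One way to check which iptables version is actually in effect (a sketch; the bundled-binary path below is the usual k3s data layout and may differ on your install):

```sh
# The distro-provided iptables on the host (legacy 1.4.21 here)
iptables --version

# The iptables shipped inside the k3s data dir, used when --prefer-bundled-bin is set
/var/lib/rancher/k3s/data/current/bin/iptables --version
```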

rbrtbnfgl commented 7 months ago

If I understand your setup correctly: you are running Consul as a deployment exposed as a service, and the Python application contacts the Consul services on the various clusters. When the Python application runs inside a pod on the cluster you get bad performance, but when the same application runs directly on the nodes and contacts Consul on the various clusters you don't get any issue.

chris93111 commented 7 months ago

@rbrtbnfgl Yes, that's exactly it! The pod has no CPU/memory limits.

chris93111 commented 7 months ago

I may have found the problem: the Python app works correctly in an empty namespace. I saw an issue where the number of environment variables injected into the pod can slow down execution, and Knative creates many services in the target namespace (see the links and the check below).

https://github.com/kubernetes/kubernetes/issues/92615 https://github.com/zio/zio-config/issues/418
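
A quick way to see how many service-link environment variables end up injected into a pod (a sketch; the namespace and pod names are placeholders):

```sh
# Total environment variables visible inside the pod
kubectl exec -n my-namespace my-api-pod -- env | wc -l

# Only the Kubernetes service-link variables (several per service in the namespace)
kubectl exec -n my-namespace my-api-pod -- env | grep -c '_SERVICE_'
```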

I confirm my problem is solved after disabling service links (enableServiceLinks).
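
For anyone hitting the same thing, the field in question is `enableServiceLinks` in the pod spec (a sketch; the deployment name is a placeholder):

```sh
# Stop injecting one set of *_SERVICE_* variables per service into new pods
kubectl patch deployment my-api --type=merge \
  -p '{"spec":{"template":{"spec":{"enableServiceLinks":false}}}}'

# Equivalent manifest field:
#   spec.template.spec.enableServiceLinks: false
```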

rbrtbnfgl commented 7 months ago

Good to know. Sorry if I couldn't help you enough.

chris93111 commented 7 months ago

@rbrtbnfgl No problem, thank you for your time.