cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Poor throughput in Cilium compared to Calico with iPerf3 after migration #33669

Open cdtzabra opened 1 month ago

cdtzabra commented 1 month ago

Is there an existing issue for this?

What happened?

We migrated the CNI of our clusters from Calico to Cilium. Before the migration, we set up iPerf3 tests so that we could compare performance once it was complete. On all migrated clusters, iPerf3 throughput dropped drastically after the switch to Cilium.

I've seen 3 old issues on this subject that were closed without a response.

Like those previous tickets, I can't explain this big difference in throughput, given that Cilium is supposed to perform as well as Calico or even better.

On the graphs below, you can see the drastic drop once the migration is complete. The tests were done with iPerf3:

An iPerf3 server pod and an iPerf3 client CronJob that runs a test every 5 minutes.

[Screenshots: iPerf3 throughput graphs before and after the migration]
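
For reference, a minimal sketch of how such a setup can be created with kubectl (the image and exact flags are illustrative assumptions; the actual manifests used here are linked later in the thread):

# iperf3 server Deployment + Service (illustrative image name)
kubectl create deployment iperf-server --image=networkstatic/iperf3 -- iperf3 -s
kubectl expose deployment iperf-server --port=5201 --target-port=5201

# Client CronJob running a 10-second test against the server Service every 5 minutes
kubectl create cronjob iperf-client --image=networkstatic/iperf3 \
  --schedule="*/5 * * * *" -- iperf3 -c iperf-server -p 5201 -t 10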

Cilium Version

v1.15.6

Kernel Version

5.15.0-107-generic #117-Ubuntu

Kubernetes Version

v1.25.4

Regression

No response

Sysdump

Cannot supply, as this is a production environment

Relevant log output

Linux iperf-server-64b4468f4-9gprj 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC 2024 x86_64
-----------------------------------------------------------
Server listening on 5201 (test #109)
-----------------------------------------------------------
Time: Tue, 09 Jul 2024 08:05:10 GMT
Accepted connection from 10.0.5.146, port 51580
      Cookie: 67nx54jazmvuw4fwor6tzlqqwi4qgai2ofvz
      TCP MSS: 0 (default)
[  5] local 10.0.8.165 port 5201 connected to 10.0.5.146 port 51592
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   427 MBytes  3.58 Gbits/sec
[  5]   1.00-2.00   sec   426 MBytes  3.57 Gbits/sec
[  5]   2.00-3.00   sec   430 MBytes  3.61 Gbits/sec
[  5]   3.00-4.03   sec   405 MBytes  3.29 Gbits/sec
[  5]   4.03-5.00   sec   400 MBytes  3.48 Gbits/sec
[  5]   5.00-6.00   sec   391 MBytes  3.28 Gbits/sec
[  5]   6.00-7.00   sec   375 MBytes  3.15 Gbits/sec
[  5]   7.00-8.01   sec   380 MBytes  3.17 Gbits/sec
[  5]   8.01-9.00   sec   358 MBytes  3.03 Gbits/sec
[  5]   9.00-10.00  sec   406 MBytes  3.39 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate
[  5] (sender statistics not available)
[  5]   0.00-10.00  sec  3.91 GBytes  3.35 Gbits/sec                  receiver
rcv_tcp_congestion cubic
iperf 3.12
Linux iperf-server-64b4468f4-9gprj 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC 2024 x86_64
-----------------------------------------------------------
Server listening on 5201 (test #110)
-----------------------------------------------------------
Time: Tue, 09 Jul 2024 08:10:15 GMT
Accepted connection from 10.0.5.95, port 53134
      Cookie: 5fvt6t73cakuugb5ksibwnhqs6ljs2uovbst
      TCP MSS: 0 (default)
[  5] local 10.0.8.165 port 5201 connected to 10.0.5.95 port 53140
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   411 MBytes  3.45 Gbits/sec
[  5]   1.00-2.00   sec   426 MBytes  3.57 Gbits/sec
[  5]   2.00-3.00   sec   449 MBytes  3.77 Gbits/sec
[  5]   3.00-4.00   sec   442 MBytes  3.71 Gbits/sec
[  5]   4.00-5.00   sec   384 MBytes  3.22 Gbits/sec
[  5]   5.00-6.00   sec   469 MBytes  3.93 Gbits/sec
[  5]   6.00-7.00   sec   396 MBytes  3.32 Gbits/sec
[  5]   7.00-8.02   sec   364 MBytes  2.98 Gbits/sec
[  5]   8.02-9.00   sec   364 MBytes  3.13 Gbits/sec
[  5]   9.00-10.00  sec   365 MBytes  3.07 Gbits/sec
[  5]  10.00-10.01  sec  1.19 MBytes  1.11 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate
[  5] (sender statistics not available)
[  5]   0.00-10.01  sec  3.98 GBytes  3.41 Gbits/sec                  receiver
rcv_tcp_congestion cubic
iperf 3.12
Linux iperf-server-64b4468f4-9gprj 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC 2024 x86_64
-----------------------------------------------------------
Server listening on 5201 (test #111)
-----------------------------------------------------------
Time: Tue, 09 Jul 2024 08:15:11 GMT
Accepted connection from 10.0.5.101, port 50142
      Cookie: x7danyuzlcxnhrapj3hel3obgbpqcpznnc5m
      TCP MSS: 0 (default)
[  5] local 10.0.8.165 port 5201 connected to 10.0.5.101 port 50154
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   320 MBytes  2.69 Gbits/sec
[  5]   1.00-2.00   sec   443 MBytes  3.71 Gbits/sec
[  5]   2.00-3.00   sec   420 MBytes  3.52 Gbits/sec
[  5]   3.00-4.00   sec   366 MBytes  3.07 Gbits/sec
[  5]   4.00-5.00   sec   431 MBytes  3.61 Gbits/sec
[  5]   5.00-6.00   sec   412 MBytes  3.46 Gbits/sec
[  5]   6.00-7.00   sec   449 MBytes  3.76 Gbits/sec
[  5]   7.00-8.00   sec   431 MBytes  3.62 Gbits/sec
[  5]   8.00-9.00   sec   401 MBytes  3.36 Gbits/sec
[  5]   9.00-10.00  sec   416 MBytes  3.48 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate
[  5] (sender statistics not available)
[  5]   0.00-10.00  sec  3.99 GBytes  3.43 Gbits/sec                  receiver
rcv_tcp_congestion cubic
iperf 3.12
Linux iperf-server-64b4468f4-9gprj 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC 2024 x86_64

Anything else?

Cilium is installed with kube-proxy-replacement: true, bpf-masquerade enabled, l7-proxy enabled, and Hubble enabled ...

kubectl get cm -o yaml cilium-config -n kube-system | egrep "enable-bpf-masquerade|enable-ipv4-egress-gateway|kube-proxy-replacement|enable-l7-proxy|tunnel"
  enable-bpf-masquerade: "true"
  enable-l7-proxy: "true"
  kube-proxy-replacement: "true"
  kube-proxy-replacement-healthz-bind-address: 0.0.0.0:10256
  routing-mode: tunnel
  tunnel-port: "8473"
  tunnel-protocol: vxlan
  enable-hubble: "true"
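
For a Helm-based installation, these ConfigMap entries correspond roughly to the following chart values (a sketch; exact value names can vary between chart versions, and the release name/namespace are assumptions):

# Approximate Helm equivalents of the settings above (Cilium 1.15 chart)
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set kubeProxyReplacement=true \
  --set bpf.masquerade=true \
  --set l7Proxy=true \
  --set routingMode=tunnel \
  --set tunnelProtocol=vxlan \
  --set tunnelPort=8473 \
  --set hubble.enabled=true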

Cilium Users Document

Code of Conduct

learnitall commented 1 month ago

Would you be able to run the test after disabling Hubble and performing a rollout of the cilium-agent Daemonset?
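
For reference, one way to do this (a sketch, assuming a Helm-based install with the release name cilium in kube-system; the cilium CLI's hubble disable subcommand is an alternative):

# Disable Hubble and restart the agents
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set hubble.enabled=false
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium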

cdtzabra commented 1 month ago

Would you be able to run the test after disabling Hubble and performing a rollout of the cilium-agent Daemonset?

Of course. I'll disable it tomorrow, leave the iperf3 client CronJob running over the weekend, and add new screenshots on Monday/Tuesday.

borkmann commented 1 month ago

Also, if you could check with 1.14 and 1.16 as well, that would be interesting to see if there's a regression of some sort.
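
A rough sketch of how another version could be tried via Helm on a test cluster (release name, namespace, and version number are illustrative):

helm repo update
helm upgrade cilium cilium/cilium -n kube-system --version 1.16.0 --reuse-values
kubectl -n kube-system rollout status daemonset/cilium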

marseel commented 1 month ago

I wonder, could you share more information about your setup? Do you use a cloud provider or on-prem? What are the nodes that the server/client is running on? Could you ensure that client/server pods are running on new nodes and not migrated nodes? Just to make sure there is no lingering state of Calico on the node hindering performance. Could you share iperf3 server/cronjob yamls? Just to make our replication a bit easier. I am guessing traffic to the server goes through a k8s service?

I'm asking for two reasons:

  • to figure out if we can replicate that easily
  • Also, did the iperf3 server change nodes (I'm guessing probably yes)? Usually, different nodes in cloud-provider environments can have different traffic throughput.

cdtzabra commented 1 month ago

Would you be able to run the test after disabling Hubble and performing a rollout of the cilium-agent Daemonset?

Of course. I'll disable it tomorrow, leave the iperf3 client CronJob running over the weekend, and add new screenshots on Monday/Tuesday.

@learnitall

Here are the new screenshots: no change in performance. (Not very surprising; on another cluster, weeks before, I had disabled Hubble for an hour and it hadn't changed anything.)

[Screenshots: iPerf3 throughput with Hubble disabled]
cdtzabra commented 1 month ago

Also, if you could check with 1.14 and 1.16 as well, that would be interesting to see if there's a regression of some sort.

I'll be able to test 1.16 as soon as we've migrated a cluster to that version. However, I won't be able to test 1.14; no cluster is available with that version.

cdtzabra commented 1 month ago

I wonder, could you share more information about your setup? Do you use a cloud provider or on-prem? What are the nodes that the server/client is running on? Could you ensure that client/server pods are running on new nodes and not migrated nodes? Just to make sure there is no lingering state of Calico on the node hindering performance. Could you share iperf3 server/cronjob yamls? Just to make our replication a bit easier. I am guessing traffic to the server goes through a k8s service?

I'm asking for two reasons:

  • to figure out if we can replicate that easily
  • Also, did the iperf3 server change nodes (I'm guessing probably yes)? Usually, different nodes in cloud-provider environments can have different traffic throughput.

Do you use a cloud provider or on-prem?

It is on-prem infrastructure: vSphere

What are the nodes that the server/client is running on?

The server and client are running on the same cluster and can be scheduled on any node where (customer/app) workloads are allowed, i.e. all nodes.
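
For completeness, a quick way to check which nodes the pods actually land on (command sketch; the grep pattern and <node-name> are placeholders):

# Show which nodes the iperf3 server and client pods are scheduled on (NODE column)
kubectl get pods -o wide | grep iperf

# Compare the characteristics of those nodes (kernel, CPU, memory)
kubectl describe node <node-name> | egrep -i "kernel|cpu:|memory:"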

Could you ensure that client/server pods are running on new nodes and not migrated nodes? Just to make sure there is no lingering state of Calico on the node hindering performance.

We don't have any newly added nodes in the cluster. All nodes within the cluster are migrated nodes.

We did a cleanup and reboot after deleting Calico.

Kube cleanup

# Identify the Calico version
CALICO_VERSION=$(kubectl -n kube-system get ds/calico-node -o yaml | grep "calico-cni:v[0-9]*.[0-9]*.[0-9]*" -o  | cut -s -d ':' -f2   | head -n1)

echo $CALICO_VERSION
  v3.24.5

kubectl -n kube-system delete -f https://raw.githubusercontent.com/projectcalico/calico/${CALICO_VERSION}/manifests/calico.yaml
  configmap "calico-config" deleted
  Warning: deleting cluster-scoped resources, not scoped to the provided namespace
  customresourcedefinition.apiextensions.k8s.io "bgpconfigurations.crd.projectcalico.org" deleted
  ....
  daemonset.apps "calico-node" deleted
  deployment.apps "calico-kube-controllers" deleted

# Re-check
kubectl get all --all-namespaces | grep calico
kubectl get crd | grep calico
kubectl api-resources  -o name  | grep calico

Node cleanup (on each node)

# Drain and cordon the node
kubectl drain --ignore-daemonsets --delete-emptydir-data --disable-eviction $NODE

ls -al /etc/cni/net.d/calico* /var/lib/calico
rm -rf /etc/cni/net.d/calico* /var/lib/calico

ls -al /opt/cni/bin/calico /opt/cni/bin/calico-ipam
rm -rf /opt/cni/bin/calico /opt/cni/bin/calico-ipam

iptables -L -n --line-numbers
iptables -F
iptables -X

# Reboot the node
reboot
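
After the reboot, a quick sanity check that no Calico state remains on a node might look like this (illustrative; cali*/tunl* are the interface and chain prefixes Calico typically uses):

# Verify no leftover Calico iptables rules, interfaces, or CNI config
iptables-save | grep -i cali
ip link show | grep -Ei 'cali|tunl'
ls /etc/cni/net.d/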

Could you share iperf3 server/cronjob yamls? Just to make our replication a bit easier.

I just pushed them here (sorry, it's not a Helm chart): https://github.com/cdtzabra/iPerf3-Kubernetes-CNI-Benchmark/tree/main

learnitall commented 1 month ago

Thanks for all the extra detail @cdtzabra, it's super helpful. I chatted with @marseel about this offline. Both of us are limited in our understanding of vSphere and the details of Calico's installation and operation. There are some things worth sanity-checking, though, before we start digging into Cilium's datapath.

One thing that stood out in your previous comments was:

The server and client are running on the same cluster and can be scheduled on any node where (customer/app) workloads are allowed, i.e. all nodes.

We'd like to confirm that the characteristics of the node that the iperf3 server and client are deployed to aren't making an impact. We are wondering: did the installation of Cilium impact the results of the iperf3 workload, or did the rescheduling of pods in the cluster impact them?

Couple of questions to help dig into this:

  1. Would you be able to determine the characteristics of the node that the iperf3 server was deployed to before/after migrating to Cilium? Is there a difference in the type of NIC, amount of memory, the amount of CPU, or the type of customer/app workloads that are deployed on the node?
  2. Would you also be able to do the same for the iperf3 clients? Is there a difference in the general pattern of nodes that the clients are deployed to before/after migrating to Cilium?
cdtzabra commented 1 month ago
  • Would you be able to determine the characteristics of the node that the iperf3 server was deployed to before/after migrating to Cilium? Is there a difference in the type of NIC, amount of memory, the amount of CPU, or the type of customer/app workloads that are deployed on the node?
  • Would you also be able to do the same for the iperf3 clients? Is there a difference in the general pattern of nodes that the clients are deployed to before/after migrating to Cilium?

There's no difference. The nodes are exactly the same before and after the migration. Nothing has changed.

To clarify: we only migrated the CNI from Calico to Cilium on our existing clusters. So the clusters are still the same, with the same nodes; the nodes haven't changed (the workloads run on the same nodes before and after the migration).

The CNI migration was based on this doc: https://docs.cilium.io/en/stable/installation/k8s-install-migration. I just transposed the steps into a Makefile to limit copy-pasting.

learnitall commented 1 month ago

Ok, thank you for confirming. Would you be able to share the details of the configuration of your previous Calico installation? This will help me replicate the performance drop and do some debugging.

cdtzabra commented 1 month ago

Ok, thank you for confirming. Would you be able to share the details of the configuration of your previous Calico installation? This will help me replicate the performance drop and do some debugging.

After creating the cluster with kubeadm, we deployed Calico with kubectl -n kube-system apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/calico.yaml
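
For replication purposes, the encapsulation mode that this default manifest enables can be checked directly from the manifest itself (a quick sketch; the environment variable names below are Calico's documented IP pool settings):

# Check which encapsulation the default Calico manifest configures
curl -s https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/calico.yaml \
  | grep -E "CALICO_IPV4POOL_(IPIP|VXLAN)" -A1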

squeed commented 1 month ago

Hi there, @cdtzabra. I'm the author of that migration document. One of the important steps in calico migration is using so-called legacy host routing, which is necessary for hybrid mode. Once migration is complete, you can re-enable eBPF host routing. Have you done this? It may account for the performance cost.
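
One way to verify the current host-routing mode and, if needed, switch back to eBPF host routing (a sketch; bpf.hostLegacyRouting is the Helm value the migration guide toggles, and a Helm-based install is assumed):

# Check the active host routing mode
kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium status | grep "Host Routing"

# Re-enable eBPF host routing once the migration is complete
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set bpf.hostLegacyRouting=false
kubectl -n kube-system rollout restart daemonset/cilium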

cdtzabra commented 1 month ago

Hi there, @cdtzabra. I'm the author of that migration document. One of the important steps in calico migration is using so-called legacy host routing, which is necessary for hybrid mode. Once migration is complete, you can re-enable eBPF host routing. Have you done this? It may account for the performance cost.

Hi @squeed

YES, we have enabled eBPF host routing

In the output below you can see that Host Routing = BPF.

k -n kube-system exec -it ds/cilium -c cilium-agent -- cilium status

KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.28 (v1.28.4) [linux/amd64]
Kubernetes APIs:         ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    True   [ens192   192.168.221.125 fe80::250:56ff:fe92:e8a3 (Direct Routing)]
Host firewall:           Disabled
SRv6:                    Disabled
CNI Chaining:            none
CNI Config file:         successfully wrote CNI configuration file to /host/etc/cni/net.d/05-cilium.conflist
Cilium:                  Ok   1.15.6 (v1.15.6-a09e05e6)
NodeMonitor:             Listening for events on 128 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok
IPAM:                    IPv4: 4/254 allocated from 10.0.7.0/24,
IPv4 BIG TCP:            Disabled
IPv6 BIG TCP:            Disabled
BandwidthManager:        Disabled
Host Routing:            BPF
Masquerading:            BPF   [ens192]   10.0.7.0/24 [IPv4: Enabled, IPv6: Disabled]
Controller Status:       33/33 healthy
Proxy Status:            OK, ip 10.0.7.152, 0 redirects active on ports 10000-20000, Envoy: embedded
Global Identity Range:   min 256, max 65535
Hubble:                  Ok                Current/Max Flows: 4095/4095 (100.00%), Flows/s: 90.24   Metrics: Ok
Encryption:              Disabled
Cluster health:          16/16 reachable   (2024-07-26T15:53:22Z)
Modules Health:          Stopped(0) Degraded(0) OK(11)

Verbose mode

k -n kube-system exec -it ds/cilium -c cilium-agent -- cilium-dbg status --verbose
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.28 (v1.28.4) [linux/amd64]
Kubernetes APIs:        ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   True   [ens192   192.168.221.125 fe80::250:56ff:fe92:e8a3 (Direct Routing)]
Host firewall:          Disabled
SRv6:                   Disabled
CNI Chaining:           none
CNI Config file:        successfully wrote CNI configuration file to /host/etc/cni/net.d/05-cilium.conflist
Cilium:                 Ok   1.15.6 (v1.15.6-a09e05e6)
NodeMonitor:            Listening for events on 128 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
IPAM:                   IPv4: 4/254 allocated from 10.0.7.0/24,
Allocated addresses:
  10.0.7.152 (router)
  10.0.7.175 (kube-system/coredns-7b6dc7894d-7qwwd)
  10.0.7.196 (health)
  10.0.7.80 (elastic-system/filebeat-filebeat-7hjch [restored])
IPv4 BIG TCP:           Disabled
IPv6 BIG TCP:           Disabled
BandwidthManager:       Disabled
Host Routing:           BPF
Masquerading:           BPF   [ens192]   10.0.7.0/24 [IPv4: Enabled, IPv6: Disabled]
Clock Source for BPF:   ktime
Controller Status:      33/33 healthy
  Name                                                    Last success     Last error       Count   Message
  bpf-map-sync-cilium_lxc                                 9s ago           never            0       no error
  cilium-health-ep                                        37s ago          never            0       no error
  dns-garbage-collector-job                               52s ago          never            0       no error
  endpoint-1111-regeneration-recovery                     never            never            0       no error
  endpoint-1503-regeneration-recovery                     never            never            0       no error
  endpoint-3332-regeneration-recovery                     never            never            0       no error
  endpoint-636-regeneration-recovery                      never            never            0       no error
  endpoint-gc                                             2m57s ago        never            0       no error
  ep-bpf-prog-watchdog                                    25s ago          never            0       no error
  fqdn-selector-checkpointing                             266h53m35s ago   never            0       no error
  ipcache-inject-labels                                   37s ago          266h53m35s ago   0       no error
  k8s-heartbeat                                           25s ago          never            0       no error
  link-cache                                              11s ago          never            0       no error
  neighbor-table-refresh                                  5s ago           never            0       no error
  resolve-identity-1111                                   4m12s ago        never            0       no error
  resolve-identity-1503                                   2m53s ago        never            0       no error
  resolve-labels-/                                        266h53m29s ago   never            0       no error
  resolve-labels-elastic-system/filebeat-filebeat-7hjch   266h53m26s ago   never            0       no error
  resolve-labels-kube-system/coredns-7b6dc7894d-7qwwd     258h29m40s ago   never            0       no error
  restoring-ep-identity (3332)                            266h53m29s ago   never            0       no error
  restoring-ep-identity (636)                             266h53m29s ago   never            0       no error
  sync-host-ips                                           37s ago          never            0       no error
  sync-lb-maps-with-k8s-services                          266h53m29s ago   never            0       no error
  sync-policymap-1111                                     14m13s ago       never            0       no error
  sync-policymap-1503                                     7m52s ago        never            0       no error
  sync-policymap-3332                                     7m52s ago        never            0       no error
  sync-policymap-636                                      7m50s ago        never            0       no error
  sync-to-k8s-ciliumendpoint (1111)                       9s ago           never            0       no error
  sync-to-k8s-ciliumendpoint (3332)                       9s ago           never            0       no error
  sync-utime                                              42s ago          never            0       no error
  template-dir-watcher                                    never            never            0       no error
  waiting-initial-global-identities-ep (3332)             266h53m29s ago   never            0       no error
  write-cni-file                                          266h53m35s ago   never            0       no error
Proxy Status:            OK, ip 10.0.7.152, 0 redirects active on ports 10000-20000, Envoy: embedded
Global Identity Range:   min 256, max 65535
Hubble:                  Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 90.24   Metrics: Ok
KubeProxyReplacement Details:
  Status:                 True
  Socket LB:              Enabled
  Socket LB Tracing:      Enabled
  Socket LB Coverage:     Full
  Devices:                ens192   192.168.221.125 fe80::250:56ff:fe92:e8a3 (Direct Routing)
  Mode:                   SNAT
  Backend Selection:      Random
  Session Affinity:       Enabled
  Graceful Termination:   Enabled
  NAT46/64 Support:       Disabled
  XDP Acceleration:       Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767)
  - LoadBalancer:   Enabled
  - externalIPs:    Enabled
  - HostPort:       Enabled
BPF Maps:   dynamic sizing: on (ratio: 0.002500)
  Name                          Size
  Auth                          524288
  Non-TCP connection tracking   73374
  TCP connection tracking       146749
  Endpoint policy               65535
  IP cache                      512000
  IPv4 masquerading agent       16384
  IPv6 masquerading agent       16384
  IPv4 fragmentation            8192
  IPv4 service                  65536
  IPv6 service                  65536
  IPv4 service backend          65536
  IPv6 service backend          65536
  IPv4 service reverse NAT      65536
  IPv6 service reverse NAT      65536
  Metrics                       1024
  NAT                           146749
  Neighbor table                146749
  Global policy                 16384
  Session affinity              65536
  Sock reverse NAT              73374
  Tunnel                        65536
Encryption:                                          Disabled