coroot / coroot-node-agent

A Prometheus exporter based on eBPF that gathers comprehensive container metrics
https://coroot.com/docs/metrics/node-agent
Apache License 2.0
311 stars 55 forks

[Advice] Provide performance testing documents and data #42

Closed tanjunchen closed 9 months ago

tanjunchen commented 10 months ago

How much additional L4 and L7 network latency does coroot-node-agent introduce? What impact do the eBPF-based features (application topology, tracing, etc.) have on the monitored application? Could the official documentation provide benchmark data?

In our testing, with coroot-node-agent enabled the p90 network latency increased by 6460us and QPS decreased by about 50%, as shown in the flame graphs below.


Why is the value of #define MAX_PAYLOAD_SIZE set to 1024? Is the L7 processing logic here expensive, and why was 1024 bytes chosen?

The performance test procedure:

  1. Deploy Coroot according to the documentation on the Coroot website.
    ➜  ebpf-performance kubectl -n coroot get pod -owide
    NAME                                              READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
    coroot-68d887b548-4fhkn                           1/1     Running   0          16d   10.2.2.10    192.168.1.14   <none>           <none>
    coroot-clickhouse-shard0-0                        1/1     Running   0          16d   10.2.2.54    192.168.1.14   <none>           <none>
    coroot-kube-state-metrics-597cfdc9f5-pjvxm        1/1     Running   0          16d   10.2.2.209   192.168.1.14   <none>           <none>
    coroot-node-agent-6wshb                           1/1     Running   0          16d   10.2.2.219   192.168.1.14   <none>           <none>
    coroot-node-agent-cfsfx                           1/1     Running   0          16d   10.2.1.124   192.168.1.20   <none>           <none>
    coroot-node-agent-rt8hk                           1/1     Running   0          16d   10.2.0.110   192.168.1.24   <none>           <none>
    coroot-opentelemetry-collector-6659857566-nw4m4   1/1     Running   0          40h   10.2.2.160   192.168.1.14   <none>           <none>
    coroot-prometheus-server-669b7ccbb6-jfvzn         2/2     Running   0          16d   10.2.2.216   192.168.1.14   <none>           <none>
    coroot-pyroscope-6fb8fc4db-l5df5                  1/1     Running   0          16d   10.2.2.102   192.168.1.14   <none>           <none>
    coroot-pyroscope-ebpf-6c6wx                       1/1     Running   0          16d   10.2.0.54    192.168.1.24   <none>           <none>
    coroot-pyroscope-ebpf-dj6c6                       1/1     Running   0          16d   10.2.2.61    192.168.1.14   <none>           <none>
    coroot-pyroscope-ebpf-tjkcq                       1/1     Running   0          16d   10.2.1.59    192.168.1.20   <none>           <none>
  2. Deploy the performance-test client and server; the load is generated with taskset -c 0-1 wrk -t 2 -c 4 -d 60s http://&lt;server-ip&gt; --latency.
    ➜  ebpf-performance kubectl -n cilium-test get pod -owide
    NAME                    READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
    nginx-b89648f96-2bz7r   1/1     Running   0          11m   10.2.2.57    192.168.1.14   <none>           <none>
    wrk-58fb8c49ff-d7p2c    1/1     Running   0          33m   10.2.1.161   192.168.1.20   <none>           <none>
  3. Run the performance test with the client and server. Result without coroot-node-agent:
    [root@wrk-58fb8c49ff-s4g8b /]# taskset -c 0-1 wrk -t 2 -c 4 -d 60s http://172.16.2.191 --latency
    Running 1m test @ http://172.16.2.191
    2 threads and 4 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   286.99us  357.21us  12.03ms   96.71%
    Req/Sec     8.22k     1.90k   16.70k    89.92%
    Latency Distribution
     50%  235.00us
     75%  252.00us
     90%  297.00us
     99%    2.23ms
    982111 requests in 1.00m, 796.08MB read
    Requests/sec:  16366.99
    Transfer/sec:     13.27MB

    Result with coroot-node-agent:

    [root@wrk-58fb8c49ff-d7p2c /]# taskset -c 0-1 wrk -t 2 -c 4 -d 60s http://10.2.2.57 --latency
    Running 1m test @ http://10.2.2.57
    2 threads and 4 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.78ms    7.83ms 182.95ms   92.91%
    Req/Sec     3.22k     1.58k    8.56k    60.55%
    Latency Distribution
     50%  394.00us
     75%    1.43ms
     90%    7.29ms
     99%   33.57ms
    384280 requests in 1.00m, 311.49MB read
    Requests/sec:   6396.37
    Transfer/sec:      5.18MB
  4. The test environment:
    OS: Ubuntu 20.04 LTS amd64 (64-bit)
    CRI: containerd 1.6.20
    Kubernetes version: 1.24.4
    Kernel version: 5.4.0-139-generic
  5. The YAML manifests for the client and server:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: wrk
    spec:
      selector:
        matchLabels:
          run: wrk
      replicas: 1
      template:
        metadata:
          labels:
            run: wrk
        spec:
          initContainers:
          - name: setsysctl
            image: xxx/busybox:latest
            securityContext:
              privileged: true
            command:
            - sh
            - -c
            - |
              sysctl -w net.core.somaxconn=65535
              sysctl -w net.ipv4.ip_local_port_range="1024 65535"
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w fs.file-max=1048576
          containers:
          - name: wrk
            image: xxx/wrk:4.2.0
            ports:
            - containerPort: 80
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      replicas: 1
      minReadySeconds: 0
      strategy:
        type: RollingUpdate # strategy: rolling update
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            sidecar.istio.io/inject: "false"
            app: nginx
        spec:
          restartPolicy: Always
          initContainers:
            - name: setsysctl
              image: xxx/busybox:latest
              securityContext:
                privileged: true
              command:
                - sh
                - -c
                - |
                  sysctl -w net.core.somaxconn=65535
                  sysctl -w net.ipv4.ip_local_port_range="1024 65535"
                  sysctl -w net.ipv4.tcp_tw_reuse=1
                  sysctl -w fs.file-max=1048576
          containers:
            - name: nginx
              image: xxx/nginx:1.14.2
              imagePullPolicy: Always
              ports:
                - containerPort: 80
              command:
                - /bin/sh
                - -c
                - "cd /usr/share/nginx/html/ && dd if=/dev/zero of=1k bs=1k count=1 && dd if=/dev/zero of=100k bs=1k count=100 && nginx -g \"daemon off;\""
def commented 10 months ago

eBPF limitations make it challenging to implement full L7 parsing on the kernel side. The 1024-byte payload size was chosen to capture enough data to parse L7 protocols in userland.

def commented 9 months ago

I've implemented several performance optimizations using your benchmark approach.

taskset -c 1-2 wrk -t 2 -c 4 -d 60s http://10.42.0.9:80/ --latency
Running 1m test @ http://10.42.0.9:80/
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   189.95us  129.27us   9.60ms   96.94%
    Req/Sec    10.62k     1.11k   15.71k    72.67%
  Latency Distribution
     50%  171.00us
     75%  207.00us
     90%  259.00us
     99%  414.00us
  1268200 requests in 1.00m, 1.00GB read
Requests/sec:  21136.58
Transfer/sec:     17.13MB

The remaining ~9% degradation in request throughput can be attributed to the agent's CPU consumption, which reaches 30% of one CPU core. I expect that without competition for CPU time the degradation would be much smaller. At the eBPF level, the kernel ensures that the observer program does not introduce significant additional latency.

@tanjunchen thank you for bringing up this topic

def commented 9 months ago

We've added the benchmark results to the documentation: https://coroot.com/docs/coroot-community-edition/getting-started/performance-impact