deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0

[BUG] Does deepflow-agent affect the performance of the application? #3190

Open kwenzh opened 1 year ago

kwenzh commented 1 year ago

DeepFlow Component

Agent

What you expected to happen

We deployed DeepFlow in our k8s cluster and found a performance degradation in the programs running inside it: latency increased and QPS dropped, mainly for HTTP services and MQ consumer tasks. Performance decreased by about 40%.

How to reproduce

make a test case

Then I did a simple test: I started an HTTP API and benchmarked it with the ab tool. Demo code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Minimal HTTP API (Python 2) used for the ab benchmark."""

from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
import urlparse
import random


class HTTPHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        print self.path
        res = urlparse.urlparse(self.path)
        param = urlparse.parse_qs(res.query)

        resp = ""
        for k, v in param.items():
            if k == "num":
                val = v[0]
                # append 10 random characters to the requested value
                rs = random.sample("xxxxsfsdffghehuwfnsajfisddjsddmidsa", 10)
                print k, v[0], rs
                resp = val + "".join(rs)

        self.send_response(200, "OK")
        self.end_headers()
        self.wfile.write(resp)


httpserver = HTTPServer(("0.0.0.0", 8001), HTTPHandler)

print ">>>>>>>>>> start"
httpserver.serve_forever()

Client test, e.g.: ab -n 2000 -c 10 http://k8s-ip:nodeport/get_num?num=22

Result without deepflow-agent:

Server Software:        BaseHTTP/0.3
Server Hostname:        10.65.138.101
Server Port:            32246

Document Path:          /get_num?num=22
Document Length:        12 bytes

Concurrency Level:      10
Time taken for tests:   0.458 seconds
Complete requests:      2000
Failed requests:        0
Write errors:           0
Total transferred:      206000 bytes
HTML transferred:       24000 bytes
Requests per second:    4363.35 [#/sec] (mean)
Time per request:       2.292 [ms] (mean)
Time per request:       0.229 [ms] (mean, across all concurrent requests)
Transfer rate:          438.89 [Kbytes/sec] received

Result with deepflow-agent running:


Document Path:          /get_num?num=22
Document Length:        12 bytes

Concurrency Level:      10
Time taken for tests:   0.628 seconds
Complete requests:      2000
Failed requests:        0
Write errors:           0
Total transferred:      206000 bytes
HTML transferred:       24000 bytes
Requests per second:    3186.86 [#/sec] (mean)
Time per request:       3.138 [ms] (mean)
Time per request:       0.314 [ms] (mean, across all concurrent requests)
Transfer rate:          320.55 [Kbytes/sec] received

It looks like QPS decreased by about 27% (from 4363.35 to 3186.86 requests per second). I know deepflow-agent uses eBPF technology.

DeepFlow version

deepflow version: v6.2.6, kernel version: 5.15.72, k8s version: v1.18.19

DeepFlow agent list

k8s cluster; each node has a deepflow-agent pod.

deepflow-ctl agent list

VTAP_ID  NAME                                 TYPE    CTRL_IP  CTRL_MAC           STATE   GROUP    EXCEPTIONS
2        dev-szdl-k8s-slave-5.novalocal-V9    K8S_VM  10.x     fe:fc:fe:03:45:ae  NORMAL  default
3        dev-szdl-k8s-slave-7.novalocal-V1    K8S_VM  10.x     fe:fc:fe:68:15:78  NORMAL  default
4        dev-szdl-k8s-slave-6.novalocal-V10   K8S_VM  10.x     fe:fc:fe:5c:b5:0f  NORMAL  default
5        dev-szdl-k8s-slave-1.novalocal-V7    K8S_VM  10.x     fe:fc:fe:71:51:77  NORMAL  default
6        dev-szdl-k8s-slave-4.novalocal-V2    K8S_VM  10.x     fe:fc:fe:59:5f:6a  NORMAL  default
7        dev-szdl-k8s-slave-3.novalocal-V3    K8S_VM  10.x     fe:fc:fe:6f:fa:eb  NORMAL  default
8        dev-szdl-k8s-slave-2.novalocal-V8    K8S_VM  10.x     fe:fc:fe:43:0a:0b  NORMAL  default

Kubernetes CNI

calico

Operating System / Kernel version

5.15.72 (5.15.72-1.sdc.el7.elrepo.x86_64)

Anything else

I know deepflow-agent uses eBPF technology, so I would like to confirm whether it affects the Linux kernel's network forwarding performance or the CPU performance of programs in the cluster, for example CPU scheduling and network forwarding.

In another test, against an HTTP POST API, I compared running with and without deepflow-agent: when deepflow-agent is running, the QPS drops from 5000+ to 2000+, almost -50%.

image2023-5-18_20-3-42

kwenzh commented 1 year ago

Discussion: https://github.com/orgs/deepflowio/discussions/3183

Nick-0314 commented 1 year ago

Hello, eBPF does have some performance overhead; I talked about it in my last live broadcast. We are currently sorting through the data and will have a first version publicly available soon. Also, can you try disabling eBPF, or just the eBPF uprobe? You can also add the WeChat account at the bottom of the README; let's communicate on WeChat.

kwenzh commented 1 year ago

disabling eBPF, or just the eBPF uprobe

OK, thank you. Will disabling the eBPF probe have any effect, for example on the network topology monitoring capabilities?

Nick-0314 commented 1 year ago

Network topology is not affected, but distributed tracing is.

Nick-0314 commented 1 year ago

https://mp.weixin.qq.com/s/oNrTG4ExNOvwV6luPaC4zA

We have some performance test data for reference, and you can also try turning off only the eBPF uprobe and testing again. @kwenzh

dirtyren commented 1 year ago

Hey guys,

I was doing some performance tests and I think the deepflow-agent is impacting the throughput and requests per second on a K8s cluster. How the test was done: I used a pod running a k6 script, from the K8s cluster where the agent runs, against an nginx server running on a VM. All tests ran DeepFlow 6.3.5.
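
For context, a run of that shape could look roughly like the following; the script name, VU count, and duration are illustrative assumptions, not values from the tests above.

# assumption: nginx-load.js is a k6 script that targets the nginx VM's URL
k6 run --vus 10 --duration 60s nginx-load.js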

This is the result WITH the deepflow-agent running. As the image shows, the script peaked at 11.37 req/s.

image

This is the result WITHOUT the deepflow-agent running, for the same script on the same cluster; I only deleted the DaemonSet. As you can see, the test was able to reach 19.1k requests.

image

I ran this test twice, with and without the agent, and the results were the same: very similar requests per second in each repetition. I am running it now for the third time and will post the results here in a few moments.

With agent running on K8S - test redo image

Without agent running on K8S - test redo image

With agent running on K8S - test redo 2 image

Last 3 tests compared: here is an overview of the last 3 tests. As we can see, the response time increased considerably when the deepflow-agent was running. (image)

I hope this helps.

kwenzh commented 1 year ago

Here is an overview of the last 3 tests. As we can see, the response time increased considerably when the deepflow-agent was running.

Yes, same here. I tried adjusting the deepflow-agent parameters and it got a little better; maybe you can try it (https://deepflow.io/docs/zh/install/advanced-config/agent-advanced-config/). The config I used is below, followed by a sketch of how it can be applied:

vtap_group_id: g-d32cd8e4ef
capture_packet_size: 2048
static_config:
  ebpf:
    disabled: true
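
For reference, this kind of agent-group config is applied through deepflow-ctl; the flow below is an assumption based on the advanced-config docs linked above, so check deepflow-ctl agent-group-config -h for the exact subcommands and flags on your version (the file name is illustrative):

# dump an example config, edit it (e.g. set static_config.ebpf.disabled: true), then attach it to the agent group
deepflow-ctl agent-group-config example > agent-group-config.yaml
deepflow-ctl agent-group-config create -f agent-group-config.yaml
# if the group already has a config, update it instead
deepflow-ctl agent-group-config update g-d32cd8e4ef -f agent-group-config.yaml
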
Nick-0314 commented 1 year ago

@dirtyren Alternatively, you can try shutting down the eBPF uprobe and testing again

vtap_group_id: g-d32cd8e4ef
capture_packet_size: 2048
static_config:
  ebpf:
    uprobe-process-name-regexs:
      golang-symbol: ""
      golang: ""
      openssl: ""

dirtyren commented 1 year ago

@dirtyren Alternatively, you can try shutting down the eBPF uprobe and testing again

vtap_group_id: g-d32cd8e4ef
capture_packet_size: 2048
static_config:
  ebpf:
    uprobe-process-name-regexs:
      golang-symbol: ""
      golang: ""
      openssl: ""

I applied this config using deepflow-ctl, but the dashboards are still showing eBPF sources in the last 5 minutes, and the performance test yields the same results:

vtap_group_id: g-3c66e436c9
log_level: ERROR
tap_interface_regex: '^(tap.*|gke.*|cali.*|veth.*|eth.*|en[ospx].*|lxc.*|lo|[0-9a-f]+_h)$'
external_agent_http_proxy_enabled: 1   # required
external_agent_http_proxy_port: 38086  # optional, default 38086
capture_packet_size: 2048
static_config:
  ebpf:
    uprobe-process-name-regexs:
      golang-symbol: ""
      golang: ""
      openssl: ""

dirtyren commented 1 year ago

I think capture_packet_size: 2048 solved my problem; the metrics are very similar with or without the deepflow-agent running.

image

kwenzh commented 7 months ago

I think capture_packet_size: 2048 solved my problem; the metrics are very similar with or without the deepflow-agent running.

image

Yes, adjusting capture_packet_size: 2048 helps.