deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0

[BUG] Performance has decreased by three times #4919

Closed zjj1002 closed 3 months ago

zjj1002 commented 10 months ago

DeepFlow Component

Agent

What you expected to happen

Our agent config file looks like this:

    vtap_group_id: g-e7f3f8db93
    tap_interface_regex: .*
    process_threshold: 30
    external_agent_http_proxy_enabled: 1
    external_agent_http_proxy_port: 38086
    static_config:
      ebpf:
        thread-num: 5
        on-cpu-profile:
          disabled: true
      l7-protocol-enabled:

When we started stress testing after deploying the deepflow agent in the cluster, we found that the performance of our Java microservices dropped by about three times: the average response time rose from just over 100 ms to 400 ms. We removed the protocols we are unlikely to use from the agent config and tried increasing the CPU and memory limits, but it did not help. We even tried setting l4_log_tap_types: -1, which did not help either. We are using the latest version, 6.4.3 (agent/server). Is this the expected performance overhead of the agent?

How to reproduce

No response

DeepFlow version

6.4.3

DeepFlow agent list

Daemonsets 7 pods

Kubernetes CNI

Antrea

Operation-System/Kernel version

4.1.19

Anything else

No response

sharang commented 10 months ago

Hello, there are some questions about this issue that need to be clarified:

  1. What tool did you use for the benchmark? We have used wrk2 (https://github.com/giltene/wrk2), but found that this tool distorts RT under extreme TPS pressure.
  2. Which testing method did you use: A) fixed TPS, tested with and without deepflow-agent running; or B) maximum TPS, measured with and without deepflow-agent running? If it was method A, please confirm that no single logical core ran at 100% during the test, since a saturated core is likely to cause a significant degradation in RT. If it was method B, a single core almost certainly ran at 100%. We typically use method A and make sure that no single core runs at 100% while deepflow-agent is running. For our testing method, please refer to: https://deepflow.io/blog/zh/030-deepflow-agent-ebpf-benchmark/
  3. What was the TPS during the stress test?
  4. During the test, what were the overall CPU usage and system load of the machine?
  5. Can you confirm that your kernel version is 4.1.19? I'm concerned it might be a typo, as it looks like 4.19.
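
For points 2 and 4, one rough way to check whether any single logical core is saturated is to diff two snapshots of /proc/stat taken during the test. The following is only a minimal sketch in Python, not part of DeepFlow: the function names and the 95% threshold are assumptions chosen for illustration, and tools such as mpstat or top report the same information.

```python
"""Flag logical cores that look saturated between two /proc/stat snapshots."""
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Use user..steal only; guest time is already counted inside user/nice.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values)
        cores[fields[0]] = (total - idle, total)
    return cores


def busy_percent(before, after):
    """Per-core busy percentage over the interval between two snapshots."""
    usage = {}
    for name, (busy1, total1) in after.items():
        if name not in before:
            continue
        busy0, total0 = before[name]
        delta_total = total1 - total0
        if delta_total > 0:
            usage[name] = 100.0 * (busy1 - busy0) / delta_total
    return usage


def saturated_cores(usage, threshold=95.0):
    """Names of cores at or above the threshold (95% is an assumed cut-off)."""
    return sorted(name for name, pct in usage.items() if pct >= threshold)


def sample_live(interval=1.0):
    """Take two snapshots of /proc/stat on a Linux host, `interval` seconds apart."""
    with open("/proc/stat") as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        second = parse_proc_stat(f.read())
    return busy_percent(first, second)
```

Calling `saturated_cores(sample_live())` on the node under test, once with and once without deepflow-agent running, would show whether a single core is pinned even when overall CPU usage looks moderate.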

sharang commented 3 months ago

As there has been no response for a long time, this issue will be closed.