[BUG] Performance has decreased by three times

zjj1002 commented 10 months ago

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

DeepFlow Component

Agent

What you expected to happen

Our agent config file looks like this:

vtap_group_id: g-e7f3f8db93
tap_interface_regex: .*
process_threshold: 30
external_agent_http_proxy_enabled: 1
external_agent_http_proxy_port: 38086
static_config:
  ebpf:
    thread-num: 5
    on-cpu-profile:
      disabled: true
  l7-protocol-enabled:
  - HTTP ## for both HTTP and HTTP_TLS
  - MySQL
  - Redis
  - Kafka
l7_log_collect_nps_threshold: 100000
thread_threshold: 1000
l4_log_tap_types:
- 0
l7_log_packet_size: 1500
http_log_trace_id: traceparent,sw8,x-b3-traceid
http_log_span_id: traceparent, sw8,x-b3-traceid
http_log_proxy_client: 关闭  # "关闭" means "off"

When we started stress testing after deploying the DeepFlow agent in the cluster, we found that the average response time of our Java microservices degraded by about three times, from over 100 ms to 400 ms. We removed the protocols we are unlikely to use from the agent config and tried increasing the CPU and memory limits, but it did not help. We even tried setting l4_log_tap_types to -1, which did not help either. We are using the latest version, 6.4.3 (agent and server). Is this the expected performance of the agent?
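To make the size of the regression unambiguous, the two reported averages can be turned into a ratio and a percentage. This is only a small arithmetic sketch: the function name `slowdown` is made up for illustration, and the inputs are the rounded figures from the report (the baseline was "over 100 ms", so the real ratio is somewhat lower).

```python
def slowdown(baseline_ms: float, observed_ms: float) -> tuple[float, float]:
    """Return (ratio to baseline, percent increase over baseline)."""
    factor = observed_ms / baseline_ms
    percent_increase = (observed_ms - baseline_ms) / baseline_ms * 100.0
    return factor, percent_increase


# Rounded figures from the report: ~100 ms without the agent, 400 ms with it.
factor, pct = slowdown(100.0, 400.0)
print(f"{factor:.1f}x baseline, +{pct:.0f}%")  # prints "4.0x baseline, +300%"
```

With these rounded numbers the response time is about four times the baseline, which is an increase of roughly three times the baseline value.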

How to reproduce

No response

DeepFlow version

6.4.3

DeepFlow agent list

Daemonsets 7 pods

Kubernetes CNI

Antrea

Operation-System/Kernel version

4.1.19

Anything else

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

sharang commented 10 months ago

Hello, there are some questions about this issue that need to be clarified:

1. What tool did you use for the benchmark? We have used wrk2 (https://github.com/giltene/wrk2), but found that this tool distorts RT under extreme TPS pressure.
2. Which testing method did you use: A) fixed TPS, testing with and without deepflow-agent running; or B) testing the maximum TPS with and without deepflow-agent running? If it's the first method, please confirm that no single logical core ran at 100% during the test, as reaching 100% is likely to cause a significant degradation in RT. If it's the second method, it usually means that at least one core ran at 100%. We typically use the first method and ensure that no single core runs at 100% while deepflow-agent is running. For our testing method, please refer to: https://deepflow.io/blog/zh/030-deepflow-agent-ebpf-benchmark/
3. What was the TPS during the stress test?
4. During the test, what were the overall CPU usage and system load of the machine?
5. Can you confirm that your kernel version is 4.1.19? I'm concerned it might be a typo, as it looks like 4.19.
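One possible way to check the "no single logical core at 100%" condition on a Linux node is to sample /proc/stat twice during the test and compute per-core utilization from the difference. The following is a minimal sketch, not part of DeepFlow or its benchmark tooling: the function names, the 5-second sampling interval, and the 95% threshold are all assumptions chosen for illustration.

```python
import time


def parse_proc_stat(text: str) -> dict[str, tuple[int, int]]:
    """Return {core: (busy_jiffies, total_jiffies)} for each per-core 'cpuN' line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr" or "ctxt".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Columns: user nice system idle iowait irq softirq steal guest guest_nice.
        # guest time is already counted in user, so only the first 8 are summed.
        total = sum(values[:8])
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        cores[fields[0]] = (total - idle, total)
    return cores


def saturated_cores(before, after, threshold=0.95):
    """List (core, utilization) pairs whose utilization between samples >= threshold."""
    hot = []
    for name, (busy1, total1) in after.items():
        if name not in before:
            continue
        busy0, total0 = before[name]
        elapsed = total1 - total0
        if elapsed <= 0:
            continue
        utilization = (busy1 - busy0) / elapsed
        if utilization >= threshold:
            hot.append((name, utilization))
    return hot


if __name__ == "__main__":
    # Assumed usage: run this on the node while the stress test is in progress.
    with open("/proc/stat") as f:
        first = parse_proc_stat(f.read())
    time.sleep(5)  # sampling interval is an arbitrary choice
    with open("/proc/stat") as f:
        second = parse_proc_stat(f.read())
    for core, util in saturated_cores(first, second):
        print(f"{core} is near saturation: {util:.0%}")
```

Running it once with deepflow-agent and once without, at the same fixed TPS, shows whether a saturated core could explain the RT difference.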

sharang commented 3 months ago

Due to the lack of response for a long time, this issue will be closed.
