deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0
2.97k stars 333 forks source link

[BUG] Redis unpacking exception when delay is seconds #4671

Open rosenlo opened 1 year ago

rosenlo commented 1 year ago

Search before asking

DeepFlow Component

Agent

What you expected to happen

Theoretically, Redis is not prone to second-level response, but when a second-level response occurs, we hope that the unpacking is correct.

How to reproduce

We found out that redis had many second-level delays and were curious, so we captured and analyzed the packets and found that the second-level data req_tcp_seq and resp_tcp_seq did not match, as shown in the following figure:

The l7 flow log

image

The request packet

image

The wrong response packet

image

The correct response packet

image

By the way, the millisecond response unpacking is correct, only appears in the second response.

DeepFlow version

Name: deepflow-server community edition
Branch: master
CommitID: 21e985d71c41021c86c57a1cbf7a8aa7a1890a33
RevCount: 9710
Compiler: go version go1.19.10 linux/amd64
CompileTime: 2023-11-06 13:22:03
Name: deepflow-agent community edition
Branch: master
CommitId: 5c7b557ab2540c03d023cf8b658ee265e2cc3722
RevCount: 9347
Compiler: rustc 1.72.0 (5680fa18f 2023-08-23)
CompileTime: 2023-09-21 13:14:47

DeepFlow agent list

No response

Kubernetes CNI

aws-vpc-cni

Operation-System/Kernel version

Ubuntu 22.04.3 LTS
5.10.192-183.736.amzn2.x86_64

Anything else

No response

Are you willing to submit a PR?

Code of Conduct

sharang commented 11 months ago

This issue seems to have been unaddressed for a long time, and I'm not sure if there is still an environment available for investigation.

From your description, this phenomenon seems very strange. Since Redis is a serial protocol, a response should always follow a request. My guess is as follows:

The actual sequence of packets might be:

#18441: Redis Request for query#1, TCP Seq = 1050818855
#18443: Redis Response for query#1, TCP Seq = 2252530236       // the correct response
#?????: Redis Request for query#2, TCP Seq = ??
#20521: Redis Response for query#2, TCP Seq = 2252531121        // the wrong response

I think it might be possible that due to some reason, such as CPU or memory shortage, the deepflow-agent did not capture the two middle packets, causing the first and fourth packets to be incorrectly matched as one query.

If this situation still occurs, you can investigate as follows:

  1. Use resp_tcp_seq = 2252530236 (i.e., the correct response's TCP Seq) to query in l7_flow_log to see if there are any results.
  2. Check Grafana Dashboard - DeepFlow System - DeepFlow Agent - Performance (Packet Capturing from Kernel) - Packet Drops in Kernel (Agent Can't Process)

Also, by the way, what do you mean by second-level response?