Closed: Chrisbattarbee closed this issue 6 months ago.
Let me know if you need any help reproducing by the way @taloric, I'm around to help
@Chrisbattarbee Hello, thanks for the issue. I've checked the manifest.yaml in the repo and it was simple to reproduce.
In fact, there are some issues in this distributed trace, but they are not caused by TCP connection re-use: we found that the ingress request and the egress request in Microservice B are handled on different threads, which makes the linking fail. However, we also found that the responses captured on egress and ingress are on the same thread. We'll open a pull request to fix that soon, and then I'll link this issue to it.
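To make the cross-thread pattern concrete, here is a minimal hypothetical Python sketch (not the demo's actual code; the handler, the worker pool, and the downstream URL are all assumptions): the ingress request is accepted on one thread, while the egress call to the next service runs on a worker-pool thread, so the two requests carry different kernel thread ids.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

import requests

# Worker pool used for downstream calls (assumption, for illustration only).
pool = ThreadPoolExecutor(max_workers=4)

def call_downstream(payload: dict) -> str:
    # The egress request runs on a pool thread, not on the thread that accepted the ingress request.
    print("egress request thread:", threading.get_native_id())
    resp = requests.get("http://microservice-c/work", params=payload)  # hypothetical URL
    return resp.text

def handle_ingress(payload: dict) -> str:
    # The ingress request is observed on the accepting thread...
    print("ingress request thread:", threading.get_native_id())
    # ...while the egress call happens on a different (pool) thread.
    return pool.submit(call_downstream, payload).result()
```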
Hey @taloric so I took a look at the threads making requests and I found the following:
Poller client: request thread 4075169, response thread 4075169
A server: request thread 4075133, response thread 4075186
A client: request thread 4075186, response thread 4075186
B server: request thread 1548971, response thread 1548989
B client: request thread 1548989, response thread 1548989
C server: request thread 4075100, response thread 4075177
So that looks to me like the following as a diagram:
It looks like the threading structure is the same across A and B, so I'm not sure why we end up with a broken trace if there's no difference in the threading structure and TCP connection re-use is not the issue?
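For reference, one way to capture comparable numbers from inside each Python service (a sketch for cross-checking, not necessarily how the ids above were collected) is to log the kernel-level thread id:

```python
import threading

def log_thread(label: str) -> None:
    # threading.get_native_id() returns the kernel thread id (Python 3.8+),
    # the same kind of id shown in the list above (e.g. 4075169).
    print(f"{label}: native thread id {threading.get_native_id()}")
```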
@Chrisbattarbee In fact, in the non-re-use scenario, every request from Microservice B to Microservice C is preceded by a DNS request, so we link B with C through that DNS request.
But in the re-use scenario this link is dropped, because there is only a single HTTP request with no DNS query. That behaviour was added for robustness, but it may be too strict for this scenario.
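For illustration, here is a minimal Python sketch of that difference (the REUSE_SESSION environment variable comes from the repro repo's README, but the hostname and the rest of the code are assumptions, not the demo's actual code). With a long-lived requests.Session the TCP connection is re-used and DNS is resolved only once, so later requests have no preceding DNS query for the agent to link B's egress with C:

```python
import os

import requests

# REUSE_SESSION is the toggle described in the repro repo's README.
REUSE_SESSION = os.environ.get("REUSE_SESSION", "true").lower() == "true"

# One long-lived Session keeps the TCP connection open, so DNS is resolved once.
session = requests.Session() if REUSE_SESSION else None

def call_microservice_c(path: str) -> requests.Response:
    url = f"http://microservice-c{path}"  # hypothetical downstream hostname
    if session is not None:
        # Re-used connection: no DNS query precedes this request,
        # so there is nothing for the DNS-based linking to latch onto.
        return session.get(url)
    # Fresh connection per call: a DNS lookup happens right before the HTTP
    # request, and that DNS flow is what linked B's egress to C.
    return requests.get(url)
```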
@taloric Gotcha. So at my basic level of understanding, is the following true: "For all requests started from a different thread than the receiving thread, if there is no dns query made before the request, the traces will not be linked?"
For what it's worth, I think I've seen this break a few other situations too: the Go checkout service in the otel demo creates long-lived clients, which seem to have the same behaviour, so the traces were broken there as well. That's what made me investigate this. https://github.com/open-telemetry/opentelemetry-demo/blob/main/src/checkoutservice/main.go#L169-L198
@Chrisbattarbee The main reason the trace is broken is that the ingress request-response and the egress request-response are handled on two different threads. However, the scenario in your demo is a bit special. We found that the thread structure in each process is as follows (please @taloric confirm): the ingress request is on one thread, while the egress request, egress response, and ingress response all share a second thread, C.
In this case, we can use thread C to associate the entire trace. However, there was a piece of logic in our code that could be deprecated, and it blocked this association; we have just deleted it in PR #6386. Of course, this scenario working is just a coincidence: if there is no overlap at all between the ingress and egress threads, we cannot complete the tracing.
However, the Golang scenario you mentioned is a bit different. Since the Go runtime multiplexes goroutines onto kernel threads, the threads seen by the kernel are effectively meaningless for tracing. We have done some work to associate goroutines using eBPF uprobes (by tracking the creation relationships between goroutines, we associate them and construct a pseudo-thread). However, this association mechanism currently has shortcomings for long-lived goroutines (our default configuration parameter is 120s), so it cannot perfectly solve the TCP connection re-use scenario.
For all requests started from a different thread than the receiving thread, if there is no dns query made before the request, the traces will not be linked?
Yes. This is because the code we deprecated (in #6386) only did special handling for HTTP/1, and DNS does not fall into that code path, so the tracing stays complete. However, I have not yet checked whether the thread structure differs between short TCP connections and the TCP connection re-use scenario. If there is no difference, then the reason is the one I mentioned above.
Gotcha. On the thread structure: assuming that by ingress you mean server and by egress you mean client in the DeepFlow spans, we found the following:
ingress request: thread A
egress request: thread B
egress response: thread B
ingress response: thread B
In this case A is 4075133 and B is 4075186. @sharang
ingress request: thread A egress request: thread B egress response: thread B ingress response: thread B
ok, got it
New patch is working!
Search before asking
DeepFlow Component
Agent
What you expected to happen
Based on the conversation here: https://discord.com/channels/1069828209773907978/1069828210419838998/1238030162902384711, DeepFlow should be able to trace requests through a re-used TCP connection.
Based on my experiments, I can't get it to do this.
There's a full write-up of the expected behaviour here: https://github.com/metoro-io/metoro-observability/tree/bd0f0bcbfc6ffe960efee9ad5c796fa0f47215b0/test/demos/python_microservices
(Copying it out)
Poller -> MicroServiceA -> MicroServiceB -> MicroServiceC
DeepFlow is able to trace the calls when TCP reuse is disabled, but when it is enabled, the calls are not traced.
By default, TCP reuse is enabled; to disable it, set the environment variable REUSE_SESSION to false when running the services.
To run the example with TCP reuse disabled (DeepFlow working), run the following:
Example correct trace:
To run the example with TCP reuse enabled between microservice-b and microservice-c (DeepFlow not working), run the following:
Example broken trace:
(the call between microservice-b and microservice-c is not linked into the trace)
How to reproduce
https://github.com/metoro-io/metoro-observability/tree/bd0f0bcbfc6ffe960efee9ad5c796fa0f47215b0/test/demos/python_microservices contains a README with a full example of the expected behaviour, the actual behaviour, and how to reproduce.
DeepFlow version
DeepFlow agent list
Kubernetes CNI
Amazon VPC CNI (It's an EKS cluster)
Operation-System/Kernel version
Anything else
This happens every time for me, reproducibly.
Are you willing to submit a PR?
Code of Conduct