DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
586 stars 290 forks source link

Header injection not working for spring webclient #3366

Closed vsingal-p closed 1 year ago

vsingal-p commented 2 years ago

I am facing an issue with using auto instrumentation with spring webflux + webclient. Headers are not being propagated to downstream call.

In the following commit, propagate() line was removed. I suspect this could be the cause. However I couldn't verify as I am not able to build dd-java-agent locally with changes. https://github.com/DataDog/dd-trace-java/pull/1876/files#diff-b53db5fbb0c565c82f99286c8d966b127e12ba58525f4a2fdf69a4b01832f902L55

Versions spring-webflux : 5.3.13 dd-java-agent: 0.88.0

charliegracie commented 2 years ago

Is it possible to provide an example of how you are making the webclient call?

Also, I am curious if the following option improves the situation: -Ddd.integration.reactor-hooks.enabled=true

vsingal-p commented 2 years ago

I am uploading a sample project - https://github.com/vsingal-p/dd-java-agent-webclient-issue-demo

Here are the jvm args

-javaagent:dd-java-agent.jar -Ddd.propagation.style.inject=B3 -Ddd.propagation.style.extract=B3 -Ddd.integration.reactor-hooks.enabled=true

Here's the request I make

curl --location --request GET 'http://localhost:8080/demo' \
--header 'x-b3-traceid: 3469035142037265' \
--header 'x-b3-spanid: 803698824837'

I am facing 2 issues -

devinsba commented 2 years ago

The first issue is because b3 IDs need to be 16 characters[^1] hex encoded. The equivalents of what you posted above are: 000C5311AF5ABF11, 000000BB202ED685

We will have to look into the second

[^1]: Trace ID can be either 16 or 32

vsingal-p commented 2 years ago

Okay, thanks. I was skeptical but couldn't find any sample

Here's the project - https://github.com/vsingal-p/dd-java-agent-webclient-issue-demo ~For some reason I cannot edit my previous comment~ Updated original as well

vsingal-p commented 2 years ago

The hex strings worked (partially). Using them I am able to get same dd.trace_id in all calls made to server. However I get different dd.span_id for each call. Is that expected? Shouldn't it use x-b3-spanid ?

devinsba commented 2 years ago

Good catch, x-b3-span ID in our implementation is used as the parent ID for the server span

vsingal-p commented 2 years ago

Any reason we don't support all 3 headers in dd-java-agent?

x-b3-traceid
x-b3-spanid
x-b3-parentspanid
devinsba commented 2 years ago

We don't support the zipkin shared spans concept, in zipkin the "transport" span is shared by the client and server, which is why spanid and parent are needed so that the server side copy of the span knows it's parent. Since we don't share spans, we only need the span ID from the client side and use that as the server side parent ID.

Ref: https://github.com/openzipkin/brave/blob/8035d01b358bbda1576b19530165a78108c138cf/brave/src/test/java/brave/TracerTest.java#L213

vsingal-p commented 2 years ago

Gotcha! Makes sense. I was just getting easy with sleuth+brave, so found this change in datadog a bit weird.

Summarising the issues till now -

devinsba commented 2 years ago

You can configure sleuth the have the same behavior by using spring.sleuth.supports-join=false in case you need to run in a mixed environment

vsingal-p commented 2 years ago

Sorry for barging in heavy, but facing one more 😓

When I make 2 webclient calls one after the other, dd.span_id changes in MDC. span_id for a server call should not change throughout the call I guess.

2022-02-07 22:05:14,063 thread=[reactor-http-nio-4] level=INFO logger=WebClientApi dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=8160433720268064940 

GET:

URL:https://google.com:

Headers:[]

Attributes:{org.springframework.web.reactive.function.client.WebClient.uriTemplate=https://google.com, org.springframework.web.reactive.function.client.ClientRequest.LOG_ID=ccedc68}

-2022-02-07 22:05:15,958 thread=[reactor-http-nio-4] level=INFO logger=WebClientApi dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=8449475067196755785 

GET:

URL:https://google.com:

Headers:[]

Attributes:{org.springframework.web.reactive.function.client.WebClient.uriTemplate=https://google.com, org.springframework.web.reactive.function.client.ClientRequest.LOG_ID=30bfd31d}
devinsba commented 2 years ago

Ah, OK. I see now what you are seeing. The reason there aren't headers in the webflux request object is that we add them in the netty instrumentation. If you send the request to https://httpbin.org/get and print the response you should see headers in the payload that was sent

vsingal-p commented 2 years ago

I see. Interesting! I have been scratching my head around this till now 😲 So the main concern for which this issue was raised is working fine 👍

Remaining open concerns -

~Also is there any way we can see the spanId for the client call being made in logs?~ (DD does not share parent-child span so no span_id is generated for client call)

vsingal-p commented 2 years ago

@devinsba thoughts on above issues?

devinsba commented 2 years ago

Span ID on client side becomes Parent Span ID on server side.

Each webclient call will get it's own Span ID so the MDC will be different.

As for logging the span ID for the client call, you probably need to turn on request logging for netty. I'm not sure the best way to do so

vsingal-p commented 2 years ago

Span ID on client side becomes Parent Span ID on server side.

Got it. so dd.span_id in MDC should correspond to this parent span, right? so that all the logs within that server call can be correlated

Each webclient call will get it's own Span ID so the MDC will be different.

I meant dd.span_id in server logs changes upon multiple webclient calls. That shouldn't happen. (resonates with above comment)

server receives call with x-b3-traceid=abc123456def x-b3-spanid=zzz00000www
log.info("Log1")
log.info("Log2")
webclient.downstreamCallOne()
log.info("Log3")
log.info("Log4")
webclient.downstreamCallOne()

Above snippet generates logs

logger=LoggingServerStartInterceptor level=info dd.trace_id=abc123456def dd.span_id=xxx12345yyy message=Server call start
logger=demo level=info dd.trace_id=abc123456def dd.span_id=xxx12345yyy message=Log1
logger=demo level=info dd.trace_id=abc123456def dd.span_id=xxx12345yyy message=Log2
logger=webclientInstance level=info dd.trace_id=abc123456def dd.span_id=xxx67890yyy message=call succeeded
logger=demo level=info dd.trace_id=abc123456def dd.span_id=xxx67890yyy message=Log1
logger=demo level=info dd.trace_id=abc123456def dd.span_id=xxx67890yyy message=Log1
logger=webclientInstance level=info dd.trace_id=abc123456def dd.span_id=xxx334455yyy message=call succeeded
logger=LoggingServerFinishInterceptor level=info dd.trace_id=abc123456def dd.span_id=xxx334455yyy message=Server call complete

I believe dd.span_id should remain same throughout the server call? which should be xxx12345yyy

devinsba commented 2 years ago

Ok, that does sound like a bug. If you can create a reproducer then we can look at making a failing test to work against

vsingal-p commented 2 years ago

Sure, let me work something with the demo app I created above to reproduce this

vsingal-p commented 2 years ago

Updated the project with a sample code - https://github.com/vsingal-p/dd-java-agent-webclient-issue-demo.git

jvm args

-javaagent:dd-java-agent.jar -Ddd.propagation.style.inject=B3 -Ddd.propagation.style.extract=B3 -Ddd.integration.reactor-hooks.enabled=true

Request

curl --location --request GET 'http://localhost:8080/demo' \
--header 'x-b3-traceid: 000C5311AF5ABF11' \
--header 'x-b3-spanid: 000000BB202ED685'

Logs

2022-02-09 01:24:12,130 thread=[reactor-http-nio-4] level=INFO logger=LoggingServerInterceptor dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=7662802724461287617 Server call starts
-2022-02-09 01:24:12,679 thread=[reactor-http-nio-4] level=INFO logger=DemoController dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=7662802724461287617 Log1
-2022-02-09 01:24:12,679 thread=[reactor-http-nio-4] level=INFO logger=DemoController dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=7662802724461287617 Log2
-2022-02-09 01:24:15,229 thread=[reactor-http-nio-4] level=INFO logger=WebClientApi dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=4419075444443843519 webclient call complete
-2022-02-09 01:24:15,295 thread=[reactor-http-nio-4] level=INFO logger=DemoController dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=7662802724461287617 Log3
-2022-02-09 01:24:15,295 thread=[reactor-http-nio-4] level=INFO logger=DemoController dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=7662802724461287617 Log4
-2022-02-09 01:24:16,369 thread=[reactor-http-nio-4] level=INFO logger=WebClientApi dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=4451157002518768285 webclient call complete
-2022-02-09 01:24:16,398 thread=[reactor-http-nio-4] level=INFO logger=LoggingServerInterceptor dd.trace_id=3469035142037265, dd.service=com.example.demo.DdJavaAgentWebclientDemoApplicationKt, dd.span_id=994518264780377097 Server call complete

There are different from the example I gave above. I am trying to reproduce that scenario, in the meanwhile above should also work as seems like it also has some discrepancies

vsingal-p commented 2 years ago

The original example I gave was from a springboot + grpc + kotlin app. It's based on my organisation's SDK and it's a bit difficult to replicate the same. Let me know if above example (springboot + reactor + kotlin) is not sufficient. I'll try to create a springboot-grpc-kotlin demo in that case

vsingal-p commented 1 year ago

Any update on this? We are still facing the issue in kotlin+coroutines application. I see other open issues as well https://github.com/DataDog/dd-trace-java/issues/1123 and https://github.com/DataDog/dd-trace-java/issues/931

Can we get this picked up? My org has shifted to kotlin+coroutines for all our apps and this issue is a major blocker

vsingal-p commented 1 year ago

^ Kotlin integration support has been added which has fixed this issue