DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0

Traces not captured for webflux netty.request when request is canceled #3503

Open deisenzimmer opened 2 years ago

deisenzimmer commented 2 years ago

We use Spring WebFlux to develop reactive Java APIs. Our APIs are called through an AWS load balancer with a 30-second timeout. Traces work great under normal circumstances. Some of our APIs in turn call one or more APIs hosted by partners. When a partner is having latency issues, the load balancer sometimes times out and a cancel signal is propagated through the reactive chain. In that scenario the traces are never sent to the trace agent, which makes it hard for us to detect and diagnose the timeouts.
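For context, a minimal sketch of the scenario described above (the partner hostname and path are hypothetical, not from the original report): a WebFlux endpoint that proxies a slow partner API. If the load balancer gives up after 30 seconds and closes the connection, Reactor propagates a cancel signal down the chain, so `doOnCancel` fires instead of `doOnSuccess`/`doOnError` and no response is written.

```java
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@RestController
public class PartnerProxyController {

    // Hypothetical partner base URL for illustration only.
    private final WebClient partnerClient = WebClient.create("https://partner.example.com");

    @PostMapping("/v4/test")
    public Mono<String> callPartner() {
        return partnerClient.post()
                .uri("/slow-endpoint")   // hypothetical slow partner path
                .retrieve()
                .bodyToMono(String.class)
                // Fires when the caller (here, the load balancer) aborts the request.
                .doOnCancel(() -> System.out.println("request canceled by client"))
                // Never reached in the timeout scenario, which is when the span stays unfinished.
                .doOnSuccess(body -> System.out.println("partner responded"));
    }
}
```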

I saw this issue which seems to be similar and is closed as fixed: https://github.com/DataDog/dd-trace-java/issues/1100

I am running version 0.99.0~c3bbd7bd60.

I enabled debug logging. In the normal scenario, a pair of log entries shows up for the root span, like so:

[dd.trace 2022-04-15 14:30:44:817 -0500] [reactor-http-epoll-1] DEBUG datadog.trace.agent.core.DDSpan - Started span: DDSpan [ t_id=597002061400167561, s_id=5087710417904218129, p_id=0 ] trace=apitest/netty.request/netty.request metrics= tags={language=jvm, runtime-id=4a329753-dab8-4784-b73c-72400c7069e4, thread.id=32, thread.name=reactor-http-epoll-1}, duration_ns=0

[dd.trace 2022-04-15 14:30:48:294 -0500] [reactor-http-epoll-1] DEBUG datadog.trace.agent.core.DDSpan - Finished span (ROOT_BUFFERED): DDSpan [ t_id=597002061400167561, s_id=5087710417904218129, p_id=0 ] trace=apitest/netty.request/POST /v4/test metrics= errored measured tags={_dd.agent_psr=1.0, _sample_rate=1, component=netty, http.hostname=api.local.myfico.com, http.method=POST, http.route=/v4/test, http.status_code=500, http.url=http://api.local.myfico.com:30003/v4/test, language=jvm, peer.ipv4=192.168.65.3, peer.port=59924, runtime-id=4a329753-dab8-4784-b73c-72400c7069e4, span.kind=server, thread.id=32, thread.name=reactor-http-epoll-1}, duration_ns=3476727600

In the timeout scenario, only the start-span log entry shows up for the root span:

[dd.trace 2022-04-15 14:21:36:343 -0500] [reactor-http-epoll-3] DEBUG datadog.trace.agent.core.DDSpan - Started span: DDSpan [ t_id=6642874906945167855, s_id=3286022000880915977, p_id=0 ] trace=apitest/netty.request/netty.request metrics= tags={language=jvm, runtime-id=4a329753-dab8-4784-b73c-72400c7069e4, thread.id=39, thread.name=reactor-http-epoll-3}, duration_ns=0

We also see two odd log entries that I wonder may be related:

[dd.trace 2022-04-15 14:21:37:659 -0500] [reactor-http-epoll-3] DEBUG datadog.trace.agent.tooling.Instrumenter$Default - Muzzled - instrumentation.names=[netty,netty-4.0] instrumentation.class=datadog.trace.instrumentation.netty40.ChannelFutureListenerInstrumentation instrumentation.target.classloader=jdk.internal.loader.ClassLoaders$AppClassLoader@9e89d68

[dd.trace 2022-04-15 14:21:37:659 -0500] [reactor-http-epoll-3] DEBUG datadog.trace.agent.tooling.Instrumenter$Default - Muzzled mismatch - instrumentation.names=[netty,netty-4.0] instrumentation.class=datadog.trace.instrumentation.netty40.ChannelFutureListenerInstrumentation instrumentation.target.classloader=jdk.internal.loader.ClassLoaders$AppClassLoader@9e89d68 muzzle.mismatch="datadog.trace.instrumentation.netty40.AttributeKeys:38 Missing method io.netty.util.AttributeKey#(Ljava/lang/String;)V"

Is it possible to capture the traces correctly when the client times out and a reactive cancel signal is propagated?
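In case it is useful, here is a hedged workaround sketch, not a confirmed fix: a WebFilter that tags and finishes the active span when the exchange is canceled. It assumes the Java agent has registered an OpenTracing-compatible tracer with io.opentracing.util.GlobalTracer (which requires the opentracing-api/opentracing-util jars on the application classpath), and whether it is safe to finish the netty.request span from application code is an assumption to verify.

```java
import io.opentracing.Span;
import io.opentracing.util.GlobalTracer;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

@Component
public class CanceledRequestTracingFilter implements WebFilter {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        // Capture whatever span is active when the request starts processing
        // (may be null if the agent has not activated a span on this thread).
        Span serverSpan = GlobalTracer.get().activeSpan();
        return chain.filter(exchange)
                .doOnCancel(() -> {
                    if (serverSpan != null) {
                        // Mark the span so canceled requests are searchable, then close it
                        // explicitly, since no response is written in this scenario.
                        serverSpan.setTag("request.canceled", true);
                        serverSpan.finish();
                    }
                });
    }
}
```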

devinsba commented 2 years ago

What version of WebFlux are you on? It's possible this is an issue that re-emerged with changes to Reactor and reactor-netty, or that we never fully fixed it; parts of this instrumentation have been rewritten since the PR referenced above.

deisenzimmer commented 2 years ago

We use spring-boot-starter-webflux 2.6.6, which appears to pull in spring-webflux 5.3.18.

One oddity is that the "muzzled mismatch" log entry refers to the netty40 instrumentation, but the starter brings in Netty 4.1.72. Is that mismatch expected?

devinsba commented 2 years ago

Yes, that mismatch is expected, since we have an alternate instrumentation for Netty that covers 4.1+.

Thanks for the version info; that will help triage the issue.

howardem commented 1 year ago

Hi @devinsba, we are running into exactly the same problem. All of our microservices are implemented with reactive Spring Boot / WebFlux. When a client closes the connection because of a timeout (for example, while our service is connecting to Redis), we can see the reactor.netty.channel.AbortedException in the pod console, but those errors never show up in Datadog.

We are running containerized reactive Spring Boot 2.7.15 applications on Java 17 (Eclipse Adoptium 17.0.8+7).
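For what it's worth, this is roughly how we reproduce the cancellation locally; a minimal sketch in which the endpoint name, port, and timeouts are illustrative rather than taken from the original report. A deliberately slow endpoint is called by a WebClient whose response timeout is shorter than the handler delay, so reactor-netty aborts the in-flight exchange and the server side observes a cancel / AbortedException while the netty.request span stays unfinished.

```java
import java.time.Duration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;
import reactor.netty.http.client.HttpClient;

@RestController
class SlowController {

    // Takes longer than the caller's timeout, so the caller gives up and closes the connection.
    @GetMapping("/slow")
    Mono<String> slow() {
        return Mono.just("done").delayElement(Duration.ofSeconds(10));
    }
}

class CancellingCaller {

    public static void main(String[] args) {
        // reactor-netty client that aborts the response after 2 seconds.
        HttpClient httpClient = HttpClient.create().responseTimeout(Duration.ofSeconds(2));
        WebClient client = WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .baseUrl("http://localhost:8080")   // assumes the slow app runs locally on 8080
                .build();

        client.get().uri("/slow")
                .retrieve()
                .bodyToMono(String.class)
                .onErrorResume(e -> {
                    // On the client this surfaces as a timeout; on the server the Datadog
                    // span for /slow is started but (per this issue) never finished.
                    System.out.println("caller gave up: " + e);
                    return Mono.empty();
                })
                .block();
    }
}
```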

howardem commented 1 year ago

Hi @deisenzimmer, were you able to figure out how to solve the mismatch issue with Datadog?

deisenzimmer commented 1 year ago

Hi @howardem, we never did solve the mismatch issue nor the issue overall. I think it's still occurring.

howardem commented 1 year ago

Hi @deisenzimmer, thanks for your response! Unfortunately the issue is still happening. We had it back when we were using the same stack as you (reactive Spring Boot 2.6.x), but even with Spring Boot 2.7.15, which relies on Netty 4.1.97, the problem still hasn't been resolved by the Datadog folks.

howardem commented 12 months ago

Hi @devinsba, it has been almost 4 years since this issue ("No APM traces logged on canceled requests in Reactor") was first reported, and it is still happening. Does the Datadog Engineering Team have any plan, or the will, to solve this problem any time soon? A lot of development teams use WebFlux for their reactive Spring Boot microservices. And given that Datadog exhibited at VMware Explore 2023 and was present at previous editions of SpringOne Platform, in my honest opinion it should have come up with a solution a long time ago.

This bug is a pressing issue for Engineering Teams using DD.

cc: @mcculls, @richardstartin, @bantonsson