DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
575 stars 287 forks source link

StackOverflowError with Datadog agent and Spring Webflux project #1742

Closed htamas closed 3 years ago

htamas commented 4 years ago

We have an application built on Java 11 + Spring Webflux (reactive) + Netty that's instrumented by Datadog. Upgrading the dd-java-agent from 0.45.0 to >0.55.1 seems to cause a very serious issue in which the request chain is never completed (!!).

There is no logging of the error, and also no trace logged in DataDog itself, no logs output to stdout etc. It simply just dies - as if it never happened. We were eventually able to further pinpoint the issue by hooking up a remote debugger.

What happens is that the host in question works just fine initially, but starts failing all requests after some time serving production traffic. The request is never fulfilled by the instrumented server, and the client never sees a response. Once a host is in this state, it seems to be stuck there and is symptomatic for all subsequent requests. Debugging reveals that downstream calls (using the Spring reactive WebClient) actually fail silently with a StackOverflowError with what seems like a circular subscription scenario in the DataDog tracing code. The breakpoint is set at the (reactive) execution of the downstream call.

The error in question:

java.lang.StackOverflowError
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:137)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:132)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:137)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
// ... and so on...

Downgrading the service to datadog agent 0.45.0 seems to fix the issue.

Some app details: Sprint Boot 2.2.7.RELEASE (Spring Cloud Gateway) Log4j2 + Logback Spring Cloud Sleuth (disabling this made no difference) We are using automatic instrumentation Our trace config is:

DD_TRACE_DB_CLIENT_SPLIT_BY_INSTANCE=true
DD_TRACE_ANALYTICS_ENABLED=true

We already confirmed the case by Datadog support, but since it seems hard to get to the bottom of the issue, we hope to have a discussion with the public who had similar issues.

htamas commented 4 years ago

Could be related to https://github.com/DataDog/dd-trace-java/issues/1306 ?

devinsba commented 4 years ago

Could you post as much of the stacktrace as you can? This is likely some odd interaction with the internals of reactor, so we will likely need to chat with them a bit. Being able to provide as much info as possible would be great

htamas commented 4 years ago

The stacktrace loops over these 4 lines all the way down:

java.lang.StackOverflowError
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:137)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:132)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:137)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:132)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:137)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:132)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:137)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:132)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:137)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:132)
    at datadog.trace.instrumentation.reactor.core.TracingSubscriber.request(TracingSubscriber.java:114)
htamas commented 4 years ago

Created a pastebin for the "full" stacktrace, as it is more than a thousand lines: https://pastebin.com/jJD0uJRk

randomanderson commented 3 years ago

This issue was likely fixed with #1876 and #2037. Could you please try the 0.67.0 release?

htamas commented 3 years ago

@randomanderson yes this helped. Thanks