DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
639 stars 300 forks source link

Kotlin coroutine instrumentation missing spans in trace #7215

Closed wcchoi closed 1 month ago

wcchoi commented 10 months ago

Hi, I am noticing some spans in a trace are missing when enabling Kotlin coroutine instrumentation, eg when withTimeout or flowOn are used

Without enabling Kotlin coroutine instrumentation, traces are random because the datadog java agent uses ThreadLocals to store some states and coroutine could move between threads

This is a repo to reproduce the issue https://github.com/wcchoi/test-datadog-agent-coroutine

ericloe-cash commented 3 months ago

As I understand the issue having looked into this a bit, the primary cause of this is the datadog tracing agent making assumptions that job completion is equivalent to thread context switching.

runBlocking (MDCContext()) {
  // do stuff

  coroutineScope { 
    // do more stuff

  } // Job ends here and dd terminates the scope on the thread

  // underlying kotlin coroutines do not restore thread context here, it's resumed undispatched and the thread does not change.
  // trace_id is missing until some context switch occurs.
}

This works:

runBlocking (MDCContext()) { // This is coroutine#1 with a job
  // do stuff

  withContext(MdcContext() {
    coroutineScope { // This is still coroutine#1 with a parent job of UndispatchedCoroutine
      // do more stuff

    }
  }  // Job ends here and dd terminates the scope on the thread

  // underlying kotlin coroutines restores thread context here and the span is restored.
  // trace_id correct
}

I dont really see why there is a hook on both thread context and job completion?
Just managing thread context should suffice? These two concepts are mixing and creating unpredictable results.

ericloe-cash commented 3 months ago

Hi @PerfectSlayer - thanks for officially documenting the broken behavior in #8238

Is there any visibility on a timeline for a fix? We are actively working to support greater usage of coroutines in our services but this is a major blocker for broader adoption. Unfortunately previous attempts at working around this actually introduced worse runtime bugs that no one really noticed and have spent a lot effort trying to diagnose and work around this to no avail.

Would be great to adopt an official solution!

Any out of the box ideas on a workaround? I had played around with disabling async propagation and manually managing scopes, but it seems the internal ScopeStateCoroutineContext is doing some things that I dont have access to in the public API surface. Simply activating and closing Spans was not resulting in the dd tracer carrying span across threads, presumably due to some internal thread local we cant manage around.

TIA!

ericloe-cash commented 3 months ago

Hi @PerfectSlayer - just bumping this thread - really hoping for a fix soon on this.

PerfectSlayer commented 2 months ago

Is there any visibility on a timeline for a fix?

There are active development about context tracking, which will allow open to better support for Koltin coroutine context tracking but no ETA to share yet. It is still something we want and plan to address.

mcculls commented 1 month ago

Hi @wcchoi @ericloe-cash this will be fixed in the upcoming v1.48.0 release by https://github.com/DataDog/dd-trace-java/pull/8651

If you'd like to try a pre-release snapshot you can use https://oss.sonatype.org/content/repositories/snapshots/com/datadoghq/dd-java-agent/1.48.0-SNAPSHOT/dd-java-agent-1.48.0-20250404.112215-102.jar

github-actions[bot] commented 1 month ago

:robot: This issue has been addressed in the latest release. Check full details in the Release Notes.

ericloe-cash commented 1 week ago

Hi @mcculls @PerfectSlayer - I've tried this out and I think things are better but I'm seeing issues with the MDC trace/span id propagation.

Anytime there is a scope removed the MDC is cleared, any time a scope is added the MDC is updated. So it seems when there's a depth > 1 and a scope is removed the MDC is not updated. I know from the other MDC values that the custom values are following what's expected from MDCContext.

    logDetails("before runBlocking")
    runBlocking(MDCContext()) {
      logDetails("before withTimeout", coroutineContext)
      withTimeout(1000) {
        logDetails("before delay", coroutineContext)
        delay(5)

        // Original MDC is restored here on delay resumption
        logDetails("after delay", coroutineContext) 
      }
      // MDC is preserved from "after delay", but trace_id/span_id is cleared.
      logDetails("after withTimeout", coroutineContext)
    }

    // Original MDC is restored here on on resumption 
    logDetails("after runBlocking")
  fun logDetails(log: String, coroutineContext: CoroutineContext? = null) {
      val activeTrace = CorrelationIdentifier.getTraceId()
      val activeSpan = CorrelationIdentifier.getSpanId()

      MDC.put("cash.test.${id.incrementAndGet()}", log)

      logger.info(
        StringBuilder()
          .appendLine(log)
          .appendLine("active_trace: $activeTrace")
          .appendLine("active_span: $activeSpan")
          .appendLine("mdc: ${MDC.getCopyOfContextMap()}")
          .toString()
      )
    }

I added a scope listener so I can see that the active trace/spans seem to be holding correctly, but the log correlation is lost.

      val ddTracer = DDGlobalTracer.get()
      val internalTracer = ddTracer as? InternalTracer
      internalTracer?.addScopeListener({ logDetails("scopeAdded") }, { logDetails("scopeRemoved") })
      ddTracer.addTraceInterceptor(
        object : TraceInterceptor {
          override fun onTraceComplete(trace: MutableCollection<out MutableSpan>?): MutableCollection<out MutableSpan> {
            logDetails("onTraceCompleted: ${trace}")

            return trace ?: mutableListOf()
          }

          override fun priority(): Int = 98122
        }
      )
Image
ericloe-cash commented 1 week ago

Never mind - our framework was adding another scope listener that cleared the MDC 🤦‍♂

ericloe-cash commented 1 week ago

@PerfectSlayer @mcculls

Though looking at LogContextScopeListener it seems like this might be the expected implementation. Should afterScopeClosedCallback be called if we are exiting a nested scope? Since we don't get a corresponding afterScopeActivatedCallback we are in kind of grey area. The "fix" for removing the trace ids is to no-op here.

I also tried setting the MDC to the current scope, but it seems the tracer is always reporting the previously active scope in afterScopeClosedCallback

fun afterScopeActivated() {
    try {
      MDC.put(MDC_TRACE_ID, CorrelationIdentifier.getTraceId())
      MDC.put(MDC_SPAN_ID, CorrelationIdentifier.getSpanId())
      MDC.put(MDC_DD_TRACE_ID, CorrelationIdentifier.getTraceId())
      MDC.put(MDC_DD_SPAN_ID, CorrelationIdentifier.getSpanId())
    } catch (e: Exception) {
      log.debug("Exception setting log context context", e)
    }
  }

  fun afterScopeClosed() {
    try {
      MDC.put(MDC_TRACE_ID, CorrelationIdentifier.getTraceId())
      MDC.put(MDC_SPAN_ID, CorrelationIdentifier.getSpanId())
      MDC.put(MDC_DD_TRACE_ID, CorrelationIdentifier.getTraceId())
      MDC.put(MDC_DD_SPAN_ID, CorrelationIdentifier.getSpanId())
    } catch (e: Exception) {
      log.debug("Exception removing log context context", e)
    }
  }
Image
mcculls commented 1 week ago

Hi @ericloe-cash - afterScopeClosed is called whenever a scope is finally closed, and you should be getting a balance of afterScopeClosed and afterScopeActivated calls.

I've found a couple of edge-cases during further testing where the current scope doesn't match what we'd expect, which could affect these callbacks. I'm updating our coroutines instrumentation to handle these edge-cases.