Closed wcchoi closed 1 month ago
As I understand the issue having looked into this a bit, the primary cause of this is the datadog tracing agent making assumptions that job completion is equivalent to thread context switching.
runBlocking (MDCContext()) {
// do stuff
coroutineScope {
// do more stuff
} // Job ends here and dd terminates the scope on the thread
// underlying kotlin coroutines do not restore thread context here, it's resumed undispatched and the thread does not change.
// trace_id is missing until some context switch occurs.
}
This works:
runBlocking (MDCContext()) { // This is coroutine#1 with a job
// do stuff
withContext(MdcContext() {
coroutineScope { // This is still coroutine#1 with a parent job of UndispatchedCoroutine
// do more stuff
}
} // Job ends here and dd terminates the scope on the thread
// underlying kotlin coroutines restores thread context here and the span is restored.
// trace_id correct
}
I dont really see why there is a hook on both thread context and job completion?
Just managing thread context should suffice? These two concepts are mixing and creating unpredictable results.
Hi @PerfectSlayer - thanks for officially documenting the broken behavior in #8238
Is there any visibility on a timeline for a fix? We are actively working to support greater usage of coroutines in our services but this is a major blocker for broader adoption. Unfortunately previous attempts at working around this actually introduced worse runtime bugs that no one really noticed and have spent a lot effort trying to diagnose and work around this to no avail.
Would be great to adopt an official solution!
Any out of the box ideas on a workaround? I had played around with disabling async propagation and manually managing scopes, but it seems the internal ScopeStateCoroutineContext is doing some things that I dont have access to in the public API surface. Simply activating and closing Spans was not resulting in the dd tracer carrying span across threads, presumably due to some internal thread local we cant manage around.
TIA!
Hi @PerfectSlayer - just bumping this thread - really hoping for a fix soon on this.
Is there any visibility on a timeline for a fix?
There are active development about context tracking, which will allow open to better support for Koltin coroutine context tracking but no ETA to share yet. It is still something we want and plan to address.
Hi @wcchoi @ericloe-cash this will be fixed in the upcoming v1.48.0 release by https://github.com/DataDog/dd-trace-java/pull/8651
If you'd like to try a pre-release snapshot you can use https://oss.sonatype.org/content/repositories/snapshots/com/datadoghq/dd-java-agent/1.48.0-SNAPSHOT/dd-java-agent-1.48.0-20250404.112215-102.jar
:robot: This issue has been addressed in the latest release. Check full details in the Release Notes.
Hi @mcculls @PerfectSlayer - I've tried this out and I think things are better but I'm seeing issues with the MDC trace/span id propagation.
Anytime there is a scope removed the MDC is cleared, any time a scope is added the MDC is updated. So it seems when there's a depth > 1 and a scope is removed the MDC is not updated. I know from the other MDC values that the custom values are following what's expected from MDCContext.
logDetails("before runBlocking")
runBlocking(MDCContext()) {
logDetails("before withTimeout", coroutineContext)
withTimeout(1000) {
logDetails("before delay", coroutineContext)
delay(5)
// Original MDC is restored here on delay resumption
logDetails("after delay", coroutineContext)
}
// MDC is preserved from "after delay", but trace_id/span_id is cleared.
logDetails("after withTimeout", coroutineContext)
}
// Original MDC is restored here on on resumption
logDetails("after runBlocking")
fun logDetails(log: String, coroutineContext: CoroutineContext? = null) {
val activeTrace = CorrelationIdentifier.getTraceId()
val activeSpan = CorrelationIdentifier.getSpanId()
MDC.put("cash.test.${id.incrementAndGet()}", log)
logger.info(
StringBuilder()
.appendLine(log)
.appendLine("active_trace: $activeTrace")
.appendLine("active_span: $activeSpan")
.appendLine("mdc: ${MDC.getCopyOfContextMap()}")
.toString()
)
}
I added a scope listener so I can see that the active trace/spans seem to be holding correctly, but the log correlation is lost.
val ddTracer = DDGlobalTracer.get()
val internalTracer = ddTracer as? InternalTracer
internalTracer?.addScopeListener({ logDetails("scopeAdded") }, { logDetails("scopeRemoved") })
ddTracer.addTraceInterceptor(
object : TraceInterceptor {
override fun onTraceComplete(trace: MutableCollection<out MutableSpan>?): MutableCollection<out MutableSpan> {
logDetails("onTraceCompleted: ${trace}")
return trace ?: mutableListOf()
}
override fun priority(): Int = 98122
}
)
Never mind - our framework was adding another scope listener that cleared the MDC 🤦♂
@PerfectSlayer @mcculls
Though looking at LogContextScopeListener it seems like this might be the expected implementation. Should afterScopeClosedCallback
be called if we are exiting a nested scope? Since we don't get a corresponding afterScopeActivatedCallback
we are in kind of grey area. The "fix" for removing the trace ids is to no-op here.
I also tried setting the MDC to the current scope, but it seems the tracer is always reporting the previously active scope in afterScopeClosedCallback
fun afterScopeActivated() {
try {
MDC.put(MDC_TRACE_ID, CorrelationIdentifier.getTraceId())
MDC.put(MDC_SPAN_ID, CorrelationIdentifier.getSpanId())
MDC.put(MDC_DD_TRACE_ID, CorrelationIdentifier.getTraceId())
MDC.put(MDC_DD_SPAN_ID, CorrelationIdentifier.getSpanId())
} catch (e: Exception) {
log.debug("Exception setting log context context", e)
}
}
fun afterScopeClosed() {
try {
MDC.put(MDC_TRACE_ID, CorrelationIdentifier.getTraceId())
MDC.put(MDC_SPAN_ID, CorrelationIdentifier.getSpanId())
MDC.put(MDC_DD_TRACE_ID, CorrelationIdentifier.getTraceId())
MDC.put(MDC_DD_SPAN_ID, CorrelationIdentifier.getSpanId())
} catch (e: Exception) {
log.debug("Exception removing log context context", e)
}
}
Hi @ericloe-cash - afterScopeClosed
is called whenever a scope is finally closed, and you should be getting a balance of afterScopeClosed
and afterScopeActivated
calls.
I've found a couple of edge-cases during further testing where the current scope doesn't match what we'd expect, which could affect these callbacks. I'm updating our coroutines instrumentation to handle these edge-cases.
Hi, I am noticing some spans in a trace are missing when enabling Kotlin coroutine instrumentation, eg when
withTimeout
orflowOn
are usedWithout enabling Kotlin coroutine instrumentation, traces are random because the datadog java agent uses ThreadLocals to store some states and coroutine could move between threads
This is a repo to reproduce the issue https://github.com/wcchoi/test-datadog-agent-coroutine