DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0

Lack of Trace Context/Span Propagation in Asynchronous Reactive Operations #6272

Open ahmadwrites opened 7 months ago

ahmadwrites commented 7 months ago

Description

I am encountering an issue where trace context is not properly propagated across asynchronous reactive operations in my WebFlux service. I am currently using the dd-trace-java library for distributed tracing.

Issue Details

Bug

Our endpoint service/flow chains multiple reactive operations that make asynchronous requests to other services. We expect to receive one traceId containing the spans of all these operations, but understandably they are broken down into separate traceIds in Datadog, primarily at netty.request, with others being propagated at http.request. Ideally, we would like to view these as spans within one traceId, since they are all initiated by that flow/endpoint. As a result, we do not get full, continuous observability while using WebFlux in our project. Is there a way to correlate all spans in one trace without manually defining a span in each section of the reactive code?

amarziali commented 4 months ago

👋 Hello, the behaviour you are describing suggests a loss of context propagation. We have an integration that is disabled by default but can help with context propagation when using Reactor operators. Could you please try enabling it by either:

  • the system property -Ddd.trace.integration.reactor-hooks.enabled=true
  • the env variable: DD_TRACE_INTEGRATION_REACTOR_HOOKS_ENABLED=true

Please let us know if it mitigates the issue.

Cheers

Andrea
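
To illustrate what a loss of context propagation looks like in general, here is a minimal plain-Java sketch. It is not how dd-trace-java or its reactor-hooks integration is implemented; `ACTIVE_TRACE_ID`, `withCapturedContext`, and the class name are hypothetical stand-ins. It assumes the "active trace" lives in a thread-local, so work handed to another thread no longer sees it unless the context is captured at submission time and restored on the worker.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextPropagationSketch {
    // Hypothetical stand-in for a tracer's "active trace" slot.
    static final ThreadLocal<String> ACTIVE_TRACE_ID = new ThreadLocal<>();

    // Captures the caller's context now and restores it on whichever thread runs the task.
    static <T> Callable<T> withCapturedContext(Callable<T> task) {
        final String captured = ACTIVE_TRACE_ID.get();
        return () -> {
            final String previous = ACTIVE_TRACE_ID.get();
            ACTIVE_TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Put back whatever the worker thread had before (usually nothing).
                if (previous == null) {
                    ACTIVE_TRACE_ID.remove();
                } else {
                    ACTIVE_TRACE_ID.set(previous);
                }
            }
        };
    }

    // Returns {context seen without capture, context seen with capture}.
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            ACTIVE_TRACE_ID.set("trace-1");
            Callable<String> readContext = () -> String.valueOf(ACTIVE_TRACE_ID.get());
            String lost = pool.submit(readContext).get();
            String kept = pool.submit(withCapturedContext(readContext)).get();
            return new String[] {lost, kept};
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            ACTIVE_TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = run();
        System.out.println("without capture: " + seen[0]);
        System.out.println("with capture:    " + seen[1]);
    }
}
```

In this sketch the first task runs on the pool thread with no context at all, which is the analogue of a span ending up in a separate trace. The second task sees the caller's trace id because it was wrapped before being submitted.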

ahmadwrites commented 4 months ago

👋 Hello, the behaviour you are describing suggests a loss of context propagation. We have an integration that is disabled by default but can help with context propagation when using Reactor operators. Could you please try enabling it by either:

  • the system property -Ddd.trace.integration.reactor-hooks.enabled=true
  • the env variable: DD_TRACE_INTEGRATION_REACTOR_HOOKS_ENABLED=true

Please let us know if it mitigates that issue

Cheers

Andrea

Hi, I believe the reason for the cut-off is that R2DBC is not currently supported by DD. Will enabling this resolve the issue?

amarziali commented 4 months ago

Hello, we have never tested it in this scenario. However, R2DBC will need some additional instrumentation, and it is not supported today. I'll flag this as a feature request. Thanks for detailing the scenario.

ahmadwrites commented 4 months ago

Just an update: it did help link up some missing spans (understandably, the R2DBC operations are still missing). Adding it brought some improvements, thank you! Is there any estimate for the feature request?

iNviNho commented 6 days ago

We are having the same issue.

We are using Project Reactor, and we can see that the trace stops as soon as we fire HTTP calls inside our validations, which are wrapped in Mono.zip().


  @Override
  @Trace(
    operationName = JOBS,
    resourceName = "ExternalLockProcessManager::handle",
    noParent = true
  )
  public Mono<ExternalLockRequest.Data> handle(@NonNull final ExternalLockRequest.Data externalLockRequest) {
..
    return Mono.just(externalLockRequest)
...
      .flatMap(ext -> {
...
        return Mono.zip(
            userAllowedToTradeValidator.isUserAllowedToTradeValidation(externalLockRequest, user),
            userEligibilityValidator.isUserNotEligibleValidation(externalLockRequest, user, quoteAsset),
            tradingDisabledValidator.isTradingDisabledValidation(externalLockRequest, user))
          .map(___ -> dto);
      })
...

So instead of 1 trace, we are seeing approximately 2-4 SEPARATE traces, depending on whether all the asynchronous Monos had a chance to be executed and traced.


I quickly tried enabling DD_TRACE_INTEGRATION_REACTOR_HOOKS_ENABLED but without any luck 🤔

iNviNho commented 5 days ago

I found the issue. It was not an issue in the dd-trace-java package per se.

The issue was that the trace was closed before the Mono.zip(...) had a chance to finish.

My workaround was to start a custom span before the handle method is called and to finish it after the Mono has completed.

Something like this 👇 I hope it helps someone facing the same issue.


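          // Start a new root span manually instead of relying on @Trace on handle().
          final Span span = tracer.buildSpan(TracingConstants.JOBS)
            .withTag(DDTags.RESOURCE_NAME, lockManager.getClass().getSimpleName() + TracingConstants.HANDLE)
            .ignoreActiveSpan().start();
          // Make it the active span while the pipeline is assembled.
          final Scope scope = tracer.activateSpan(span);

          return Mono.just(externalLockRequestData)
            .flatMap(lockManager::handle)
            // Finish the span and close the scope only once the Mono terminates
            // (on completion, error, or cancellation), not when this method returns.
            .doFinally(signalType -> span.finish())
            .doFinally(signalType -> scope.close());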
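
The timing problem described above (the span ending when the method returns rather than when the asynchronous pipeline terminates) can be reproduced without any tracing library. The following is a minimal sketch under that assumption: `FakeSpan` is a hypothetical stand-in that only records events, and `CompletableFuture` is used in place of `Mono` so the example runs on the JDK alone. It does not use the dd-trace-java or Reactor APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class SpanLifetimeSketch {
    // Hypothetical stand-in for a tracing span: it only records when it was finished.
    static final class FakeSpan {
        private final List<String> events;
        FakeSpan(List<String> events) { this.events = events; }
        void finish() { events.add("span finished"); }
    }

    // Problem case: the span is finished when the method returns,
    // before the asynchronous work has actually run.
    static List<String> finishOnReturn() {
        List<String> events = new ArrayList<>();
        FakeSpan span = new FakeSpan(events);
        CompletableFuture<Void> gate = new CompletableFuture<>();
        CompletableFuture<Void> work = gate.thenRun(() -> events.add("async work done"));
        span.finish();        // the traced method "returns" here
        gate.complete(null);  // the pipeline only runs afterwards
        work.join();
        return events;
    }

    // Workaround shape: finishing the span is attached to the pipeline's termination.
    static List<String> finishOnCompletion() {
        List<String> events = new ArrayList<>();
        FakeSpan span = new FakeSpan(events);
        CompletableFuture<Void> gate = new CompletableFuture<>();
        CompletableFuture<Void> work = gate
            .thenRun(() -> events.add("async work done"))
            .whenComplete((result, error) -> span.finish()); // runs on success or failure
        gate.complete(null);
        work.join();
        return events;
    }

    public static void main(String[] args) {
        System.out.println("finish on return:     " + finishOnReturn());
        System.out.println("finish on completion: " + finishOnCompletion());
    }
}
```

In the first method the span is recorded as finished before the work it was meant to cover, so any spans created by that work have no open parent. In the second, the finish call is tied to the pipeline itself, which plays the same role as `doFinally` in the Reactor snippet.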