Netflix / dgs-framework

GraphQL for Java with Spring Boot made easy.
https://netflix.github.io/dgs
Apache License 2.0

bug: OpenTelemetry Context is not getting propagated in the data loader #1928

Closed by govi20 3 weeks ago

govi20 commented 4 months ago


I have an async resolver that I use together with a data loader. For traceability, I am using OpenTelemetry with manual instrumentation. The problem is that the OpenTelemetry context is lost inside the data loader's load method. This happens even before the task is offloaded to the context-wrapped executor.

Expected behavior

The OTel span and context should be available in the data loader.

Actual behavior

The OTel span and context are not present. The span and trace IDs come back as all zeros: 0000000000000000 and 00000000000000000000000000000000.

Steps to reproduce

Sharing the sample code snippet.

Resolver:

  @DgsData(parentType = "Department", field = "employees")
  public CompletionStage<Employees> getDepartmentEmployees(final DataFetchingEnvironment dfe) {
    System.out.println(tracer.getSpan().getSpanContext().getTraceId()); // prints the expected trace ID
    System.out.println(tracer.getSpan().getSpanContext().getSpanId());  // prints the expected span ID

    Context context = Context.current(); // Context holds the correct trace and span.

    Department department = dfe.getSource();
    DataLoader<Department, Employees> batchDataLoader =
        dfe.getDataLoader("employeesDataLoader");

    return batchDataLoader.load(department);
  }

Data Loader:

  @Override
  public CompletionStage<List<Employees>> load(List<Department> departments) {
    System.out.println(tracer.getSpan().getSpanContext().getTraceId()); // 00000000000000000000000000000000
    System.out.println(tracer.getSpan().getSpanContext().getSpanId());  // 0000000000000000

    Context context = Context.current(); // empty object

    return CompletableFuture.supplyAsync(
        () -> departmentService.getEmployees(departments), contextWrappedThreadPool);
  }

I have tried this on DGS 8.5.5 with graphql-java 21.5, and tested it on both platform and virtual threads.


govi20 commented 3 months ago

@paulbakker can you please share your insights on this?

srinivasankavitha commented 3 months ago

I don't think this addresses the issue you are experiencing, but can you retest with the latest DGS release? We changed the way data loaders are implemented here: https://github.com/Netflix/dgs-framework/releases/tag/v8.7.1

For your scenario - where does Context come from? If it isn't propagating, I would suspect an underlying issue with the tracer library, because the framework does nothing special to propagate context in such scenarios. Thread-local context can only be lost when a thread switch happens, so as long as the executor is set up for context propagation, it should work. However, as you point out, in this case the context is lost even before any thread switching occurs, which points to some issue with the tracer itself.
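For reference, a context-propagating executor captures the submitter's context at submission time and restores it on the worker thread. Below is a minimal stdlib-only sketch of that mechanism, using a plain ThreadLocal as a stand-in for OpenTelemetry's Context; the wrapper plays the role that something like OTel's Context.taskWrapping(executor) plays in a real setup:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WrappedExecutorDemo {
    // Stand-in for OpenTelemetry's thread-local Context (illustration only).
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Wrap an executor so each submitted task runs with the context that was
    // current at submission time.
    static Executor contextWrapping(Executor delegate) {
        return task -> {
            String captured = CONTEXT.get();      // capture on the submitting thread
            delegate.execute(() -> {
                String previous = CONTEXT.get();
                CONTEXT.set(captured);            // restore on the worker thread
                try {
                    task.run();
                } finally {
                    CONTEXT.set(previous);        // avoid leaking into pooled threads
                }
            });
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CONTEXT.set("trace-1234");

        CompletableFuture<String> bare =
            CompletableFuture.supplyAsync(() -> String.valueOf(CONTEXT.get()), pool);
        CompletableFuture<String> wrapped =
            CompletableFuture.supplyAsync(() -> String.valueOf(CONTEXT.get()),
                                          contextWrapping(pool));

        System.out.println(bare.get());     // the pool thread never saw the context
        System.out.println(wrapped.get());  // captured at submission, so it survives
        pool.shutdown();
    }
}
```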

govi20 commented 3 months ago

@srinivasankavitha Context comes from the OpenTelemetry Java library.

I am adding events to the OTel span before making any downstream service calls from the GraphQL service. But as the Context is not propagated, these span events are lost.

BTW, I was able to reproduce it with 8.7.1.

kilink commented 1 month ago

I looked into this a bit, and I don't think it's a bug in DGS or graphql-java, but maybe a misunderstanding of how DataLoaders work. When you call the load method on your DataLoader, it is not invoked immediately, since batching is enabled by default. That means TracingAspect.traceAsyncField closes the scope before the loader is even called, which explains why Context.current() is reset by the time the loader runs.
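The deferral can be sketched without any GraphQL machinery: if load() merely enqueues work and the tracing scope is closed before the batch is dispatched, the loader observes an empty context even though no thread switch occurred. A minimal stand-in sketch (ThreadLocal in place of the OTel Context, a plain queue in place of the DataLoader's batch queue):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class DeferredLoadDemo {
    // Stand-in for OpenTelemetry's thread-local Context (illustration only).
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();
    // Stand-in for the DataLoader's batch queue: "load()" only enqueues work.
    static final Queue<Runnable> batchQueue = new ArrayDeque<>();
    static String seenByLoader;

    public static void main(String[] args) {
        // The resolver runs inside an open scope ...
        CONTEXT.set("trace-1234");
        batchQueue.add(() -> seenByLoader = CONTEXT.get()); // load() just queues
        CONTEXT.remove(); // ... and the scope is closed before dispatch

        // Later, dispatch drains the queue -- same thread, but the scope is gone.
        batchQueue.forEach(Runnable::run);
        System.out.println(seenByLoader);
    }
}
```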

You can verify that the whole issue is caused by batching by disabling it on your DataLoader:

@DgsDataLoader(name = "departmentDataLoader", batching = false)
public class DepartmentDataLoader implements BatchLoader<Parameter, Department>

When I tried this locally, the context was correct in the DataLoader load method. Obviously this isn't a proper solution, since the whole point of data loaders is to batch things and avoid the N+1 problem, so I think it requires reworking how you're doing things.

Explicitly passing the context into the DataLoader works, I think, but then you need to explicitly set it in the thread-local context or pass it to your executor. Another option is to implement BatchLoaderWithContext, and either pass a per-key context when calling load(key, context), or configure a context supplier in the DataLoader options (see: https://github.com/graphql-java/java-dataloader?tab=readme-ov-file#calling-the-batch-loader-function-with-call-context-environment). For the latter, I don't think DGS currently exposes a simple way of setting the context provider; you'd need to provide a custom DgsDataLoaderOptionsProvider bean.
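A sketch of the BatchLoaderWithContext option, assuming java-dataloader and the OpenTelemetry context API are on the classpath. EmployeesBatchLoader, DepartmentService, and departmentService are hypothetical names carried over from the snippets above; note the Context is captured eagerly when the options are built, since the provider itself is only invoked at dispatch time, when the scope may already be closed:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import org.dataloader.BatchLoaderEnvironment;
import org.dataloader.BatchLoaderWithContext;
import org.dataloader.DataLoaderOptions;

public class EmployeesBatchLoader implements BatchLoaderWithContext<Department, Employees> {

  private final DepartmentService departmentService; // hypothetical service from the snippets above

  EmployeesBatchLoader(DepartmentService departmentService) {
    this.departmentService = departmentService;
  }

  @Override
  public CompletionStage<List<Employees>> load(List<Department> departments,
                                               BatchLoaderEnvironment env) {
    // The Context supplied via the options' context provider, not the
    // (empty) thread-local context at dispatch time.
    Context ctx = (Context) env.getContext();
    return CompletableFuture.supplyAsync(() -> {
      try (Scope ignored = ctx.makeCurrent()) {
        return departmentService.getEmployees(departments); // runs with the span restored
      }
    });
  }

  // Capture the caller's Context when the options are built (e.g., once per
  // request), so the provider hands the loader a live context. In DGS this
  // would be wired through a custom DgsDataLoaderOptionsProvider bean.
  static DataLoaderOptions optionsCapturing(Context captured) {
    return DataLoaderOptions.newOptions()
        .setBatchLoaderContextProvider(() -> captured);
  }
}
```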

govi20 commented 1 month ago

@kilink Thanks for sharing your input.

I didn't understand your point about traceAsyncField closing the scope. Even if the traceAsyncField method finishes, it won't end the scope; I close the span only once the completable task is completed. Am I missing something?

paulbakker commented 3 weeks ago

Like @kilink explains, I don't think there's anything in the framework we can do about this. Closing the issue as it is not actionable.