elastic / apm-agent-java

https://www.elastic.co/guide/en/apm/agent/java/current/index.html
Apache License 2.0
560 stars 320 forks source link

OTel SpanLink results in ClassCastException #3662

Open NicklasWallgren opened 1 month ago

NicklasWallgren commented 1 month ago

Describe the bug

The OTelSpanBuilder is unable to handle linked AutoValue_ImmutableSpanContext.

Should io.opentelemetry.api.trace.SpanContext.create() be instrumented and return a OTelSpanContext instead?

java.lang.ClassCastException: class io.opentelemetry.api.internal.AutoValue_ImmutableSpanContext cannot be cast to class co.elastic.apm.agent.opentelemetry.tracing.OTelSpanContext (io.opentelemetry.api.internal.AutoValue_ImmutableSpanContext is in unnamed module of loader 'app'; co.elastic.apm.agent.opentelemetry.tracing.OTelSpanContext is in unnamed module of loader co.elastic.apm.agent.bci.classloading.IndyPluginClassLoader @7853756f)
    at co.elastic.apm.agent.opentelemetry.tracing.OTelSpanBuilder.startSpan(OTelSpanBuilder.java:198)

https://github.com/elastic/apm-agent-java/blob/0e1fb43f64076fe77f739609513a1f8e7e1cff18/apm-agent-plugins/apm-opentelemetry/apm-opentelemetry-plugin/src/main/java/co/elastic/apm/agent/opentelemetry/tracing/OTelSpanBuilder.java#L198

Steps to reproduce

final Tracer tracer = GlobalOpenTelemetry.get()
    .tracerBuilder("")
    .setInstrumentationVersion("0.10.0")
    .build();

final SpanContext parentSpanContext = SpanContext.create(
  "f1f6ae85a6cbd46ef0fc12ca201a5954", "e9effc2a1b309797", TraceFlags.getDefault(), TraceState.getDefault());

final Span span =  tracer
    .spanBuilder("queue.task")
    .setSpanKind(SpanKind.CONSUMER)
    .setParent(Context.current())
    .addLink(parentSpanContext)
    .startSpan();

Expected behavior

The linked span context should work as expected.

Solution

I began implementing a SpanContextOpenTelemetryInstrumentation, but was unable to set the traceId in TraceContext without using reflection.

public class SpanContextOpenTelemetryInstrumentation extends AbstractOpenTelemetryInstrumentation {

    @Override
    public ElementMatcher<? super TypeDescription> getTypeMatcher() {
        return named("io.opentelemetry.api.trace.SpanContext");
    }

    @Override
    public ElementMatcher<? super MethodDescription> getMethodMatcher() {
        return named("create");
    }

    @Override
    public String getAdviceClassName() {
        return "co.elastic.apm.agent.opentelemetry.SpanContextOpenTelemetryInstrumentation$SpanContextOpenTelemetryAdvice";
    }

    public static class SpanContextOpenTelemetryAdvice {
        @Advice.AssignReturned.ToReturned
        @Advice.OnMethodExit(suppress = Throwable.class, onThrowable = Throwable.class, inline = false)
        public static OTelSpanContext onExit(@Advice.Argument(0) String traceHexId, @Advice.Argument(1) String spanHexId) {
            Id traceId = Id.new128BitId();
            traceId.fromHexString(traceHexId, 0);

            Id spanId = Id.new64BitId();
            spanId.fromHexString(spanHexId, 0);

            // TODO handle traceFlags and traceState

            TraceContext traceContext = TraceContext.ofId(spanId, GlobalTracer.get().require(ElasticApmTracer.class)); <--- new static factory method

            return new OTelSpanContext(traceContext);
        }
    }
}
SylvainJuge commented 3 weeks ago

Hi,

I've managed to reproduce the error that you are facing here and the agent should not trigger a ClassCastException and should instead either silently ignore or issue a warning.

The problem here is that you are manually creating a SpanContext with OpenTelemetry API and using this as a span link. Our OpenTelemetry implementation currently does not support arbitrary SpanContext like you did.

Can you elaborate a bit on the use-case that you are trying to achieve here ? Maybe there is a simpler option by using a span attribute as a work-around for your use case.

SylvainJuge commented 3 weeks ago

I have just implemented a "work-around" that just properly warns that using arbitrary spans context is not supported with the otel bridge. https://github.com/elastic/apm-agent-java/pull/3672

A snapshot is available here if you'd like to test it.

NicklasWallgren commented 3 weeks ago

Hi,

I've managed to reproduce the error that you are facing here and the agent should not trigger a ClassCastException and should instead either silently ignore or issue a warning.

The problem here is that you are manually creating a SpanContext with OpenTelemetry API and using this as a span link. Our OpenTelemetry implementation currently does not support arbitrary SpanContext like you did.

Can you elaborate a bit on the use-case that you are trying to achieve here ? Maybe there is a simpler option by using a span attribute as a work-around for your use case.

Thanks for the reply.

Our goal is to spin up a new trace per asynchronous job, and associate the newly created trace with the one that enqueued the job. Similar to what is described here; https://opentelemetry.io/docs/concepts/signals/traces/#span-links

Request (Trace 1) --> Enqueue Job (metadata: T1TraceId, T1SpanId) -> Asynchronous Job Executed (Spin up Trace 2, and associated with Trace 1).

Is this achievable using the otel bridge, or the elastic apm api?

SylvainJuge commented 3 weeks ago

Do you have a strong requirement to use Elastic APM agent here ?

As you are already using the OpenTelemetry API, it might be more natural to use an opentelemetry agent and not the opentelemetry bridge in our APM agent.

I would suggest to use our new opentelemetry java distribution or the upstream opentelemetry instrumentation agent, the data captured by those can be ingested as-is in APM.

NicklasWallgren commented 3 weeks ago

Do you have a strong requirement to use Elastic APM agent here ?

As you are already using the OpenTelemetry API, it might be more natural to use an opentelemetry agent and not the opentelemetry bridge in our APM agent.

I would suggest to use our new opentelemetry java distribution or the upstream opentelemetry instrumentation agent, the data captured by those can be ingested as-is in APM.

Thanks for the reply. The opentelemetry-agent might be the right way to go.

Will the https://github.com/elastic/elastic-otel-java/ replace the apm-agent-java, in the long run?

jackshirazi commented 3 weeks ago

Will the https://github.com/elastic/elastic-otel-java/ be replaced by the apm-agent-java, in the long run?

The classic Elastic APM Java agent has many features not yet available in the OpenTelemetry distribution. We are implementing these in our distribution and gradually contributing these upstream to OpenTelemetry. This is likely to take some time, so the long run could be quite long and ultimately it will come down to customer demand. Which is a long-winded way of saying maybe