elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html
Other
1.22k stars 523 forks source link

OpenTelemetry active message consumption spans are incorrectly classified as transactions #7001

Open axw opened 2 years ago

axw commented 2 years ago

APM Server converts OpenTelemetry spans to either transactions or spans, depending on several criteria:

The issue with converting all consumer OTel spans to Elastic APM transactions is that not all consumer spans represent service entrypoints. For example, a message may be actively consumed (e.g. with active polling) within another OTel span. In this case, the outer span should become an Elastic APM transaction, and the message consumption span should become an Elastic APM span.

We require additional information to be recorded in OTLP for this: see https://github.com/open-telemetry/oteps/pull/182

ymotongpoo commented 1 year ago

Is this still the case? I tried this with a small snippet on v8.5.1 and all traces are recorded as transactions, not traces.

Untitled image
axw commented 1 year ago

Hi @ymotongpoo! Yes, this is still the case.

I tried this with a small snippet on v8.5.1 and all traces are recorded as transactions, not traces.

Can you share the snippet? I can't tell from the screenshot which spans should or should not be considered transactions.

The code relevant to this bug is: https://github.com/elastic/apm-server/blob/efc8be5b0b07b17cfc7dcfdee44dd248f86e01e4/internal/processor/otel/traces.go#L196-L225

If/when the linked OTEP is accepted and implemented, then instead of checking if root || otelSpan.Kind() == ptrace.SpanKindServer || otelSpan.Kind() == ptrace.SpanKindConsumer { we could check if root || otelSpan.HasRemoteParent() {. Only in that case would we translate to an Elastic APM transaction.

ymotongpoo commented 1 year ago

Hi @axw, thanks for the response! Here's the code snippet I used for the demo.

func handler(w http.ResponseWriter, r *http.Request) {
    q := r.URL.Query()
    c := q.Get("c")
    n, err := strconv.Atoi(c)
    if err != nil {
        n = 2
    }
    if n > 8 {
        n = 8
    }
    ctx, span := otel.Tracer("handler").Start(r.Context(), "sample.handler.call.foo")
    foo(ctx, n)
    span.End()
    w.WriteHeader(http.StatusOK)
    fmt.Fprintf(w, "requested: %v, executed: %v", c, n)
}

func foo(ctx context.Context, n int) {
    if n == 0 {
        return
    }
    ctx, span := otel.Tracer("handler").Start(ctx, "sample.foo")
    span.SetAttributes(attribute.Key("iteration").Int(n))
    time.Sleep(interval)
    span.AddEvent(fmt.Sprintf("mid-interval-%v", n))
    time.Sleep(interval)
    foo(ctx, n-1)
    span.End()
}

I didn't used OpenTelemetry Collector and directly sent the traces with OTLP exporter from the app.

    // OTLP exporter config for Elastic Observability
    otlpexp, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint(OTLPEndpoint),
        otlptracegrpc.WithHeaders(OTLPHeaders),
    )
    if err != nil {
        return nil, err
    }

Because this is a standalone demo running on Cloud Run and doesn't have other depending services, so the traces should be either of:

axw commented 1 year ago

I see, thanks. This is not exactly the same issue as in the description, but would also be solved by the change I described.

For this case you have a workaround: set the sample.handler.call.foo span kind to SpanKindServer. IIANM, it will otherwise be SpanKindInternal.

ymotongpoo commented 1 year ago

Thank you for the workaround, @axw. However, neither of SpanKind worked and still all traces are recognized as transactions.

Just in case, I changed the SpanKind of sample.foo as well, but it didn't work. Here's the current code that is instrumented with the workaround.

func handler(w http.ResponseWriter, r *http.Request) {
    q := r.URL.Query()
    c := q.Get("c")
    n, err := strconv.Atoi(c)
    if err != nil {
        n = 2
    }
    if n > 8 {
        n = 8
    }
    ctx, span := otel.Tracer("handler").Start(
        r.Context(),
        "sample.handler.call.foo",
        trace.WithSpanKind(trace.SpanKindInternal),
    )
    foo(ctx, n)
    span.End()
    w.WriteHeader(http.StatusOK)
    fmt.Fprintf(w, "requested: %v, executed: %v", c, n)
}

func foo(ctx context.Context, n int) {
    if n == 0 {
        return
    }
    ctx, span := otel.Tracer("handler").Start(
        ctx,
        "sample.foo",
        trace.WithSpanKind(trace.SpanKindInternal),
    )
    span.SetAttributes(attribute.Key("iteration").Int(n))
    time.Sleep(interval)
    span.AddEvent(fmt.Sprintf("mid-interval-%v", n))
    time.Sleep(interval)
    foo(ctx, n-1)
    span.End()
}
axw commented 1 year ago

@ymotongpoo sorry, I missed something in your original comment and went off track a bit. I think what you're seeing here is related to Cloud Run injecting a traceparent header. First of all I'll try to clarify some things, and correct some comments I made above.

Transactions are a special kind of span that have additional attributes associated with them. They describe an event captured by an Elastic APM agent instrumenting a service. You can think of transactions as the highest level of work you’re measuring within a service. As an example, a transaction might be a:

  • Request to your server
  • ...

I guess you already know this, but just so we're on the same page:

https://opentelemetry.io/docs/reference/specification/trace/api/#spankind says:

SERVER Indicates that the span covers server-side handling of a synchronous RPC or other remote request. This span is often the child of a remote CLIENT span that was expected to wait for a response. ... INTERNAL Default value. Indicates that the span represents an internal operation within an application, as opposed to an operations with remote parents or children.

I think I misunderstood your trace before, and thought that sample.handler.call.foo was meant to represent the incoming HTTP request handler. After looking at your screenshots again, I suppose that you are also using otlphttp to wrap your handler, and that is the sample.handler transaction? In that case, sample.handler.call.foo and sample.foo should have the internal span kind, and you don't need to explicitly use WithSpanKind -- sorry about that.

Going back to your initial comment, I overlooked something:

I tried this with a small snippet on v8.5.1 and all traces are recorded as transactions, not traces.

The Traces page is simply listing root transactions, by looking for any transactions without a parent ID. In your example we should expect that sample.handler is listed on the Traces page. Given that it doesn't show up, that implies sample.handler does have a parent ID. You could check that by clicking the sample.handler row and looking for a parent.id field under the metadata.

Assuming that's the case, I think you're hitting something similar to this issue: https://github.com/elastic/apm/issues/609. i.e. I expect Cloud Run is injecting a traceparent header, and your instrumentation is continuing the trace with that trace context. It therefore doesn't think the server transaction is the root, and doesn't show up under Traces. You could confirm this by passing trace.WithNewRoot to the otlphttp instrumentation.

ymotongpoo commented 1 year ago

@axw Thank you for the detailed answer.

i.e. I expect Cloud Run is injecting a traceparent header, and your instrumentation is continuing the trace with that trace context.

I realized I missed the fact that Cloud Run injects the traceparent automatically.

You could confirm this by passing trace.WithNewRoot to the otlphttp instrumentation.

As you suggested, I tried creating span with trace.WithNewRoot() and now I see sample.handler in the list. (I also tried using trace.WithNewRoot() from sample.handler.call.foo and it was recognized in the trace list as well)

image

The snippet was:

    otelHandler := otelhttp.NewHandler(
        http.HandlerFunc(handler),
        "sample.handler",
        otelhttp.WithSpanOptions(trace.WithNewRoot()),
    )

Given above, as you pointed out, my case seems about https://github.com/elastic/apm/issues/609.