Azure / azure-functions-dotnet-worker

Azure Functions out-of-process .NET language worker
MIT License
419 stars 181 forks source link

Tracing is now very broken in Azure Functions #2733

Open martinjt opened 3 days ago

martinjt commented 3 days ago

Description

I'm not entirely sure what's changed, but I suspect this may have been a host change, but I'll list out what I'm saying based on multiple customer issues we've had.

There is now an empty Activity present in the execution path. This means that whenever a user adds tracing, this new Activity is the parent, however, it's not something that you can listen to since it does not have a source name.

This leads to the new ActivitySource named Microsoft.Azure.Functions.Worker, this does take over the Blank Activity when you listen to the source, however, it uses the TraceContext from the FunctionContext as it's parent, which doesn't use the inbound traceparent from the HttpRequest headers as it's parent. This means that your span has a parent which you're unable to access.

image That is what a trace looks like right now with the OOTB setup.

        services.AddOpenTelemetry()
            .ConfigureResource(resource => resource.AddService("practical-otel-azure-functions"))
            .UseFunctionsWorkerDefaults()
            .WithTracing(tracingBuilder =>
            {
                tracingBuilder
                    .SetSampler<AlwaysOnSampler>()
                    .AddSource("Microsoft.Azure.Functions.Worker")
                    .AddHttpClientInstrumentation()
                    .AddAspNetCoreInstrumentation()
                    .AddOtlpExporter();
            });

Aspire is rendering it in a semi-ok way, that's compensating for the fact that the data itself on those spans isn't right (missing parents, and missing root spans).

I think the big change is that, previously, there was not TraceContext on the function context, and therefore there was no active Activity. Since the upgrade, this has changed and now, there is an Activity which has a remote parent that there is no way to get access to (there's supposed to be a way to access it in Flex Consumption).

To be clear, this is a breaking change. Adding an Activity that is not trackable, and overriding the inbound traceparent causes functions to be untraceable due to the context propagation issues.

The AspNetCore span completely baffles me, since it's got a different parent to the FunctionsWorker span, and also somehow is outside of the hierarchy.

I'd love to know what caused this, and whether there is a plan to fix functions tracing for users who aren't upgrading to Flex Consumption.

Steps to reproduce

Create a new HTTP Trigger function Add OpenTelemetry Send the data to somewhere you can see the raw data. Run it locally Hit the endpoint

Look at the telemetry and you'll see that it doesn't follow a linear hierarchy, and there is no span that doesn't have a parent (aka a root span)

RohitRanjanMS commented 2 days ago

Thanks @martinjt for reaching out, I will provide an update by tomorrow.

jkessenich commented 1 day ago

Apart from this, when you have a HTTP triggered function and send a request to it, a new unparented span (a.k.a. new trace) is also being emitted from the AspNetCore instrumentation, so you have this blank Invoke span and this new trace. Starting a new active span in does not make this the parent.

ScopeSpans #1 ScopeSpans SchemaURL: InstrumentationScope Microsoft.Azure.Functions.Worker 1.0.0.0 Span #0 Trace ID : 63b5dd200c01e184afdf5bafe61c90eb Parent ID : 055abe33f3e3a70e ID : ff98f9123b333119 Name : Invoke Kind : Server Start time : 2024-09-26 16:56:46.7240646 +0000 UTC End time : 2024-09-26 16:56:46.9083439 +0000 UTC Status code : Unset Status message :

ScopeSpans #2 ScopeSpans SchemaURL: InstrumentationScope Microsoft.AspNetCore Span #0 Trace ID : 1e5d60beeafd93c433bb594460c600ff Parent ID : ID : 58a419ff1ceb4d7a Name : POST api/MyApi Kind : Server Start time : 2024-09-26 16:56:46.7395207 +0000 UTC End time : 2024-09-26 16:56:46.9209312 +0000 UTC Status code : Unset Status message :

RohitRanjanMS commented 16 hours ago

@martinjt , you are right. We created this internal activity to provide context since the current activity was null. We intentionally did not use an ActivitySource so it cannot be tracked. I'll look into it and update this thread tomorrow.

image
martinjt commented 16 hours ago

This has caused a breaking change @RohitRanjanMS, I can understand following rhe pattern than ASP.NET Core uses here (the legacy activity without source), however, there is no check for an existing activity here, which if there is, it makes no sense to create a new one.

Further, the host now creates an activity (even in none flex consumption), which means we have no ability to know if this was an entrypoint, and therefore can't override to make a root span.

I get that Aspire isn't flagging that there are missing spans in the trace, that's a separate issue since it should make developers aware there is an issue in their tracing.

The reason this is important is that the lineage of spans is important for searching and also display. Where you're at right now with it, there is no way to render a trace view with the right lineage where there is a function app involved.

In my opinion, there needs to be 2 high priority fixes.

  1. Check for an existing activity and don't create the invisible one if that's the case.
  2. Change the host such that it doesn't create spans by default.

Those 2 will fix all the issues this change has caused.