Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.
MIT License
2.05k stars 1.19k forks source link

Distributed tracing issues in Azure Function App - No Correlation #24116

Open He-Wolf opened 1 year ago

He-Wolf commented 1 year ago

I created a Azure Function App with two functions:

[HttpTrigger] http-triggered function that sends an EventGrid event in CloudEvent schema to a topic [CloudEventsTrigger] http-triggered function that subscribes to the EventGrid topic using CloudEvent schema (webhook) After triggering the first function, I can see in App Insights that the two functions have different Operation IDs and the consumer app's traces are not linked to the producer app.

What I understood from the devblog post is that Azure Event Grid client libraries support distributed tracing for the CloudEvents schema out of the box.

I created a repo for my simple project. It also contains a deployment script so that you can test it yourself. I summarized my finding in the README.md. I think all the info needed for investigation can be found in the project.

Let me know if you need additional data.

xirzec commented 1 year ago

/cc @ellismg

ellismg commented 1 year ago

Thanks for the helpful repo here, @He-Wolf, I will take a look!

haodeon commented 1 year ago

I'm not an expert in this area but I have an interest because we are seeing similar issues with telemetry in our Typescript function apps.

I created a codespace from @He-Wolf example project and I could reproduce the issue.

I could get things to work a bit better by running

npm i @opentelemetry/instrumentation
npm i @azure/opentelemetry-instrumentation-azure-sdk

Once these packages were installed, I re-ran the sample project and noted things improved but it was still producing different traceparent IDs.

The first change is application insights captures event grid as a dependency. The dependencies also appear in the HttpTrigger timeline.

image

The other difference is traceparent can now be found in the arriving event in CloudEvents schema.

image

This isn't the original traceparent id though, which you can see in the first picture.

You can also find this traceparent id in extensionAttributes of the event trace. The first time I ran this, extensionAttributes does not appear.

Despite this improvement the two operations do not link in End-to-end transaction details as it appears in the devblog post.

haodeon commented 1 year ago

I've discovered another interesting aspect to this. In response above I show traceparent in the arriving event and in extensionAttributes.

At first glance this traceparent is not related to the HttpTrigger function app.

However, I discovered the parentid of the traceparent is the same as the id of the EventGridPublisherClient.send POST as you can see in the below picture.

image

What I don't understand is why EventGrid changes the traceid portion. Is this causing the two traces to not be linked? Or do we need to manually link the traces like it explains in the devblog? If manual linking is required, how is this done in JavaScript?

haodeon commented 1 year ago

After much research I managed to get the CloudEventsTrigger function to link with the HttpTrigger function

image

In addition to adding the missing packages I also had to do two more changes

The first is setting a custom delivery property on the eventgrid subscription --delivery-attribute-mapping traceparent dynamic traceparent. Event Grid takes the traceparent-id from the CloudEvents extension and inserts it back into the webhook http header. The ApplicationInsights-node.js library is the able to pick up the id from the http header.

The second change is manually injecting traceparent into the event. This is a workaround.

extensionAttributes: {
  traceparent: context.traceContext.traceparent
},

It is needed because EventGridPublisherClient changes the trace-id portion of traceparent. I still don't understand why it is doing this. Unfortunately, manually adding the extensionAttribute sets the HttpTrigger function as parent instead of the POST dependency.

xirzec commented 1 year ago

@ellismg @mpodwysocki is there anything we can do on the SDK side here?

haodeon commented 1 year ago

After much research into how azure/core-tracing works, I eventually came across this note in the Azure Core Client libraries folder.

From that note I eventually figured out you can extract a tracingContext from the Azure Function context object using opentelemetry propagation.extract. Then as per the note pass it as an OperationOption to EventGridPublisherClient.send.

I also removed the extensionAttributes workaround in my previous update. Now the timeline looks perfect.

image

These steps are documented in the otel docs as manual trace context propagation.

The issue is the Event Grid client only provides a single send operation using withSpan for tracing. So auto-instrumentation will always produce a new tracingContext with every send. Which explains the behaviour I mentioned in the past with trace-id being changed.

I believe there needs to be some improvement in the Event Grid client or Application Insights Node SDK to support auto-instrumentation with a proper propagated context.