Open He-Wolf opened 1 year ago
/cc @ellismg
Thanks for the helpful repo here, @He-Wolf, I will take a look!
I'm not an expert in this area but I have an interest because we are seeing similar issues with telemetry in our Typescript function apps.
I created a codespace from @He-Wolf example project and I could reproduce the issue.
I could get things to work a bit better by running
npm i @opentelemetry/instrumentation
npm i @azure/opentelemetry-instrumentation-azure-sdk
Once these packages were installed, I re-ran the sample project and noted things improved but it was still producing different traceparent IDs.
The first change is application insights captures event grid as a dependency. The dependencies also appear in the HttpTrigger timeline.
The other difference is traceparent can now be found in the arriving event in CloudEvents schema.
This isn't the original traceparent id though, which you can see in the first picture.
You can also find this traceparent id in extensionAttributes of the event trace. The first time I ran this, extensionAttributes does not appear.
Despite this improvement the two operations do not link in End-to-end transaction details as it appears in the devblog post.
I've discovered another interesting aspect to this. In response above I show traceparent in the arriving event and in extensionAttributes.
At first glance this traceparent is not related to the HttpTrigger function app.
However, I discovered the parentid of the traceparent is the same as the id of the EventGridPublisherClient.send POST as you can see in the below picture.
What I don't understand is why EventGrid changes the traceid portion. Is this causing the two traces to not be linked? Or do we need to manually link the traces like it explains in the devblog? If manual linking is required, how is this done in JavaScript?
After much research I managed to get the CloudEventsTrigger function to link with the HttpTrigger function
In addition to adding the missing packages I also had to do two more changes
The first is setting a custom delivery property on the eventgrid subscription --delivery-attribute-mapping traceparent dynamic traceparent
. Event Grid takes the traceparent-id from the CloudEvents extension and inserts it back into the webhook http header. The ApplicationInsights-node.js library is the able to pick up the id from the http header.
The second change is manually injecting traceparent into the event. This is a workaround.
extensionAttributes: {
traceparent: context.traceContext.traceparent
},
It is needed because EventGridPublisherClient changes the trace-id portion of traceparent. I still don't understand why it is doing this. Unfortunately, manually adding the extensionAttribute sets the HttpTrigger function as parent instead of the POST dependency.
@ellismg @mpodwysocki is there anything we can do on the SDK side here?
After much research into how azure/core-tracing works, I eventually came across this note in the Azure Core Client libraries folder.
From that note I eventually figured out you can extract a tracingContext
from the Azure Function context object using opentelemetry propagation.extract
.
Then as per the note pass it as an OperationOption
to EventGridPublisherClient.send
.
I also removed the extensionAttributes workaround in my previous update. Now the timeline looks perfect.
These steps are documented in the otel docs as manual trace context propagation.
The issue is the Event Grid client only provides a single send operation using withSpan
for tracing. So auto-instrumentation will always produce a new tracingContext
with every send. Which explains the behaviour I mentioned in the past with trace-id being changed.
I believe there needs to be some improvement in the Event Grid client or Application Insights Node SDK to support auto-instrumentation with a proper propagated context.
I created a Azure Function App with two functions:
[HttpTrigger] http-triggered function that sends an EventGrid event in CloudEvent schema to a topic [CloudEventsTrigger] http-triggered function that subscribes to the EventGrid topic using CloudEvent schema (webhook) After triggering the first function, I can see in App Insights that the two functions have different Operation IDs and the consumer app's traces are not linked to the producer app.
What I understood from the devblog post is that Azure Event Grid client libraries support distributed tracing for the CloudEvents schema out of the box.
I created a repo for my simple project. It also contains a deployment script so that you can test it yourself. I summarized my finding in the README.md. I think all the info needed for investigation can be found in the project.
Let me know if you need additional data.