Azure / azure-webjobs-sdk

Azure WebJobs SDK
MIT License
739 stars 358 forks source link

Distributed tracing in Azure Functions #1731

Open lmolkova opened 6 years ago

lmolkova commented 6 years ago

I'll go through the planned changes (split by several PRs) and describe high-level design and expected outcome.

Here is a messy prototype for all changes: https://github.com/lmolkova/azure-webjobs-sdk/commit/22a7affce96f0c34363b3746d0af5797a1c1c593

I'm open to any feedback and suggestions and want to make sure we are on the same page before I start the implementation

Motivation

We want to enable distributed tracing for Azure Functions using ApplicationInsights users are able to track calls to and from the Azure Functions and correlate it to the rest of telemetry reported by other applications.

As a result, we want to enable experiences like End-to-end transaction viewer and Application Map for all users automagically.

Goal 1: supported auto-tracking (triggers only are tracked now)

ApplicationInsights can track HTTP, ServiceBus, EventHub, SQL calls automagically if configured so. It's a component responsibility to participate in distributed tracing, i.e. when new components (e.g. ServiceGrid) implement specific instrumentation, they just light up - requiring no changes from Azure Functions.

Enabling current tracking capabilities gives us following level of support:

[UPDATE] -v1 and all bindings that are not supported in v2 are removed [UPDATE] -added correlation column for clarity

Type Trigger Input Output Calls Tracking E2E Correlation
Blob Storage ✔ (Http) :x: Function-to-function only
Cosmos DB ✔ (Http - Azure Functions use default configuration, i.e. all communication is Http based) N/A
Event Grid N/A (trigger only) N/A
Event Hubs ✔ in single dispatch sceanrios, TBD in multiple dispatch scenarios
Http ✔ W3C and Request-Id standards
Microsoft Graph Excel tables ✔ (Http) N/A
Microsoft Graph OneDrive files ✔ (Http) N/A
Microsoft Graph Outlook email ✔ (Http) N/A
Microsoft Graph Events ✔ (Http) :x:
Microsoft Graph Auth tokens ✔ (Http) N/A
Mobile Apps ✔ (Http) N/A
Notification Hubs ✔ (Http) N/A
Queue storage ✔ (Http) :x: Function-to-function only
SendGrid :x: :x:
Service Bus
Table storage ✔ (Http) N/A
Timer N/A (trigger only) N/A
Twilio ✔ (Http) N/A

Goal 2: Correlation.

Correlation is achieved using .NET System.Diagnostics.Activity that holds all tracing context. Anyone can access static Activity.Current to get current context.

Each component is responsible to start a new Activity and propagate the context to downstream service (Azure Storage does not support it though). I.e. ServiceBus knows how to serialize Activity into the message.

Component is also responsible for extracting Activity from the wire. I.e. ServiceBus knows how to extract Activity from the messsage.

So, we will leverage this in Functions.

The Plan

PR1: Dependency Injection (optional, but highly desirable)

Motivation

Now, configuration of applicationInsights is done via DefaultTelemetryClientFactory.

With enabling dependency collection and correaltion this is going to grow and become complicated.

It's also hard to tune for users.

Another issue is that if AppInsights enables new features, Azure Functions would need to change the code to enable them as well.

Proposal

Leverage Dependency Injection to instantiate and control lifetime of AppInsights instances. Hide it beyond IHostBuilder.ConfigureApplicationInsights (Configure/Add/Use) extension. Eventually, ApplicationInsights should provide such extension for non-Web Hosts and maintain it (now it works in AspNetCore SDK for web hosts).

PR2: Tracking dependency calls

Proposal

As a result, users should be able to see all supported calls, and control verbosity for each category separately.

Host logs would not have any scope and would not be logged for now (which is easy to change).

Open questions:

PR3: Enable auto-requests collection (function executions) with parent context

Motivation

ServiceBus (EventHubs in future) processing are instrumented already as well as HTTP Asp.NET Core hosting. I.e. ApplicaitonInsights could track such calls automagically.

Since each component knows how to extract trace context and set it on the Activity, we want to leverage it.

So, there are 3 types of triggers

Proposal

Open questions:

PR4: HTTP extension and context extraction

Motivation

HTTP request tracking could be automagical, but unfortunately, we cannot configure the listener for it now - it's an internal ApplicationInsights thing - we will expose it. But we still want to get tracing context from the incoming requests and stamp it on the telemetry.

Proposal


Future work outside of Azure functions: will be done later at some point (not blocking current effort)

  1. Splitting ApplicationInsights AspNetCore SDK into several packages:

    • one that supports generic DI and logging, no HTTP/Web
    • another one that adds Web and ASP.NET Core support
    • make multi-ikey and http listener configuration public
  2. Efficient filtering for dependency telemetry (ApplicationInsights SDK)

  3. EventHub Processor instrumentation [EDITED - does not make sense]

lmolkova commented 6 years ago

@brettsam @paulbatum @fabiocav please review the plan and provide your comments. /cc @SergeyKanzhelev @zakimaksyutov

brettsam commented 6 years ago

Great writeup! Here's some comments:

Overall

If we're not going to backport this to v1.x, we should remove all those checkmarks so it's clear we're only doing this for v2.x.

PR1

I'm all for doing things the right way. Let's try to make this work. Some custom stuff we do today that I'd think we want to include:

PR2

looks good -- we can discuss the open questions separately

PR3

Some questions:

PR4

Is there something that the HttpTrigger can do to allow this happen automatically? Which part is preventing that?

lmolkova commented 6 years ago

@brettsam Thanks!

I updated the table to reflect v2 only.

PR1

I agree all current stuff (initializers, filters, SDK version) should be added into the DI and participate in TeletmeryConfiguration creation. Later, when AI SDK provides generic DI config, we'll still be able to adjust this config and add custom things

PR3

So these client libraries (EH, SB, etc) are responsible for setting Activity.Current when the message, etc, is received? Or is that something that WebJobs (or AI) will need to do?

Yes, SB and HTTP (and eventually EH) can extract Activity from the message and make it current.

SB do this already today - nothing is expected from WebJobs

EH can only extract, but not assign current, for EH WebJobs part would look like

var message = value.GetSingleEventTriggerInput(i);
var activity = message.Events.SingleOrDefault()?.ExtractActivity();
TriggeredFunctionData input = new TriggeredFunctionData
{
    ParentActivity = activity,
    TriggerValue = message
};
Task task = _parent._executor.TryExecuteAsync(input, _cts.Token);

ApplicaitonInsights logger would do following (instead of RequestTelemtery creation)

if (parentActivity != null && parentActivity != Activity.Current)
{
    stateValues[OperationContext] = _telemetryClient.StartOperation<RequestTelemetry>(parentActivity);
}

For Http, this functionality is enabled by specific listener we cannot use right now (see PR4 below). So in HttpTrigger we'd have similar code to extract Activity from HttpHeaders and pass it to the executor and logger.

Blob/queue trigger code would be similar to EH, except we'll assugn parent explicitly

Guid? parentId = await _causalityReader.GetWriterAsync(blob, cancellationToken);
Activity activity = new Activity("BlobQueueTriggerExecutor");
if (parentId != null)
    activity.SetParentId(parentId.ToString());

TriggeredFunctionData input = new TriggeredFunctionData
{
    ParentActivity = activity,
    TriggerValue = blob
};

return await registration.Executor.TryExecuteAsync(input, cancellationToken);

What do you mean by 'function serialization into the persistent queue"?

Using Activity instead of ParentId leads to changes in FunctionStartedMessage, FunctionStartedMessage, InstanceTableEntity. which are serialized into Azure Table and Queue. Is it correct it is used for logging only? I also expect it is somehow serialized and propagated to non .NET functions? I have not think about it yet, but we'll need to ensure correlation would work there as well.

PR4

Is there something that the HttpTrigger can do to allow this happen automatically? Which part is preventing that?

Instrumentation is only available in presence of the listener. ApplicationInsights ASP.NET Core SDK implements this listener, and it can be configured from the WebJobs (I was mistakenly thinking that is it not). But:

  1. it would require WebJobs.Logging.ApplicationInsights to depend on ApplicationInsights.AspNetCore SDK (that brings Microsoft.AspNetCore.Hosting and few more dependencies). Is it a problem?
  2. The instrumentation exists in Microsoft.AspNetCore.Hosting and it is not explicitly referenced. I can still see that Hosting is used with HttpTrigger, but I wonder how it is choosen and whether we can rely on this to be available. Maybe you can cast more light on how http is actually hosted?
  3. Asp.Net Core does not support W3C protocol yet (we are discussing it now, and may have support for it in 2.2). So the ASP.NET Core instrumentation would not solve everything and we'd need to support W3C on Trigger level now anyway

So all these concerns are solvable, but it seems re-using this auto-collection now may do more harm then good. I'd prefer if we configured it after AspNEt Core 2.2 is released and AppInsights AspNetCore SDK is split into seveal packages (so no unwanted dependencies introduced for WebJobs).

Assuming this, in HttpTrigger, we'd just parse headers and set the on the Activity and pass it to the executor. Executor will pass it to AppInsights and it knows what to do with the Activity.

brettsam commented 6 years ago

Note -- we moved this to Active Questions so it falls off of our "Untriaged" query. It's more of a "spec" and not really a direct issue.

riha commented 5 years ago

Sorry, but I don't really understand the status of things here - the request-id from an incoming http request is still not picked up and tracked, right? Any idea when this might be released?

brettsam commented 5 years ago

@lmolkova -- can you update this with the current status?

brhinescot commented 5 years ago

Do plans exists to add this functionality to OrchestrationTriggers? Right now an HttpTrigger picks up the Request-Id and sets up the Activity correctly. However, when subsequently calling an OrchestrationTrigger from a DurableOrchestrationClient the Activity.Current is a new, unrelated Activity in the OrchestrationTrigger function.

marknblueswitch commented 5 years ago

Are there any plans to make Azure Storage Queue auto trackable?

I've tried out a few things to enable end-to-end correlation when Azure Queues are involved, but couldn't find a way to make it work. Even if I manually serialize a ParentID as part of my Queue payload, I can't find a way to use it to set the Request telemetry ParentID for the Function that consumes the Queue message.

I guess what I'm asking is if there are plans to:

  1. Automagically enable end-to-end tracking for Azure Queue messages across different systems
  2. Allow the ability to set the ParentID manually in the Function telemetry, so I can serialize a ParentID in the queue payload and set it myself.

Thanks!

ssa3512 commented 3 years ago

Seconding @marknblueswitch

I guess what I'm asking is if there are plans to:

1. Automagically enable end-to-end tracking for Azure Queue messages across different systems

2. Allow the ability to set the ParentID manually in the Function telemetry, so I can serialize a ParentID in the queue payload and set it myself.

This is something that would be extremely useful - we are looking to use Storage Queues over Service Bus due to the simplicity (Service Bus has a lot of extra features we just don't need, and doesn't have any local development/emulator support) but the lack of distributed tracing is a real bummer. It would be really nice to provide some sort of mechanism to allow setting the ParentID from a property in the JSON serialized message body.

Are there any plans to support something like this or another mechanism of distributed E2E tracing with storage queues?

ssa3512 commented 3 years ago

@fabiocav @lmolkova is distributed tracing supposed to currently work for function to function queue messages? I ran across QueueCausalityManager which sets the $AzureWebJobsParentId value in the JSON being sent into the queue, but I don't see the app insights request associated with the second function execution being correlated. I was trying to see if adding $AzureWebJobsParentId with the operation ID of an external operation would also cause correlation, but it doesn't even appear to currently happen for function to function messages in the same app.

MuhamedSalihSeyedIbrahim commented 3 years ago

In queue storage & blob storage - Function-to-function only E2E logging means what? if it offer distributed log tracing and how to achieve it ?

dlebee commented 2 years ago

I created this repository to reproduce this issue.

https://github.com/dlebee/corrolation-function-servicebus

More and more library are trying to implement OpenTelemetry in C# which heavily depends on higher version of System.Diagnostics.DiagnosticSource

This is causing azure function service bus trigger correlation issues.


The following is what I think is causing the issue

<= 4.6.0


public partial class Activity
{
public static Activity Current
{
#if ALLOW_PARTIALLY_TRUSTED_CALLERS
[System.Security.SecuritySafeCriticalAttribute]
#endif
get
{
ObjectHandle activityHandle = (ObjectHandle)CallContext.LogicalGetData(FieldKey);
// Unwrap the Activity if it was set in the same AppDomain (as FieldKey is AppDomain-specific).
if (activityHandle != null)
{
return (Activity)activityHandle.Unwrap();
}
        return null;
    }
}

}


VS

> Since 4.7.1 >= or around there

```csharp
public partial class Activity
{
    private static readonly AsyncLocal<Activity> s_current = new AsyncLocal<Activity>();

    /// <summary>
    /// Gets or sets the current operation (Activity) for the current thread.  This flows 
    /// across async calls.
    /// </summary>
    public static Activity Current
    {
        get { return s_current.Value; }
        set
        {
            if (ValidateSetCurrent(value))
            {
                SetCurrent(value);
            }
        }
    }
}