Closed NickDarvey closed 2 years ago
Great write-up!
few comments:
there will be changes in Activity to support W3C distributed tracing, however, they will be backward-compatible and does not block implementation.
Using W3C headers right away does not make sense as they require very specific format and Activity does not generate this format yet.
AppInsights leverages Activity and DiagnosticSource for correlation and telemetry collection.
DiagnosticSource provides two different levels of details and instrumentation should enable them:
I can help with any details and provide examples, just let me know.
We are definitely interested in providing rich diagnostics to Orleans users, especially when they cross the boundaries to other systems. I think we should tentatively consider this for Orleans 3.0.0, and better understand the developer experience and performance impact of this.
From reading various threads/docs on DiagnosticSource
& EventSource
, it seems that we should be using both within Orleans:
DiagnosticSource
along a requestDiagnosticSource
events (RequestStart, etc) for telemetry and to ensure a good distributed tracing storyEventSource
events for perf investigations & diagnostic tracingEventCounter
to replace our current countersDoes that sound right to you, @NickDarvey & @lmolkova?
This is something which we wont be able to address in the immediate future, but as Sergey said above, we should consider this for 3.0.
I'd defer to @lmolkova, but that sounds sensible to me.
You might have already looked in to this but I saw that there is a bridge from DiagnosticSource to EventSource which might mean you can do all the instrumentation with DiagnosticSource.
@ReubenBond I agree with your plan. DiagnosticSource as an instrument for distributed tracing is not the best choice for verbose internal diagnostics.
You may still use DiagnosticSource for verbose perf and diagnostics logging by using two different DiagnosticSources for distributed tracing and internal debugging, but EventSoiurce is more optimal for the latter. It also provides out-of-process monitoring with ETW and LTTng.
DiagnosticSource to EventSource bridge may be hard to use as it requires some special configuration and even defines a language to serialize DS payloads in EventSource events.
I'm still trying to wrap my head around DiagnosticSource
and DiagnosticListener
, but my thoughts were along a very simple practical implementation:
Web API project --> Orleans Silo (grain) --> external service via http
I noticed that by default, ILogger
contains a plethora of request-related information, including ConnectionId
, RequestId
, etc. I want to flow this information through the system towards the http client. I hope I haven't got it all backwards (?), but this is how I implemented it until better guidance is available. I understand you have this planned for later Orleans release, but I need something I can use now.
HttpContext.TraceIdentifier
is available in the Web API pipeline, I created a simple middleware to set the RequestContext
with specific key to that value:
app.Use(async (ctx, task) => {
RequestContext.Set(RequestContextKey.RequestId, ctx.TraceIdentifier);
await task.Invoke();
});
ILogger<T>
abstractions throughout the code), I thought about supplying this information in as general way as possible. I got inspiration from AspNetCore, where they use ILogger.BeginScope
to add information to a logging scope. Since I'm already using an incoming grain call filter to catch and log exceptions, I added it there:
var requestId = RequestContext.Get(RequestContextKey.RequestId) as string;
using (var scope = _logger.BeginScope(new Dictionary<string, object> { { "RequestId", requestId}})) {
await context.Invoke();
}
RequestId
so that it gets forwarded to other endpoints automatically.This approach gets me the RequestId
from the web api, to the Orleans silo in "as ambient" way as I could find, without passing the RequestId
all over the place.
Perhaps I got it backwards? Thoughts?
@skyflyer
- The "httpclient" part
There're 2 ways for HttpClient header injecting I know:
a) The DiagnosticsHandler
in HttpClient produces several diagnostic events, so you can fetch the HttpRequestMessage when you subscribe related events, such as "System.Net.Http.HttpRequestOut.Start", and inject RequestId to message headers.
b) HttpClient provides a "middleware-like" way to configure Http request-response handling (much like Orleans' OutgoingGrainCallFilter) - you can add a custom handler to the client, and change request's header in that handler. like this:
public class TracingHandler: DelegatingHandler
{
public override Task<HttpResponseMessage> SendAsync(HttpRequestMessage message, CancellationToken token)
{
// change headers
base.SendAsync(message, token);
}
}
a is the more universal option - it uses the .net core's diagnostic pattern, but it's sort of complex, you need to start a task/thread to subscribe and handle events, and it's not much configurable. I prefer and current using option b, with the help of the library HttpClientFactory, it's easy to configure different tracing policy for different HttpClients
@csyszf, thanks for the input. I haven't work with HttpClientFactory
(yet), looks interesting.
With my previous post, I was trying to access the same "trace identifier" that would be used "across different microservices". Given that ctx.TraceIdentifier
was the only such parameter available to me at the time, I figured that would be it, but I later noticed that Application Insights does not use this value, but rather computes its own Operation Id
in the form |61e9be9932679f4d8c3f775164d629d1.5173a375_
where TraceIdentifier
value is in the form 0HLK6O4A3D8CC:00000001
.
What is (or should be) the correct way to "correlate" all "operations" under the same (parent) id? The (simple) goal is to have one (parent) operation id available across different services -- and since we're logging everything into Application Insights (integration with ASP.NET Core + Serilog AI sink for other logging requirements), that would be really nice.
@csyszf, thanks for the input. I haven't work with
HttpClientFactory
(yet), looks interesting.With my previous post, I was trying to access the same "trace identifier" that would be used "across different microservices". Given that
ctx.TraceIdentifier
was the only such parameter available to me at the time, I figured that would be it, but I later noticed that Application Insights does not use this value, but rather computes its ownOperation Id
in the form|61e9be9932679f4d8c3f775164d629d1.5173a375_
whereTraceIdentifier
value is in the form0HLK6O4A3D8CC:00000001
.What is (or should be) the correct way to "correlate" all "operations" under the same (parent) id? The (simple) goal is to have one (parent) operation id available across different services -- and since we're logging everything into Application Insights (integration with ASP.NET Core + Serilog AI sink for other logging requirements), that would be really nice.
I'm not familiar with Application Insights, but |61e9be9932679f4d8c3f775164d629d1.5173a375_
looks like an Activity Id, an it should be avaliable in most areas which adapt System.Diagnostics
.
AspNetCore start several activities like RequestStart, so you can get this activity id if you subscribe it - and it will auto set the new started activity's parent to the current activity(with stored by AsyncLocal)
Any news?
Yes, we are actively working on this
I'll cross-reference also https://github.com/aspnet/AspNetCore/issues/5914 as those following this might be interested to see where the core framework is going and have potentially similar scenarios to the branched issues.
<Edit: Links of interest by following from ASP.NET Core to https://w3c.github.io/trace-context/ and from there to https://github.com/open-telemetry/opentelemetry-dotnet/ (and https://github.com/aspnet/AspNetCore/issues/8924).
Is this issue going to make the 3.0.0 release?
What's the progress of this issue?
The ASP.NET Core team posted a very enlightening blog post showing how the new Activity API works:
Also potentially interesting example from Firefox DevTools: Server-Timing HTTP header (also https://twitter.com/FirefoxDevTools/status/1201914691863244800).
Maybe interesting resource: https://opentelemetry.io/. Also cross-referencing to https://github.com/dotnet/orleans/issues/6178 that asks guidance for production runs.
I am working for the .NET agent of Apache Skywalking APM and want to support orleans. Like #6213 said. Can I get any help?
@ElderJames I responded on that thread
I have a branch where I was testing integration with Jaeger via OpenTracing and DiagnosticSource. Longer term, OpenTelemetry is probably the way to go but I'm not sure if they had any nuget packages at the time. I plan to complete the integration to have a better end-to-end, pay-for-what-you-use solution here.
FYI, saw some related work in https://github.com/dotnet/runtime/issues/31373
The proposal here is mainly targetting closing the gap between the .NET Libraries and OpenTelemetry (OT) tracing APIs (mainly the OT Span class).
and its linked issue https://github.com/dotnet/runtime/issues/31372
This issue is tracking the work need to be done for supporting the distributed context scenarios for ASP.NET and OpenTelemetry.
@ReubenBond Given this didn't make 3.0 -- is this on track for 4.0 ?
I link https://petabridge.com/blog/state-opentelemetry-dotnet/ and also https://github.com/dotnet/runtime/issues/31862 (in spirit of @NickDarvey) to provide context.
+1 for interest in seeing how opentelemetry integration pans out.
@ReubenBond, is it still actual to implement this feature? I added a pull request with Activity and Diagnostic listener support and it does not depend on any internal things, so everyone can use the idea. Thanks.
Goal
Distributed tracing that works out-of-the-box with other Microsoft libraries like System.Net (HttpClient), EntityFramework, Event Hubs SDK and Service Bus SDK.
Value
I want to instrument my Orleans application in a way I can correlate events so I can trace the effect of any one request across an entire system, across service boundaries. As an example, by adding a trace identifier to a HTTP request, I want to see that it reaches my gateway (in ASP.NET Core), each grain it hits (in Orleans), the messages it creates (on an Event Hub) and the database transaction it executes (via Entity Framework).
Background
I understand Orleans provides an ActivityId however it doesn't cover distributed tracing out-of-the-box (i.e. flowing that ActivityId to other systems).
I looked at how other Microsoft libraries did this (e.g. System.Net, Event Hubs SDK) and they seem to be converging on System.Diagnostics.Activity which is available as an AsyncLocal container (Activity.Current) for carrying diagnostic context. (More information about Activity is in the user guide.) Unfortunately, it's not something that is used with Microsoft.Extension.Logging but rather introduces a (yet another) new diagnostic API: System.Diagnostics.DiagnosticSource. DiagnosticSource is distinct from other logging systems in that it's to be consumed in-process and sources normally provide rich payloads. (More information about DiagnosticSource is in the user guide.)
Libraries using Activity which make or receive external calls often implement some variation of this correlation protocol. This means that the Activity is emitted on outbound requests and resumed on inbound requests. It looks like there is work underway to standardize (the header names) based on the W3C spec for distributed tracing. However, Orleans should be able to be instrumented now without being affected by these changes. (Client-to-Orleans calls are via an Orleans-provided library so I don't think a change in header names would be a breaking change.)
(There (hopefully?) won't be any change to Activity itself, depending on this discussion.)Edit: See @lmolkova's comment for details on the impact of W3C distributed tracing.Implementation
There is official guidance for Activity and DiagnosticSource but newer documentation is work-in-progress and available here.
Based on the experience of people who have consumed DiagnosticSources before, it looks like a implementation similar to Entity Framework, where the events are typed, would be preferable.
My insight only goes as deep as poking around over the last week. I've come across people who seem very involved in the .NET and distributed tracing space who might be able to chime in and enlighten us about the right way of approaching this, such as @SergeyKanzhelev, @lmolkova and @cwe1ss.
Minimal
A minimal implementation would allow the Activity to flow across service boundaries with a DiagnosticSource for grain requests. Log providers (ILoggerProvider) could optionally listen to this DiagnosticSource (or all DiagnosticSources) and enrich the logs with the Activity context.
Extended
An extended implementation might go so far as to replace ILogger use in Orleans with DiagnosticSources and provide adapters back to ILogger (and to ITelemetryProvider for performance counters). I think the value of going this far would be to offer richer diagnostics to people who want it (e.g. someone who wants to build a profiler) and a single diagnostics API for Orleans. (The minimal implementation would mean Orleans would offer both ILogger and DiagnosticSource which is ugh.)