Enable sample based tracing to track issues in high performance/load scenarios

Eventsource can provide a lot of useful information for diagnosing problems. It works well when you can isolate the machine and application so that it is the only source of trace messages.

If however you are trying to analyze performance or errors in a high load environment, then turning on diagnostic logging will produce a fire-hose of information, and the production of that information will have an impact on the performance (the schrodinger's cat problem).

What you ideally want to be able to do is to sample the tracing, but sampling should be on a "request" basis. You want to be able to mark a request as sampled, and then collect detailed logs, just for those requests. This needs to flow through the async call stacks and trace all work that is done on its behalf.

As an illustration, we have been trying to do better analysis of YARP performance, to better understand where time is going in processing requests, and why we are not getting as good results as some other products. In the test lab, we are running around 200k req/s and each request is taking ~800ms. If we turn on tracing, it will have an adverse effect on perf. What we want to be able to do is set a flag on maybe 0.1% of requests and get detailed logging for them.

@MihaZupan has done an experiment where he is using Activity.Record as the bit to indicate the request should be sampled, and a custom build of System.Net. He came up with the following results:

results

Column	Definition
win-noevents-sampled	regular YARP with no EventSource enabled
win-all-net-verbose-base	enable all System.Net.* and InternalDiagnostics EventSources without any changes to the instrumentation
win-all-net-verbose-sampled	enable all System.Net.* and InternalDiagnostics EventSources but with a patched build of runtime that does Activity.Current checks
win-sampled-enabled50percent	actively choose to sample 50% of requests (you can see the perf hit is pretty much 50% compared to enabling everything without changes)

What this says is that actually checking for Log.IsEnabled() && (Activity.Current?.Recorded ?? false) is not having an impact on performance, and so you can make sample based event profiling pay for play.

This makes me think that we should look to have a mechanism build into EventSource to be able to set a log level for activities with the record flag, so that you can easily collect sample-based traces. What is cool about this is that you can collect very detailed traces, including verbose level detail, for a small number of requests, so collect really detailed information, but while having minimal effect on the overall performance of the system. This kind of capability is required to be able to diagnose problems against the running infrastructure without affecting other customers, or overall system performance. If we can enable this for the BCL and other libraries then customers can get a much better picture of what is going on without having to rework the application to detect what is going on.

You can then either decide to mark requests at the beginning of processing to be sampled, or use distributed tracing to be able to flow that flag through a graph of services.

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti See info in area-owners.md if you want to be subscribed.

Issue Details

Eventsource can provide a lot of useful information for diagnosing problems. It works well when you can isolate the machine and application so that it is the only source of trace messages. If however you are trying to analyze performance or errors in a high load environment, then turning on diagnostic logging will produce a fire-hose of information, and the production of that information will have an impact on the performance (the schrodinger's cat problem). What you ideally want to be able to do is to sample the tracing, but sampling should be on a "request" basis. You want to be able to mark a request as sampled, and then collect detailed logs, just for those requests. This needs to flow through the async call stacks and trace all work that is done on its behalf. As an illustration, we have been trying to do better analysis of YARP performance, to better understand where time is going in processing requests, and why we are not getting as good results as some other products. In the test lab, we are running around 200k req/s and each request is taking ~800ms. If we turn on tracing, it will have an adverse effect on perf. What we want to be able to do is set a flag on maybe 0.1% of requests and get detailed logging for them. @MihaZupan has done an experiment where he is using Activity.Record as the bit to indicate the request should be sampled, and a custom build of System.Net. He came up with the following results: ![results](https://raw.githubusercontent.com/samsp-msft/scrap/main/MicrosoftTeams-image.png) | Column | Definition | ----- | win-noevents-sampled | regular YARP with no EventSource enabled | | win-all-net-verbose-base | enable all System.Net.* and InternalDiagnostics EventSources without any changes to the instrumentation | | win-all-net-verbose-sampled | enable all System.Net.* and InternalDiagnostics EventSources but with a patched build of runtime that does Activity.Current checks | | win-sampled-enabled50percent | actively choose to sample 50% of requests (you can see the perf hit is pretty much 50% compared to enabling everything without changes) | What this says is that actually checking for `Log.IsEnabled() && (Activity.Current?.Recorded ?? false)` is not having an impact on performance. This makes me think that we should look to have a mechanism build into EventSource to be able to set a log level for activities with the record flag, so that you can easily collect sample-based traces. What is cool about this is that you can collect very detailed traces, including verbose level detail, for a small number of requests, so collect really detailed information, but while having minimal effect on the overall performance of the system. This kind of capability is required to be able to diagnose problems against the running infrastructure without affecting other customers, or overall system performance. If we can enable this for the BCL and other libraries then customers can get a much better picture of what is going on without having to rework the application to detect what is going on. You can then either decide to mark requests at the beginning of processing to be sampled, or use distributed tracing to be able to flow that flag through a graph of services.

Author:	samsp-msft
Assignees:	-
Labels:	`area-System.Diagnostics.Tracing`, `untriaged`
Milestone:	-

@samsp-msft, I think your idea here is a good one. It's actually been attempted before - around 2014. There were a couple of issues that ended up causing us not to ship it.

It was not clear how we were going toggle the sampling bit - specifically how could we toggle it early enough to capture everything. This was back when we were dealing with IIS and .NET Framework - the landscape is certainly different now.
It was not clear which requests should be sampled. A random sampling is fine as far as I'm concerned, but it would also be nice to be able to turn on sampling for anomalies, such as long requests. It might just be too late though.
We did not have the right events in place to do async causality.

Attempting this again seems very reasonable - I suspect the async causality may still be an issue, though probably more because the events are very verbose and can have a significant impact on cost. Out of curiosity, did the experiment also enable the TPL events, or was this just for the System.Net.Http events?

In terms of catching everything, are you talking about before .NET "gets its hands on" the request. That story is pretty different now with Kestrel and ASP.NET Core. There are ways to have access very early in the request processing. This is done today for TLS and DoS prevention.
That is where Activity.Record comes in - its the property defined for this purpose for Distributed Tracing. So it can be activated by sending in a request with that header as part of Open Telemetry. I would also expect to have app code at the early part of the request chain be able to decide that it wants to sample say 0.1% of requests and use a counter/randomizer to pick which requests it sets the bit for.
Is that now in place with the activity work?

AFAIK the experiment was just with System.Net as that is something that @MihaZupan is intimately familiar with. I would love your expertize on how we can make this a real part of the product. In terms of test scenarios, if it can help us with YARP performance, then I think it will have proven its worth.

Out of curiosity, did the experiment also enable the TPL events, or was this just for the System.Net.Http events?

It was everything in *System.Net.*, so in the YARP benchmark that means:

System.Net.Http
System.Net.NameResolution
System.Net.Security
System.Net.Sockets
Private.InternalDiagnostics.System.Net.Http
Private.InternalDiagnostics.System.Net.NameResolution,
Private.InternalDiagnostics.System.Net.Security
Private.InternalDiagnostics.System.Net.Sockets

The System.Net.* sources are designed to be efficient, and you can turn all of them on with minimal performance impact. The InternalDiagnostics are there to log "everything" that may be useful for debugging (including buffers), and normally come with a significant performance overhead.

I suspect the async causality may still be an issue

Brian for your concern around async causality are you worried about:

Flowing the sampling information at runtime
Reconstructing the causal links between work happening on different threads in the final trace
Constructing async-aware stack traces for individual events
Something else?

For (1) we use AsyncLocal and distributed tracing already depends on us doing a good job of it. As far as I know there are no significant issues at this point. For (2) I don't think it is must-have to make the sampled logs useful. We can stamp events with Activity IDs which correlates them together per-request. In simple cases there won't be any parallelism intra-request so timestamps are sufficient to infer causation. In more complex cases that do use parallelism then someone might need to acquire a little domain knowledge if they need to infer causation from the events that got logged. For (3) David Mason is doing an experiment to see if we can construct those stacks at runtime without emitting any TPL events. Even if we couldn't I don't think we are any worse off than traces we collect today without TPL events. For example in the YARP case shown here you might not know why we are processing a particular request, or even how the request code makes call within itself, but I assume you could still get lots of interesting information about the latency distribution and identify what portions of request processing take the most time.

it would also be nice to be able to turn on sampling for anomalies, such as long requests. It might just be too late though.

In distributed tracing this is often called head-based and tail-based sampling. Head-based is when you decide up-front and tail-based is when you capture everything and decide after-the-fact what to keep. As a starting point I'd propose we stick to head-based because it is simpler and lower overhead. If we want to add a tail-based option that can be a future feature.

It was not clear how we were going toggle the sampling bit - the landscape is certainly different now

Yeah, as long as it is based on Activity.IsRecorded there is a well-defined mechanism to set that bit using the ActivityListener.Sample callback.

dotnet / runtime

Enable sample based tracing to track issues in high performance/load scenarios #77520