System.Diagnostics.Activity out of proc sampling

CodeBlanch commented 5 months ago

We have DiagnosticSourceEventSource which enables listening to Activity \ traces out of process. The current design when there is a listener is to create all Activity \ spans and write everything to the EventSource. We feel for out of process monitoring of traces this will create a lot of overhead in the process being monitored.

The request here is to allow the listener to set some kind of sampling algorithm.

Here is the OTel spec for samplers: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#built-in-samplers

What we have today is essentially AlwaysOn.

One suggestion would be to allow the listener to set a ratio. If ratio is supplied we could sample according to these rules:

If there is a root and it was sampled, always create child spans.
If there is no root, always create a root span but only set the Recorded\sampled flag based on the TraceIdRatioBased algorithm using the provided ratio.

Always creating the root span is important for log correlation but we only need to create children if the root is sampled. This is how the OTel .NET SDK works today.

/cc @noahfalk @tarekgh @samsp-msft @rajkumar-rangaraj @cijothomas

dotnet-policy-service[bot] commented 5 months ago

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti See info in area-owners.md if you want to be subscribed.

julealgon commented 5 months ago

@CodeBlanch isn't there a way to achieve this without basically reimplementing abstractions that the OpenTelemetry packages have already implemented?

dotnet-policy-service[bot] commented 5 months ago

Tagging subscribers to this area: @dotnet/area-system-diagnostics-activity See info in area-owners.md if you want to be subscribed.

CodeBlanch commented 5 months ago

@julealgon Anything is possible! This issue is really to collect ideas. But my gut is no. For this out-of-proc monitoring scenario there is no OTel available. Some external utility (think dotnet-monitor) is run to extract telemetry from a running process which doesn't have the OTel SDK hooked up. We also don't want to use the profiler to inject modify/code so this works for AoT-compiled apps where that isn't possible.

julealgon commented 5 months ago

@julealgon Anything is possible! This issue is really to collect ideas. But my gut is no. For this out-of-proc monitoring scenario there is no OTel available. Some external utility (think dotnet-monitor) is run to extract telemetry from a running process which doesn't have the OTel SDK hooked up. We also don't want to use the profiler to inject modify/code so this works for AoT-compiled apps where that isn't possible.

What if we inverted the problem then?

If there are native .NET scenarios that require sampling, maybe the core System.Diagnostics libraries should contain the sampling abstractions (like they contain Activity, Meter, and other "OTEL-compatible" diagnostics classes) and the OpenTelemetry SDK could just leverage those instead of implementing them from scratch.

Otherwise, this will end up in a ton of potential duplication between .NET and OTEL.

Sure, the easy way out would be to just duplicate the concepts, but that feels like a bad solution long-term to me.

samsp-msft commented 5 months ago

@julealgon Anything is possible! This issue is really to collect ideas. But my gut is no. For this out-of-proc monitoring scenario there is no OTel available. Some external utility (think dotnet-monitor) is run to extract telemetry from a running process which doesn't have the OTel SDK hooked up. We also don't want to use the profiler to inject modify/code so this works for AoT-compiled apps where that isn't possible.

What if we inverted the problem then?

If there are native .NET scenarios that require sampling, maybe the core System.Diagnostics libraries should contain the sampling abstractions (like they contain Activity, Meter, and other "OTEL-compatible" diagnostics classes) and the OpenTelemetry SDK could just leverage those instead of implementing them from scratch.

Otherwise, this will end up in a ton of potential duplication between .NET and OTEL.

Sure, the easy way out would be to just duplicate the concepts, but that feels like a bad solution long-term to me.

This is essentially what we are talking about:

Add an API or possibly use the existing hook in ASP.NET core when a request comes in and picking the traceid/Recorded properties, to be able to specify a sampling algorithm.
Provide a default implementation of the sampling algorithm, that will sample based on a fixed percentage.
- It should probably use a hash of the traceid and compare that with the sample frequency to decide which requests should be sampled in and which are sampled out - and set the Recorded bit as applicable.
- The sample rate should be specified as a float for what percentage of requests have tracing enabled.
- The value can be set by code, and via IConfiguration - so it can be tweaked without adjusting the app code and set via an env variable
- The default should be 1.0f - so that all requests are included by default
Child activity creation will depend on whether the parent (which will be the root) activity has the Recorded bit set.

Whether to deprecate the sampling options in OTEL or have that map through to the runtime API is up for debate - what this probably maps to in the OTel Specs is the ParentBasedSampler with a fixed rate sampler for Root spans without a parent.

julealgon commented 5 months ago

Add an API or possibly use the existing hook in ASP.NET core when a request comes in and picking the traceid/Recorded properties, to be able to specify a sampling algorithm.

Just keep in mind this should work just as "natively" outside of ASP.NET Core, like cmdline apps, workers, etc.

samsp-msft commented 5 months ago

It's all based around https://github.com/dotnet/runtime/blob/5535e31a712343a63f5d7d796cd874e563e5ac14/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/ActivityListener.cs#L48C58-L48C62, so is not ASP.NET specific, but that's one of the main consumers.

samsp-msft commented 5 months ago

dotnet/dotnet-api-docs#9978

dotnet / runtime

System.Diagnostics.Activity out of proc sampling #102895