dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
10.06k stars 2.03k forks source link

Tracing: Recorded flag not propagated between activities #9011

Closed Costo closed 3 months ago

Costo commented 4 months ago

Hi,

I'm experiencing something similar to this issue in Orleans: https://github.com/open-telemetry/opentelemetry-dotnet/issues/4474 The recorded flag is not propagated between activities, making impossible to sample or exclude whole traces from the root activity.

I was able to fix the issue with a 1-line change to ActivityPropagationOutgoingGrainCallFilter. See the diff here: https://github.com/dotnet/orleans/compare/main...Costo:orleans:fix/recorded-flag?expand=1

I'm not sure I understand what the problem is and why my change works. I tried to reproduce the issue in ActivityPropagationTests but did not succeed.

However, I can give the graph below as a "proof" that the fix works. When we migrated to from Orleans 3.5 to Orleans 8.0 with OpenTelemetry, we quickly realized that something was wrong as we were quickly hitting the 100GB daily cap every day. Since the fix was applied, it is much better, even if we still have some root traces we can filter out.

Screenshot 2024-05-17 155703

ReubenBond commented 4 months ago

Thanks for reporting, @Costo! @noahfalk are you able to advise us on what we should be doing here - is the one-line fix @Costo used the right fix? https://github.com/dotnet/orleans/compare/main...Costo:orleans:fix/recorded-flag?expand=1

noahfalk commented 4 months ago

Yeah, I think that is correct. OpenTelemetry requested we add this extra IsRemote bit to ActivityContext that is intended to indicate whether an ID originated from a remote machine across the wire or it was captured from some locally running Activity. Their sampling logic allows folks to use different heuristics depending on whether the parent is remote or not. However we already had APIs like ActivitySource.CreateActivity(string traceId, ...) that didn't accept the extra bit so they had to make an arbitrary choice whether to assume isRemote=true or isRemote=false for these calls. We wound up with setting isRemote=false as the default. In hindsight I wish we'd gone the other way but we've got back compat so no plans to change it. This fix works around that questionable choice of default by creating the ActivityContext yourself with the bit set correctly and then calling an overload that takes the ActivityContext.

Let me know if you've got any questions. Hope that helps :)

noahfalk commented 4 months ago

fyi @tarekgh

ReubenBond commented 4 months ago

Thanks for the explanation, @noahfalk!

ActivityContext.TryParse(traceParent, traceState, isRemote: true, out ActivityContext parentContext);
activity = source.CreateActivity(context.Request.GetActivityName(), ActivityKind.Server, parentContext);

When can TryParse fail here? If it does, should we be passing traceParent to CreateActivity instread of parentContext? Is this the ideal solution, or is there something superior we should be considering?

noahfalk commented 4 months ago

It fails when the format of traceParent doesn't match the W3C spec (code here: https://source.dot.net/#System.Diagnostics.DiagnosticSource/System/Diagnostics/ActivityContext.cs,70)

If it does, should we be passing traceParent to CreateActivity instread of parentContext?

There is an older .NET specific id format called hierarchical ID that ActivitySource.CreateActivity() also accepts, but I'm not sure if you'd every encounter it in your scenarios? For .NET Framework apps it is the default format produced when creating Activities but on .NET Core apps you'd only get it by explicitly opting into the legacy behavior which I suspect almost nobody does. If your code is .NET Core only I don't think you'd be giving up much by saying you only support W3C TraceParent IDs.

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/distributed-tracing-concepts#activity-ids

ReubenBond commented 4 months ago

Great, thanks again. We'll likely merge this, then. @Costo would you be happy to open a PR based on your change?

ReubenBond commented 3 months ago

Fixed by #9016