MAUI/iOS - Spans recorded to wrong transaction when creating multiple transactions simultaneously

getsentry / sentry-dotnet

Sentry SDK for .NET

https://docs.sentry.io/platforms/dotnet

MIT License

579 stars 206 forks source link

MAUI/iOS - Spans recorded to wrong transaction when creating multiple transactions simultaneously #3546

Closed MichaelLHerman closed 1 week ago

MichaelLHerman commented 3 weeks ago

Package

Sentry

.NET Flavor

.NET

.NET Version

8.0

OS

iOS

SDK Version

4.10.1

Self-Hosted Sentry Version

No response

Steps to Reproduce

I am trying to migrate a .NET Maui app from AppCenter to Sentry. We were using AppCenter's Event logging extensively for analytics. I am trying to replace that with spans and breadcrumbs but I'm running into issues regarding global scope.

The app is offline capable, so it saves changes in a database and has a "sync queue" to make service calls if online, retry if unsuccessful, etc.

I'm instrumenting both user actions (button clicks) and the sync queue work for analytics and troubleshooting. To do so, I create transactions around every ICommand implementation and every run of the sync queue. Since I don't want to modify my whole codebase passing transactions/scopes to the service layer (for event logging), or database and api layer (for span creation), I am relying on setting the Scope's Transaction for the data layer to create its child spans. I'm using SentryHttpMessageHander to generate spans for api calls, which also relies on the Scopes Transaction.

Since both UI actions and background work happen at the same time, they override the current Scope Transaction and spans and breadcrumbs get logged to the wrong transaction, losing important data.

Reproducible Sample

Clone demonstration https://github.com/MichaelLHerman/SentryMobileScope, change Sentry Dsn
Run app targeting iOS simulator
Click "Click Me" button several times in rapid succession
Observe in Sentry that "remove" and "get" spans are being logged under the "test.ui.command" "click" transaction. "get" and "remove" are only being logged by the sync queue, they should only appear under the "send" transaction. "click" should only have "save"'s.

Expected Result

I think client apps need a way to have Global Scope turned off.

I'd like guidance on how to turn IsGlobalModeEnabled to false. I tried doing so and tried surrounding each transaction in a _hub.PushScope() / .Dispose() but from inspecting the variables in the debugger, it appeared that the Scope stack was not getting popped on dispose and was continuously increasing.

Actual Result

jamescrosswell commented 3 weeks ago

Thanks for the sample app @MichaelLHerman - I can easily reproduce this.

The question is now what to do about it 😄

jamescrosswell commented 3 weeks ago

@MichaelLHerman currently you have a single instance of DataService/Database shared between MainPage and the WorkQueue:

MainUIThread  ➙ MainPage  ➘
                    DataService ➙ DataBase
ThreadPool    ➙ WorkQueue ➚

Essentially you want spans created by the DataService/DataBase to be children of either the transaction created by the MainPage or the WorkQueue... depending on which one is calling the DataService/DataBase method.

I wondered if the active thread could be used to determine this. However, you could end up on the thread pool from async/await calls from either the MainPage or the WorkQueue, so I don't think this works. You could infer the calling context from a stack trace, but that would be extremely expensive so, again, not a great option.

As such, I think there are only two ways to achieve what you want:

Either you create separate instances of DataService/Database and provide each service with enough information to be able to determine its calling context
Or you pass the calling context to every method that you use from the DataService/Database

Of the two approaches, the cleanest seems to be option number 1. I did a bit of hacking around to create a variant of the repository that you shared to get something like the isolation you're after. It's not very clean... it could be tidied up with the use of Interfaces and dependency injection, but as a POC I think it works:

https://github.com/jamescrosswell/SentryMobileScope

MichaelLHerman commented 3 weeks ago

@jamescrosswell Thanks for the sample - definitely very helpful and I'll dive more into it tomorrow how I could create a "Dual"ton pattern for IsolatedTransactionManager and pass that into Transient/PerInstance "Service"s, but I'm sure I'll get pushback from my team on making too many changes to the app.

I also don't think the IsolatedTransaction idea would work with SentryHttpMessageHandler, which has the same problem - apis are called from both Commands/ViewModels and from the work queue (both through Service Layer -> Api Layer -> Api Call) and its implementation uses _hub.GetSpan. Would need custom implementation.

I wondered if the active thread could be used to determine this. However, you could end up on the thread pool from async/await calls from either the MainPage or the WorkQueue, so I don't think this works.

The UI thread has thread-affinity where all awaits will return to the UI thread and (at least right now in my app) the work queue is currently always single-threaded on a task pool thread (but could change to parallelized in the future). Wondering how does this work in ASP.NET's usage of Sentry with IsGlobalModeEnabled=false? I thought Sentry's use of AsyncLocal was for this exact problem - isolated scopes for multiple requests at once. I'll provide a sample branch tomorrow of what I tried on that route.

jamescrosswell commented 3 weeks ago

I'm sure I'll get pushback from my team on making too many changes to the app

I think that's fair. It gives a flavour for some things you can do with Sentry's lower level API concepts but I'm not sure how well this would scale and I'm not sure I'd recommend using it in a complex/production app.

Disabling GlobalMode

Wondering how does this work in ASP.NET's usage of Sentry with IsGlobalModeEnabled=false? I thought Sentry's use of AsyncLocal was for this exact problem - isolated scopes for multiple requests at once.

It is yes. That used to be the only mode that Sentry supported actually. IsGlobalModeEnabled=true was introduced because of issues using an AsyncLocalScopeStackContainer with GUI apps:

https://github.com/getsentry/sentry-dotnet/issues/628

If you disable Global mode in a MAUI app then you expose yourself to those issues. You might not encounter these initially but if/when you do, there's no easy workaround.

Alternative Solution - OpenTelemetry

I played around with a completely different approach, which is to use Sentry's OpenTelemetry integration and to use OpenTelemetry instrumentation for all the tracing:

https://github.com/jamescrosswell/SentryMobileScope/tree/otel

I think it's much cleaner than the bespoke isolated scope management I shared earlier.

MichaelLHerman commented 2 weeks ago

I am trying out GlobalMode since I think logging data to the wrong transaction is a bigger blocker than https://github.com/getsentry/sentry-dotnet/issues/221. I'll follow up in the next few weeks on how that goes.

In the meantime, I have 2 suggestions on the documentation side of things.

https://docs.sentry.io/platforms/dotnet/enriching-events/scopes/#how-the-scope-and-hub-work

Effectively this means that when you spawn a task in .NET and the execution flow is not suppressed all the context you have bound to the scope in Sentry will flow along. If however you suppress the flow, you get new scope data.

This section could use some clarification. Not sure what "suppressing the flow" means and this doesn't mention GlobalMode at all.
https://docs.sentry.io/platforms/dotnet/tracing/instrumentation/opentelemetry/ I didn't look at re-implementing my proof of concept using OpenTelemetry because this doc page didn't have any details/examples about logging "additional data" via OpenTelemetry. After reading this page, I don't know how to get extra data (Extras/Context, Breadcrumbs, Tags) to Sentry; if that's at all possible or how it would look like on Sentry.io

jamescrosswell commented 1 week ago

2. https://docs.sentry.io/platforms/dotnet/tracing/instrumentation/opentelemetry/ I didn't look at re-implementing my proof of concept using OpenTelemetry because this doc page didn't have any details/examples about logging "additional data" via OpenTelemetry. After reading this page, I don't know how to get extra data (Extras/Context, Breadcrumbs, Tags) to Sentry; if that's at all possible or how it would look like on Sentry.io

@MichaelLHerman there's an example of how you can store extra information using the OpenTelemetry APIs in the samples: https://github.com/getsentry/sentry-dotnet/blob/43e28adab2859d8defc0c67692e3e00b4b2060bc/samples/Sentry.Samples.OpenTelemetry.Console/Program.cs#L34-L51

You could do the following to set the User.Email:

 using (var activity = activitySource.StartActivity("Main")) 
 { 
+    SentrySdk.ConfigureScope(scope => scope.User.Email = "some.user@foo.com");

     // This creates a span called "Task 1" within the transaction 
     using (var task = activitySource.StartActivity("Task 1"))

Any context you store on the Sentry Scope will also get applied to transactions instrumented using the OpenTelemetry Activity classes.

jamescrosswell commented 1 week ago

https://docs.sentry.io/platforms/dotnet/enriching-events/scopes/#how-the-scope-and-hub-work

Effectively this means that when you spawn a task in .NET and the execution flow is not suppressed all the context you have bound to the scope in Sentry will flow along. If however you suppress the flow, you get new scope data.

This section could use some clarification. Not sure what "suppressing the flow" means and this doesn't mention GlobalMode at all.

I think that's a fair comment. Most .NET developers probably never have to worry about flowing or suppressing the flow of execution context, since .NET does a good job of making all of that magically do what you want. We could try to give a concise/simplified explanation in the Sentry docs, but anyone craving detail should read Toub's explanation. Possibly we just link to that instead.

jamescrosswell commented 1 week ago

Alternative Solution - OpenTelemetry

I played around with a completely different approach, which is to use Sentry's OpenTelemetry integration and to use OpenTelemetry instrumentation for all the tracing:

https://github.com/jamescrosswell/SentryMobileScope/tree/otel

@bitsandfoxes I did a bit of digging. I think the reason this works when using OpenTelemetry instrumentation for tracing is because the OpenTelemetry Activity.Current static variable is an async local: https://github.com/dotnet/runtime/blob/5535e31a712343a63f5d7d796cd874e563e5ac14/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Activity.Current.net46.cs#L10

So effectively when using Sentry with IsGlobalMode=true and OpenTelemetry for tracing, you end up with a bit of a hybrid between the global Sentry scope (which stores tags, breadcrumbs, Contexts etc.) and the async local "scope" (it's not called that, but whatever) used to store OpenTelemetry spans and any extra data associated with these.

In the case of traces, I think that's probably a superior solution. I can't think of a scenario where you'd want "Global Mode" for traces... with spans from one thread being woven into spans from another thread and reported as part of the same transaction.

So I think we need to do something about Scope.Transaction... that particular scope property, I think, should probably be async local (irrespective of whatever scope manager has been configured). Or perhaps, since that would be quite confusing I think, we remove the Transaction property from the scope entirely (it shouldn't really live on the scope) and we store the current transaction somewhere else... Possibly on a SentryTransaction.Current static variable (same way Activity.Current works).

That would also make the tracing experience more consistent (regardless of whether you were instrumenting using the Sentry APIs or the OpenTelemetry APIs).

jamescrosswell commented 1 week ago

@MichaelLHerman I've created #3590 to replace this issue.

@bitsandfoxes maybe we want to fix this in the 5.0.0 release, since it's a fairly fundamental change to the behaviour of the SDK (for the better, I believe, but still).