getsentry / sentry-dotnet

Sentry SDK for .NET
https://docs.sentry.io/platforms/dotnet
MIT License

Replace `ConcurrentBag<T>` with a simpler data structure #3433

Open jamescrosswell opened 1 month ago

jamescrosswell commented 1 month ago

We appear to be using `ConcurrentBag<T>` in ways it possibly wasn't designed for, and this may be the cause of a memory leak we discovered. We should investigate whether a simpler data structure, one that doesn't rely on `ThreadLocal` storage or work-stealing pools, could be used instead.

Originally posted by @jamescrosswell in https://github.com/getsentry/sentry-dotnet/issues/3432#issuecomment-2175807165
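
For illustration, one simpler alternative is a plain `List<T>` behind a lock - a minimal sketch only, not the SDK's actual replacement, and the real thing would need to match how spans are added and enumerated today:

```csharp
using System.Collections.Generic;

// Hypothetical sketch: a minimal synchronized collection that avoids
// ConcurrentBag<T>'s per-thread ThreadLocal storage and work-stealing queues.
internal sealed class SynchronizedCollection<T>
{
    private readonly List<T> _items = new();
    private readonly object _gate = new();

    public void Add(T item)
    {
        lock (_gate)
        {
            _items.Add(item);
        }
    }

    // Return a snapshot so callers can enumerate without holding the lock.
    public IReadOnlyList<T> ToSnapshot()
    {
        lock (_gate)
        {
            return _items.ToArray();
        }
    }
}
```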

cliedeman commented 2 weeks ago

We are still experiencing high memory usage related to this

[image: memory dump showing high memory usage]

@jamescrosswell is there anything we can do in the short term?

bitsandfoxes commented 2 weeks ago

We should start looking into something similar to https://github.com/getsentry/sentry-dotnet/issues/829. If the SDK knows a transaction is not going to be sampled (i.e. it will be discarded), then all the potential child spans can be no-ops. That way users regain some control over the memory consumption via the sample rate.
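
Roughly, the idea looks like the sketch below. This is not the SDK's current code; the interface and no-op type are hypothetical, while `SentryOptions.TracesSampleRate` is the existing knob users have today:

```csharp
using Sentry;

// Existing knob: sample fewer transactions to bound tracing memory.
SentrySdk.Init(options =>
{
    options.Dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"; // placeholder DSN
    options.TracesSampleRate = 0.1; // keep ~10% of transactions, discard the rest
});

// Hypothetical sketch of the proposed optimization: an unsampled transaction
// hands out a shared no-op child, so no per-span state accumulates in memory.
interface ILightweightSpan
{
    ILightweightSpan StartChild(string operation);
    void Finish();
}

sealed class NoOpSpan : ILightweightSpan
{
    public static readonly NoOpSpan Instance = new();
    public ILightweightSpan StartChild(string operation) => Instance; // allocates nothing
    public void Finish() { }
}
```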

jamescrosswell commented 2 weeks ago

> @jamescrosswell is there anything we can do in the short term?

@cliedeman as far as I know, the original memory leak this ticket relates to was fixed in:

The purpose of this ticket was to explore a more elegant solution.

We know of a memory leak in Sentry.Profiling (which is still not GA) but otherwise it should all be airtight.

If you're seeing memory growing monotonically and you're not using Sentry.Profiling, then we'd definitely like to dig deeper. The first step for us is always reproducing the problem, so if you can share a small sample project that recreates the issue, we can look into it.
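
For reference, a minimal repro usually doesn't need much more than an init call plus a loop that generates the kind of load you're seeing; something along these lines (the DSN and workload are placeholders):

```csharp
using System;
using Sentry;

// Minimal console repro sketch: tracing enabled, a steady stream of transactions
// and spans, with managed memory printed periodically.
SentrySdk.Init(options =>
{
    options.Dsn = "https://examplePublicKey@o0.ingest.sentry.io/0"; // placeholder
    options.TracesSampleRate = 1.0; // sample everything to maximise span allocations
});

for (var i = 0; i < 100_000; i++)
{
    var transaction = SentrySdk.StartTransaction("repro", "test.operation");
    var span = transaction.StartChild("child.operation");
    span.Finish();
    transaction.Finish();

    if (i % 10_000 == 0)
    {
        var managedMb = GC.GetTotalMemory(forceFullCollection: false) / 1024 / 1024;
        Console.WriteLine($"{i}: {managedMb} MB managed");
    }
}
```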

cliedeman commented 2 weeks ago

Hi @jamescrosswell

Profiling is not enabled.

I don't believe it's growing monotonically. We haven't found a pattern - I think this could possibly be because of an increase in exceptions, but those exceptions are filtered SDK-side. We had one issue where 3 OOM errors followed about an hour apart.

[image: memory dump]

Thanks Ciaran

jamescrosswell commented 2 weeks ago

> I don't believe it's growing monotonically. We haven't found a pattern - I think this could possibly be because of an increase in exceptions, but those exceptions are filtered SDK-side. We had one issue where 3 OOM errors followed about an hour apart.

Ah, in that case it may be related to @bitsandfoxes's comment. In the dump you've shown above, the largest consumer of memory appears to be a concurrent bag that is storing trace information - so `TransactionTracer` and `SpanTracer` instances. You'd have to be creating an absolute tonne of these to get an OOM exception though.

The other possibility is the `MemoryCache` shown in your dump.

You'd really need to capture a dump just before an OOM exception was thrown to be sure which was the culprit though. It may be that in normal circumstances Sentry's `ConcurrentBag` is allocating relatively more memory than other stuff in the process, but in the circumstances leading up to your OOM exception it's something else entirely.

I put together an experiment a while back with some code that you could potentially include in your application to capture the dump at the right time (just before the OOM).
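
For anyone following along, the general shape of that kind of watchdog is sketched below. This is an assumption of the approach rather than the linked experiment itself; it uses the Microsoft.Diagnostics.NETCore.Client package, and the threshold and dump path are placeholders:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client; // NuGet: Microsoft.Diagnostics.NETCore.Client

// Watchdog sketch: poll the working set and write a heap dump once it crosses
// a threshold, so the dump is captured just before the OOM rather than after it.
const long thresholdBytes = 350L * 1024 * 1024; // placeholder (~350 MB)

_ = Task.Run(async () =>
{
    var pid = Environment.ProcessId;
    while (true)
    {
        if (Process.GetCurrentProcess().WorkingSet64 > thresholdBytes)
        {
            var dumpPath = Path.Combine(Path.GetTempPath(), $"pre-oom-{pid}.dmp");
            new DiagnosticsClient(pid).WriteDump(DumpType.WithHeap, dumpPath);
            break; // one dump is enough
        }

        await Task.Delay(TimeSpan.FromSeconds(5));
    }
});
```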

cliedeman commented 2 weeks ago

These are the Azure crash monitor dumps, taken at crash time.

jamescrosswell commented 2 weeks ago

> These are the Azure crash monitor dumps, taken at crash time.

And was that an OOM crash? So you've got less than 400MB of memory available to the process?

cliedeman commented 2 weeks ago

Unsure. The server definitely has more memory. There's no increase in server error rates or request rates. I'll scrutinize the dumps later to try to see why there are so many spans, because that doesn't make sense to me. I am also willing to share the dumps.

Also, the server is 32-bit, so the OOM could be caused by a collection hitting its maximum size, not just by heap size.
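
To illustrate the 32-bit point with a standalone sketch (unrelated to Sentry's code): a 32-bit process can throw OutOfMemoryException from address-space or single-allocation limits long before physical RAM runs out.

```csharp
using System;

// In a 32-bit process the usable address space is roughly 2 GB (up to 4 GB if
// large-address-aware), so a large contiguous allocation can fail even when
// the machine has plenty of free RAM.
Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");

try
{
    var large = new byte[1_500_000_000]; // ~1.5 GB contiguous allocation
    Console.WriteLine($"Allocated {large.Length / (1024 * 1024)} MB");
}
catch (OutOfMemoryException)
{
    Console.WriteLine("OOM hit: allocation/address-space limit, not necessarily overall heap pressure.");
}
```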

cliedeman commented 2 weeks ago

Let me rephrase my thinking.

Our application is crashing with OOM. There are no obvious suspects. Sentry is the highest memory user according to the dumps.

There are several issues related to memory on this repo. I would like to at least eliminate Sentry from the possible causes.

I could turn off Sentry and observe whether the issues stop, but Sentry is really useful.

bitsandfoxes commented 2 weeks ago

We will prioritize getting on top of the SDK's memory consumption. @jamescrosswell has a PR with an optimization experiment: https://github.com/getsentry/sentry-dotnet/pull/3434. @cliedeman, would you be able/willing to run an alpha build and see if that does anything to reduce this issue?

cliedeman commented 2 weeks ago

Hi @bitsandfoxes

Yes, happy to do so.