getsentry / sentry-dotnet

Sentry SDK for .NET
https://docs.sentry.io/platforms/dotnet
MIT License

Backpressure Management #2410

Open mattjohnsonpint opened 1 year ago

mattjohnsonpint commented 1 year ago

Problem Statement

The internal BackgroundWorker class manages back-pressure for the SDK. When events, transactions, sessions, etc. are captured, the SentryClient adds them to a ConcurrentQueue within the background worker. The background worker then has a separate task that is responsible for reading from the queue and sending events to the ITransport implementation (typically that's the HttpTransport).

This is generally a good design. However, it suffers from the limitation that there's only one queue for all envelope types, non-prioritized, with a single limit set by SentryOptions.MaxQueueItems - defaulting to 30. If the queue is full, envelopes are dropped when captured, never being added to the queue. Therefore, it's quite possible (especially in a server environment) that a large flood of one type of envelope (such as transactions) can prevent another more important type of envelope (such as error events) from being sent to Sentry.
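The drop-on-full behaviour described above can be sketched roughly like this (illustrative names only; the real BackgroundWorker differs in detail):

```csharp
// Illustrative sketch only - not the actual SDK source. It assumes a single
// bounded ConcurrentQueue shared by all envelope types, as described above.
using System.Collections.Concurrent;

internal record Envelope(string Type); // stand-in for the real Envelope class

internal class BackgroundWorkerSketch
{
    private readonly ConcurrentQueue<Envelope> _queue = new();
    private readonly int _maxQueueItems;

    public BackgroundWorkerSketch(int maxQueueItems = 30) =>
        _maxQueueItems = maxQueueItems;

    public bool EnqueueEnvelope(Envelope envelope)
    {
        // Once the single shared queue is full, every envelope type is
        // dropped, regardless of how important it is.
        if (_queue.Count >= _maxQueueItems)
        {
            return false; // dropped - never reaches the transport
        }

        _queue.Enqueue(envelope);
        return true;
    }
}
```

With a default limit of 30, a burst of transactions can fill the queue and cause subsequent error events to return `false` here.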

Solution Brainstorm

We need to support concurrency on the producer side, so the built-in PriorityQueue class is out. We need to be non-blocking on the consumer side, so BlockingCollection is also out. There is no built-in PriorityConcurrentQueue class. Thus, to implement prioritization, we will probably need to use multiple ConcurrentQueue instances in the background worker - perhaps in a dictionary where the key is the envelope type.
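A hedged sketch of that multi-queue idea (all names invented for illustration): one ConcurrentQueue per envelope type, drained in a fixed priority order by the consumer.

```csharp
// Sketch of a prioritized worker queue: one ConcurrentQueue per envelope
// type, drained highest-priority-first. Not actual SDK code.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

internal record Envelope(string Type); // stand-in for the real Envelope class

internal class PrioritizedQueueSketch
{
    // Highest priority first; e.g. error events before transactions.
    private static readonly string[] Priority = { "event", "session", "transaction" };

    private readonly Dictionary<string, ConcurrentQueue<Envelope>> _queues =
        Priority.ToDictionary(t => t, _ => new ConcurrentQueue<Envelope>());

    public void Enqueue(Envelope envelope) =>
        _queues[envelope.Type].Enqueue(envelope);

    // The consumer scans queues in priority order, so a flood of
    // transactions can no longer starve error events.
    public bool TryDequeue(out Envelope? envelope)
    {
        foreach (var type in Priority)
        {
            if (_queues[type].TryDequeue(out envelope))
            {
                return true;
            }
        }

        envelope = null;
        return false;
    }
}
```

Producers stay lock-free (ConcurrentQueue handles concurrent enqueues), and the single consumer task pays only a small cost scanning a handful of queues per dequeue.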

We should also have more than one option for controlling the maximum queue size (these might be separate properties, or a dictionary). Currently we just have MaxQueueItems - which is undocumented.

We should probably set higher default queue sizes for ASP.NET and ASP.NET Core. Currently, MaxQueueItems defaults to 30. We do show setting it higher in some of the sample apps, but with no explanation.
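One hypothetical shape for such configuration (none of these properties exist today except MaxQueueItems; names are invented to illustrate the per-category idea):

```csharp
// Hypothetical options shape - shown only to illustrate per-category limits.
using System.Collections.Generic;

public class SentryOptionsSketch
{
    // Existing (currently undocumented) single limit.
    public int MaxQueueItems { get; set; } = 30;

    // Possible per-envelope-type overrides; types not listed here would
    // fall back to MaxQueueItems.
    public Dictionary<string, int> MaxQueueItemsByType { get; } = new()
    {
        ["event"] = 100,
        ["transaction"] = 1000,
    };
}
```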

Whatever is decided in https://github.com/getsentry/team-sdks/issues/10 should be taken into consideration also.

We potentially have a solution in this spike (although it may need tweaking):

References

ericsampson commented 1 year ago

What about using TPL Dataflow?
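For context, TPL Dataflow's ActionBlock already provides a bounded, concurrent producer/consumer pipeline out of the box. A minimal sketch of how it could stand in for the background worker (not SDK code; requires the System.Threading.Tasks.Dataflow package):

```csharp
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // System.Threading.Tasks.Dataflow NuGet package

internal static class DataflowSketch
{
    public static ActionBlock<string> CreateWorker()
    {
        return new ActionBlock<string>(
            async envelope =>
            {
                // Here the real worker would hand the envelope to ITransport.
                await Task.Delay(10);
            },
            new ExecutionDataflowBlockOptions
            {
                BoundedCapacity = 30,       // analogous to MaxQueueItems
                MaxDegreeOfParallelism = 1, // single consumer, like today
            });
    }
}
```

`worker.Post(envelope)` returns `false` when the block is at capacity, which mirrors the current drop-on-full behaviour. Prioritization would still need to be layered on top, e.g. with one block per envelope type.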

bitsandfoxes commented 7 months ago

Additional context: internal doc (sorry): https://vanguard.getsentry.net/p/clnlv0iuj0010s60q9e09ba3f

jamescrosswell commented 5 months ago

The Python implementation appears to combine two signals, whether the queue is full and whether requests have been rate limited, to adjust a downsample factor. This is then used to reduce the number of traces that get captured.

Curiously, it looks like rate limits across any category would cause downsampling (not just rate limits applied to traces). It would be good to understand if that was deliberate and, if so, what the thinking was behind that design choice.
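As I read the Python code, the health check and downsample factor interact roughly like this (a paraphrase in C#, not actual sentry-python code; all names are illustrative):

```csharp
// Rough paraphrase of the sentry-python backpressure monitor, as I read it.
using System;

internal class BackpressureMonitorSketch
{
    private int _downsampleFactor; // 0 = healthy, grows while unhealthy

    public void CheckHealth(bool queueFull, bool anyCategoryRateLimited)
    {
        // Note: a rate limit in *any* category counts as unhealthy,
        // not just rate limits on traces.
        var healthy = !queueFull && !anyCategoryRateLimited;

        if (healthy)
        {
            _downsampleFactor = 0;
        }
        else
        {
            _downsampleFactor++;
        }
    }

    // Only the trace sample rate is affected: halved per unhealthy interval.
    public double EffectiveSampleRate(double configuredRate) =>
        configuredRate / Math.Pow(2, _downsampleFactor);
}
```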

@bitsandfoxes are we looking to mimic what the Python SDK is doing in .NET, or are we also considering strategies like the one Matt originally proposed (some way to implement priority queues)?

bitsandfoxes commented 5 months ago

We do have some docs on backpressure.

> Curiously, it looks like rate limits across any category would cause downsampling (not just rate limits applied to traces). It would be good to understand if that was deliberate and, if so, what the thinking was behind that design choice.

If traces get rate limited, errors get downsampled?

jamescrosswell commented 5 months ago

> If traces get rate limited, errors get downsampled?

No, the other way around... Downsampling only affects traces in the Python implementation. However, if Metrics or Errors get rate limited, this would cause traces to be downsampled, if I've understood that code correctly.

bitsandfoxes commented 5 months ago

After talking to the other teams I think we should mimic the other SDKs. My reasoning:

  1. It's a lot easier and simpler
  2. It's already out there in the wild and proven to work well (enough)
  3. The idea is to still monitor and ease up on "unhealthy" systems. Transactions are the most likely culprit.
  4. The main scenario we'd like this to work really well for is spike protection.
  5. It's not so much to "protect" Relay from too many requests. But if the client is in such a bad state that it has trouble sending events (i.e. the queue is overflowing), we'd ideally no-op as much as possible so as not to add to the trauma.