getsentry / sentry-dotnet

Sentry SDK for .NET
https://docs.sentry.io/platforms/dotnet
MIT License
580 stars 206 forks source link

Saas Backend Rate Limiting Errors #661

Closed Cooksauce closed 3 years ago

Cooksauce commented 3 years ago

Please mark the type framework used:

Please mark the type of the runtime used:

Please mark the NuGet packages used:

I'm getting Sentry reporting failures when trying to report a list of event's to Sentry.

I cannot find any documentation about there being rate limit outside of the billing quota.

As an attempt to avoid hitting the limit I have 100ms delay between the SentrySdk.Capture...() calls. What is an appropriate event rate?

Ex:

warn: Sentry.ISentryClient[0]
      The attempt to queue the event failed. Items in queue: 100

...

fail: Sentry.ISentryClient[0]
      Sentry rejected the event 970d918dad84422193143617622621cc. Status code: TooManyRequests. Sentry response: No message
Tyrrrz commented 3 years ago

The rate limiting is internal, so seeing as it surfaced is a sign of a bug in Sentry. How many events do you send per minute, approximately?

Cooksauce commented 3 years ago

I don't really have a good event/min number for you. Basically I have a cron job which runs, and can have some significant amount of error data (too big for one request) to report back to Sentry. This turns into a several hundred item long list of events to send. I'm fine limiting how fast I dump them into Sentry, but I can't do that when it's a hidden number, and the SDK just crashes when you reach that number.

IMO, the SDK should make it more obvious when something like this happens. It seems like this is something that could easily be reported back to the Saas platform to tell the end-user (me) that Sentry can't keep up with the event rate. Without something like this, I lose a lot of trust that Sentry is not dropping events.

bruno-garcia commented 3 years ago

The rate limiting is internal, so seeing as it surfaced is a sign of a bug in Sentry. How many events do you send per minute, approximately?

I'm not sure I agree with it. Why is it a bug that he sees he's being rate limited? Also, he sees it because the SDK is in debug mode. Sentry has a Stats page that shows how many events Sentry dropped due to rate limiting.

I don't really have a good event/min number for you. Basically I have a cron job which runs, and can have some significant amount of error data (too big for one request) to report back to Sentry. This turns into a several hundred item long list of events to send. I'm fine limiting how fast I dump them into Sentry, but I can't do that when it's a hidden number, and the SDK just crashes when you reach that number.

I don't follow by "the sdk crash"? Did the SDK crash your app?

From your description, the SDK did what it was designed and instead of keeping all queued events until it crashed your process due to out of memory, it started dropping events when the number if queued items was larger than the MaxQueueItems:

https://github.com/getsentry/sentry-dotnet/blob/54e1853dd26c526613706aeb54da5b9f4b221229/src/Sentry/SentryOptions.cs#L236

I see you've extended that number to 100. If you have enough ram, and want to make that 100_000, that's OK too. Now with regards to the rate limiting, there are multiple in place. There's a general abuse limit at the proxy level of 1k/s I believe, and there's Spike Protection (so you don't burn all your quota), you could be really over quota so Sentry will start dropping at some point.

As you mentioned SaaS, I recommend reaching out to Sentry support to check what's the best way for you in this case.

Cooksauce commented 3 years ago

I'm not sure I agree with it. Why is it a bug that he sees he's being rate limited? Also, he sees it because the SDK is in debug mode. Sentry has a Stats page that shows how many events Sentry dropped due to rate limiting.

I did not know about the rate limits on Stats page. However, looking at that page, it actually says that no events were rate limited. image

From your description, the SDK did what it was designed and instead of keeping all queued events until it crashed your process due to out of memory, it started dropping events when the number if queued items was larger than the MaxQueueItems

Right, crash is not the best descriptor. What I meant was just that the SDK was failing to report the events. The thing is, this is a paid service for the sole purpose of monitoring errors in a production application. That comes with an increased expectation of reliability. That said, it is certainly reasonable to have limits (both in the Saas product & the SDK) - but to have those limits being hit (and thus events being dropped) without the user being informed about it is, in my opinion, an undesirable user experience.

If the SDK gets a TooManyRequests response, is it supposed to wait & retry or just drop the event?

There's a general abuse limit at the proxy level of 1k/s I believe, and there's Spike Protection (so you don't burn all your quota), you could be really over quota so Sentry will start dropping at some point.

The event rate was definitely not anywhere near 1k/s. With a 100ms delay, that puts it somewhere near 10/s. Everything in the Saas platform also indicates there is plenty of room in our quota.

Edit: Spike Protection is off as well

bruno-garcia commented 3 years ago

That comes with an increased expectation of reliability. That said, it is certainly reasonable to have limits (both in the Saas product & the SDK) - but to have those limits being hit (and thus events being dropped) without the user being informed about it is, in my opinion, an undesirable user experience.

I don't believe there's any failure in reliability in here. There seem to be an issue in transparency and expectation of what rate the server is designed to support and how the user can be aware if things are being dropped. I agree with that.

This isn't an issue with the SDK since the SDK is logging out the response from the server, which is 429 TooManyRequests. As a back pressure mechanism it buffers up to MaxQueueItems as mentioned before, and drops the rest.

I suggest you raise an issue with support because 10 events per seconds should not trigger a 429 from Sentry. Please send a link to this issue for context.