getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Reset spike event count on disabling spike protection #78311

Open romaingd-spi opened 1 month ago

romaingd-spi commented 1 month ago

Problem Statement

Hi, and thanks for the work on Sentry!

We sometimes experience bursts of events (during some maintenance operations), which rightfully trigger spike protection - something we're happy with. However, once we have the source issue under control (i.e. the frequency of events is back to the normal baseline), our usual process is to disable and immediately re-enable spike protection. The goal is to force the spike to end, reset the count, and let events flow in normally again. This is what we've been doing for several months.

Unfortunately, it seems that this no longer works (and after discussion with support, maybe it never worked this way and we were lucky every time): when a spike happens and spike protection is disabled / re-enabled, the next event re-triggers a spike if it arrives soon enough afterwards (even if it's a lone event, i.e. spike protection is no longer justified). This happened on 2024-09-23 - see the succession of spike - disable/re-enable - spike on the attached screenshot. As a consequence, when a spike happens, spike protection has to stay disabled for some time (~1 hour, I guess) to make sure new events are received without triggering a new, unjustified spike.

[Screenshot: succession of spike - disable/re-enable - spike on 2024-09-23]

Solution Brainstorm

Notes

Matches Zendesk support request 133598 (external)

Product Area

Settings - Spike Protection

getsantry[bot] commented 1 month ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 1 month ago

Routing to @getsentry/product-owners-settings-spike-protection for triage ⏲️

isabellaenriquez commented 1 month ago

Hi @romaingd-spi. Just trying to understand your use case a little more. Spike protection thresholds get recalculated intermittently while a spike is taking place, so spike protection should adjust to your new baseline automatically. Are you finding that this isn't happening so you need to stop spike protection yourself?

romaingd-spi commented 1 month ago

Hi @isabellaenriquez, thanks for taking a look!

When a spike happens, we react very quickly, since we're effectively blind to incoming events. We do our best to get the main issue under control (e.g. hotfix, rollback, or damage control). Once we reach that point, the spike is often still active, so we're still blind to other incoming events. What we need is a way to say "this spike is under control, no need to keep discarding events - but spike protection should still be enabled in case another spike happens".

To answer your question directly, we did not empirically see automatic adjustments during spikes that effectively allowed events to flow back in when we know the spike is under control. These adjustments occur, but at a rate that is (as it seems) too slow for our needs - hourly if I understand the docs correctly.

Let's say a spike happens at 12:05, and we've made sure it's under control at 12:15 (say the source issue was fixed). Our understanding is that a fraction of events from 12:05 to 12:15 are discarded (which is great, thanks spike protection), but also that all events from 12:15 to 13:00 are discarded (which we want to avoid - these matter). To avoid being blind until 13:00, we'd like to have a way to reset the event count. We used to think disabling / re-enabling spike protection did the trick, but it no longer works this way (if it ever did): if we disabled / re-enabled spike protection at 12:20, any single event happening e.g. at 12:25 would trigger a new spike, and make us blind again.
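
To make that timeline concrete, here is a toy model of our understanding, sketched in Python. It is not Sentry's actual logic: the function names, the "discard until the top of the next hour" rule, and the date are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def next_hour(t: datetime) -> datetime:
    """Start of the hour following t (assumed time of the next recalculation)."""
    return t.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


def is_discarded(event_time: datetime, spike_start: datetime) -> bool:
    """Toy rule: events are subject to discarding from the spike start
    until the next hourly recalculation. This mirrors our understanding only."""
    return spike_start <= event_time < next_hour(spike_start)


# Arbitrary date; only the times of day matter for this example.
spike_start = datetime(2024, 1, 1, 12, 5)

for hour, minute in [(12, 10), (12, 25), (13, 5)]:
    t = datetime(2024, 1, 1, hour, minute)
    status = "discarded" if is_discarded(t, spike_start) else "accepted"
    print(t.strftime("%H:%M"), status)
```

In this model the 12:25 event is still discarded even though the source issue was fixed at 12:15, which is the window we would like to be able to close manually.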

We could disable spike protection entirely until 13:00, but unless there's a way to say "disable for 1 hour", that's very error-prone because manual re-enabling is required at 13:00.
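
For illustration, a "disable for 1 hour" option could behave roughly like the sketch below. `TimedDisable` is a hypothetical helper, not an existing Sentry setting or API; the injectable clock is only there so the behavior can be shown without waiting an hour.

```python
import time
from typing import Callable, Optional


class TimedDisable:
    """Hypothetical helper: disable a feature now, and treat it as
    enabled again once a fixed duration has elapsed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._disabled_until: Optional[float] = None

    def disable_for(self, seconds: float) -> None:
        # Record the moment at which protection should come back on.
        self._disabled_until = self._clock() + seconds

    def is_enabled(self) -> bool:
        if self._disabled_until is None:
            return True
        if self._clock() >= self._disabled_until:
            # The disable window has elapsed: re-enable without manual action.
            self._disabled_until = None
            return True
        return False


# Usage with a fake clock (seconds since an arbitrary start).
now = [0.0]
protection = TimedDisable(clock=lambda: now[0])
protection.disable_for(3600)

now[0] = 1800.0
print(protection.is_enabled())  # still inside the one-hour window

now[0] = 3600.0
print(protection.is_enabled())  # window elapsed, enabled again
```

The point is that re-enabling is tied to the elapsed time rather than to someone remembering to flip the switch back at 13:00.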

isabellaenriquez commented 1 month ago

To summarize, the problem here occurs when:

  1. You get the issue that originally caused the spike under control quickly, but you're still receiving a high number of events.
  2. You want spike protection to adjust earlier to prevent further event discard.
  3. Manually disabling spike protection to prevent further event discard requires manual re-enabling: re-enabling too early would immediately trigger a new spike, while re-enabling too late leaves you vulnerable to unexpected spikes.

@romaingd-spi I think the only part I'm not understanding yet is why you'd want spike protection to stop discarding events if your event volume is still high. You're correct that spike protection thresholds during an active spike are only adjusted every hour, but spikes don't need to take the whole hour to deactivate; if the event volume is still higher than normal, it's still spiking, otherwise spike protection should stop. If we want spike protection to adjust earlier to prevent further event discard, that means increasing the baseline.
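
As a rough sketch of that rule (a toy model only, not the actual spike protection code; the threshold and the per-interval volumes are made-up numbers):

```python
def is_spiking(volume: int, threshold: int) -> bool:
    """Toy rule: a spike is active only while volume exceeds the threshold."""
    return volume > threshold


THRESHOLD = 1000  # made-up value; during a real spike it is adjusted hourly

# (time label, events seen in that interval), all hypothetical
samples = [("12:05", 5000), ("12:10", 3000), ("12:15", 800), ("12:20", 200)]

for label, volume in samples:
    state = "spiking" if is_spiking(volume, THRESHOLD) else "not spiking"
    print(label, volume, state)
```

Under this rule the spike ends at the 12:15 sample, as soon as volume falls below the threshold, without waiting for the next hourly adjustment.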

Thanks so much for explaining your case thus far!