Crons: Warn on mismatched schedule

evanpurkhiser commented 1 year ago

There is a scenario in the current implementation of Crons monitors where we probably should not be letting the user know "everything is okay"

Imagine a crontab monitor configured as 0 * * * *. This cron monitor should run on the hour every hour.

What is happening here?

The first checkin happens at 12:00. This is on time and successful
The second run happens at 1:00. This is on time and successful.
There are now five additional checkins which are sent before the next expected checkin at 2:00.

Current world

Right now record these as simply "Okay". The monitor is successful and in the UI you can see all these additional checkins.

Proposed world

If I have a monitor that runs once an hour, it may be crucial for me that it ONLY runs once per hour. This scenario is an error.

We should warn the user that they are sending unexpected checkins.

Technical implementation proposal

Instead of recording each subsequent checkin that is before the next expected checkin, we can bucket these with the most recent checkin and simply record that they happened. We can also update the status of the monitor to be extranious_checkins or something like that.

This additionally helps us reduce the number of rows within the monitorcheckin table, which is something we are currently keeping an eye on, since we are currently receiving many extraneous checkins.

getsantry[bot] commented 1 year ago

Routing to @getsentry/crons for triage, due by Monday, February 13th at 2:16 pm (sfo). ⏲️

wedamija commented 1 year ago

How would the bucketing work in a world where these are events? Since events are immutable, we can't easily merge them.

Or would we just treat these extra checkins as error checkins?

evanpurkhiser commented 1 year ago

When these are events it is not as critical that we are recording so many and we can just record them with an "extra_checkin" status and compute the count of extraneous ones with snuba, right?

wedamija commented 1 year ago

It's not as critical, but there's still a cost to storing/querying these. Since we're not charging by volume, it might make sense to still have some basic rate limiting in place.

evanpurkhiser commented 1 year ago

This is also related https://github.com/getsentry/sentry/issues/44390

getsantry[bot] commented 1 year ago

Routing to @getsentry/product-owners-crons for triage ⏲️

getsentry / sentry