getsentry / sentry

Cron monitors seem to exhibit inconsistent behavior across environments #71671

Open tbloncar opened 3 months ago

tbloncar commented 3 months ago

Environment

SaaS (https://sentry.io/)

Steps to Reproduce

  1. Set up a cron monitor that is used in multiple environments (e.g., staging and prod)
  2. Configure the monitor such that it runs at the top and bottom of every hour (0,30 * * * *), has a 4-minute grace period, and has a 1-minute max runtime (a configuration sketch follows this list)
  3. Run the task, with some runtime variance, in both environments
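
For reference, here is a minimal sketch of a monitor set up this way via the Python SDK's upsert-style monitor_config. The slug my-task and the use of code-based configuration are assumptions; the monitor may just as well be configured in the UI or another SDK, and this requires a sentry-sdk version that supports monitor_config on the decorator.

```python
import sentry_sdk
from sentry_sdk.crons import monitor

# The environment set here ("production" vs. "staging") is what the
# monitor's per-environment check-ins are grouped by.
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    environment="production",
)

# Upsert-style config mirroring the settings above (slug is a placeholder).
monitor_config = {
    "schedule": {"type": "crontab", "value": "0,30 * * * *"},
    "checkin_margin": 4,  # grace period, in minutes
    "max_runtime": 1,     # max runtime, in minutes
    "timezone": "UTC",
}

@monitor(monitor_slug="my-task", monitor_config=monitor_config)
def run_task():
    ...  # task body; actual runtime varies between runs
```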

Expected Result

Monitor failure is triggered when the task fails to check in within 4 minutes OR runs for longer than 1 minute.

Actual Result

A monitor failure is seemingly triggered erroneously in one environment but not the other. Notice that, in both cases, neither of the above conditions is met. However, for the prod environment, we get a failure with a reason of "check-ins detected." I've confirmed via CloudWatch logs that the prod task ran for the 49 seconds indicated by the UI here.

Screenshot 2024-05-29 at 12 26 24 PM

Product Area

Crons

Link

No response

DSN

No response

Version

No response

getsantry[bot] commented 3 months ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 3 months ago

Routing to @getsentry/product-owners-crons for triage ⏲️

evanpurkhiser commented 3 months ago

The failure reason of "check-ins detected" is a bug as described in https://github.com/getsentry/sentry/issues/71179

I'm not 100% sure why you would get a timeout like that, but I think some latency may be at play with when the completion check-in was received.

tbloncar commented 3 months ago

Thanks for the info, re: "check-ins detected," @evanpurkhiser. Just to confirm my understanding, you're thinking there's some latency on Sentry's side that is preventing the system from recognizing that the check-in occurred prior to the runtime deadline? I've temporarily bumped the max runtime setting on our side to see if this helps.

evanpurkhiser commented 3 months ago

Yeah. The check-in is definitely completing since it has a reported duration. Usually, the reported duration would be longer than the timeout in a case like this, so I suspect there is some latency somewhere in the system with the completion check-in making its way to Sentry.

The general architecture is:

  1. The SDK sends the completion check-in.
  2. The check-in is received by Relay.
  3. Relay produces the check-in into a queue.
  4. The queue is processed.

Once the check-in is in the queue, the timestamp at which it reached the queue is what its processing time is based on. The whole system moves at the speed of the queue.
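
To make that concrete, here is a rough sketch of the same lifecycle using the Python SDK's explicit check-in API (the slug and task body are placeholders). The key point is that the closing OK check-in only takes effect once it has gone through Relay and been consumed from the queue:

```python
import sentry_sdk
from sentry_sdk.crons import capture_checkin
from sentry_sdk.crons.consts import MonitorStatus

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    environment="production",
)

def do_work():
    ...  # placeholder for the actual task (~49s in the report above)

# Opening (in-progress) check-in sent by the SDK.
check_in_id = capture_checkin(
    monitor_slug="my-task",
    status=MonitorStatus.IN_PROGRESS,
)

do_work()

# The completion check-in still has to pass through Relay and the queue;
# the timeout is effectively evaluated against when it is processed, not
# when the SDK sent it, so a run that finishes close to max_runtime can
# still be marked as timed out.
capture_checkin(
    monitor_slug="my-task",
    check_in_id=check_in_id,
    status=MonitorStatus.OK,
)
```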

It gets difficult to maintain tight tolerances when the completion check-in gets that close to the timeout. In general, I would recommend leaving a bit of buffer room if you expect the task to run for that long.
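
As a rough illustration of that buffer (the values are only an example, and the same change can be made in the monitor's UI settings), bumping max_runtime well above the observed ~49s runtime would look something like:

```python
monitor_config = {
    "schedule": {"type": "crontab", "value": "0,30 * * * *"},
    "checkin_margin": 4,  # grace period unchanged, in minutes
    "max_runtime": 3,     # example: headroom over the ~49s observed runtime
    "timezone": "UTC",
}
```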

Let us know if you still see problems even with a bit more of a timeout margin!