anze3db opened this issue 1 year ago
Hey @anze3db!
Thanks for reporting this.
Indeed, this should not happen. The fact that `daily-sync` passes is OK, because monitors are independent and we do not have a concept of child monitors and thus do not check the status of child monitors.
What are the grace and timeout settings of the `daily-sync` and the other sub-monitors?
> The fact that `daily-sync` passes is OK, because monitors are independent and we do not have a concept of child monitors and thus do not check the status of child monitors.
Yes, I agree. I just thought I'd mention that the wrapper monitor passed. I think this should rule out any errors in my function.
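For context, each step reports to its own monitor slug and the wrapper reports to `daily-sync`, so every slug gets its own independent check-ins. A minimal sketch of that layout (only `daily-sync` and `daily-sync-stats` are real slugs; the statuser slug and the helper functions are placeholders):

```python
# Minimal sketch of the monitor layout: each slug gets its own independent
# check-ins, so the outer daily-sync monitor can report "ok" regardless of
# what the inner monitors report. The helper functions are placeholders.
from sentry_sdk.crons import monitor


def run_statuser():
    ...  # placeholder for the ~1h statuser step


def run_dailystats():
    ...  # placeholder for the dailystats step


def daily_job():
    with monitor(monitor_slug="daily-sync"):
        with monitor(monitor_slug="daily-sync-statuser"):  # placeholder slug
            run_statuser()
        with monitor(monitor_slug="daily-sync-stats"):
            run_dailystats()
```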
> What are the grace and timeout settings of the `daily-sync` and the other sub-monitors?
All the monitors have a grace period of 60 minutes and a max runtime of 300 minutes.
I can try tightening these values and will report back if the issue still reproduces.
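For reference, if these thresholds were set from code rather than the UI, the equivalent upsert would look roughly like this (a sketch: the crontab schedule and the monitor slug are assumptions; only the 60-minute grace and 300-minute max runtime come from my settings):

```python
# Rough code equivalent of the UI threshold settings (sketch only; the
# schedule and the monitor slug are assumptions).
from sentry_sdk.crons import capture_checkin
from sentry_sdk.crons.consts import MonitorStatus

monitor_config = {
    "schedule": {"type": "crontab", "value": "0 1 * * *"},  # assumed: daily at 01:00
    "timezone": "UTC",
    "checkin_margin": 60,  # grace period, in minutes
    "max_runtime": 300,    # max runtime, in minutes
}

# Upsert the config together with the opening check-in.
check_in_id = capture_checkin(
    monitor_slug="daily-sync-statuser",  # placeholder slug
    status=MonitorStatus.IN_PROGRESS,
    monitor_config=monitor_config,
)
```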
Just an update that the job timed out today with the following threshold settings:
Let me know if there is anything I can do to help reproduce/resolve this.
Thanks for the update. Hm, indeed very strange. At the moment I don't even have a hunch as to why this is happening.
All your daily tasks are Django management commands and look the same. Is there anything in the `Statuser` task that is fundamentally different compared to the other tasks? Does it use async code, for example, or anything else you can think of?
The only difference is that `Statuser` runs the longest, taking more than an hour, while the other tasks all finish in under 20 minutes.
`Statuser` does use async, but in exactly the same way as the `Crawler` command. I use `@async_to_sync` in both cases to get around the fact that Django commands are sync only:
```python
from asgiref.sync import async_to_sync
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    # Main entry point for a Django Command:
    def handle(
        self,
        *args,
        offset=0,
        instances=None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
        **options,
    ):
        self.main(
            offset=offset,
            instances=instances,
            skip_inactive_for=skip_inactive_for,
            pre_filter=pre_filter,
        )

    # My async method that does all the work:
    @async_to_sync
    async def main(
        self,
        offset: int,
        instances: str | None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
    ):
        ...
```
I can try rewriting the code to not use async to see if async is the cause of this problem.
Even without any async code in `Statuser`, the monitor still failed today. 😔
Dang. And the monitor is marked as "Timed out", right? One thing you could check is whether the status of the monitor, a bit over an hour into processing, is set to running or timed out. Sentry only sets it to timed out after the full 5 hours (the 300-minute max runtime), so if it is set to timed out before then, something strange is going on.
> And the monitor is marked as "Timed out", right?
Correct.
> One thing you could check is whether the status of the monitor, a bit over an hour into processing, is set to running or timed out.
I've just done that. It is in the "In Progress" state while it's processing and stays in "In Progress" even after the job finishes.
So it stays "in progress" for the whole 5 hours and then is set to "failed"? Is this correct?
Yes, correct.
OK, thanks for confirming. I still don't know what is going wrong. I have asked the team handling the Crons check-ins in our backend if they have any ideas about what is happening here...
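One more thing that might help narrow it down: you could send the check-ins manually instead of relying on the decorator, and log the check-in id and statuses so you can compare them against what the UI shows. A rough sketch (the monitor slug and `do_the_work()` are placeholders, not your actual code):

```python
# Sketch of sending the check-ins explicitly instead of via the decorator,
# so the exact statuses and check-in id can be logged and compared with the UI.
# The monitor slug and do_the_work() are placeholders.
import logging

from sentry_sdk.crons import capture_checkin
from sentry_sdk.crons.consts import MonitorStatus

logger = logging.getLogger(__name__)


def do_the_work():
    ...  # placeholder for the actual ~1h statuser run


def run_with_manual_checkins():
    check_in_id = capture_checkin(
        monitor_slug="daily-sync-statuser",  # placeholder slug
        status=MonitorStatus.IN_PROGRESS,
    )
    logger.info("Sent in_progress check-in %s", check_in_id)
    try:
        do_the_work()
    except Exception:
        capture_checkin(
            monitor_slug="daily-sync-statuser",
            check_in_id=check_in_id,
            status=MonitorStatus.ERROR,
        )
        raise
    capture_checkin(
        monitor_slug="daily-sync-statuser",
        check_in_id=check_in_id,
        status=MonitorStatus.OK,
    )
    logger.info("Sent ok check-in %s", check_in_id)
```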
Just wanted to check in: if you try with the newest version of the SDK, does this still happen?
Yesterday, I re-enabled the monitor that was timing out before, and today it looks like it passed.
Looks like the issue has been resolved, so I'm closing this, but I'll reopen if I see any more failures. Thank you for taking care of this!
Oh no, I am reopening the issue because it has just happened again. I can see from my logs that there were no errors and that the `statuser` job finished after about 30 minutes:
```
Dec 18 01:05:08 raspberrypi bash[758000]: Running optimizer
Dec 18 01:05:08 raspberrypi bash[758000]: Running statuser
Dec 18 01:05:15 raspberrypi bash[758000]: Batch 0 done, sleeping for 90s
Dec 18 01:06:55 raspberrypi bash[758000]: Batch 1 done, sleeping for 90s
Dec 18 01:08:31 raspberrypi bash[758000]: Batch 2 done, sleeping for 90s
Dec 18 01:10:47 raspberrypi bash[758000]: Batch 3 done, sleeping for 90s
Dec 18 01:12:26 raspberrypi bash[758000]: Batch 4 done, sleeping for 90s
Dec 18 01:14:08 raspberrypi bash[758000]: Batch 5 done, sleeping for 90s
Dec 18 01:15:45 raspberrypi bash[758000]: Batch 6 done, sleeping for 90s
Dec 18 01:17:22 raspberrypi bash[758000]: Batch 7 done, sleeping for 90s
Dec 18 01:18:58 raspberrypi bash[758000]: Batch 8 done, sleeping for 90s
Dec 18 01:20:35 raspberrypi bash[758000]: Batch 9 done, sleeping for 90s
Dec 18 01:22:12 raspberrypi bash[758000]: Batch 10 done, sleeping for 90s
Dec 18 01:23:48 raspberrypi bash[758000]: Batch 11 done, sleeping for 90s
Dec 18 01:25:25 raspberrypi bash[758000]: Batch 12 done, sleeping for 90s
Dec 18 01:27:02 raspberrypi bash[758000]: Batch 13 done, sleeping for 90s
Dec 18 01:28:39 raspberrypi bash[758000]: Batch 14 done, sleeping for 90s
Dec 18 01:30:16 raspberrypi bash[758000]: Batch 15 done, sleeping for 90s
Dec 18 01:31:53 raspberrypi bash[758000]: Batch 16 done, sleeping for 90s
Dec 18 01:33:30 raspberrypi bash[758000]: Batch 17 done, sleeping for 90s
Dec 18 01:35:07 raspberrypi bash[758000]: Batch 18 done, sleeping for 90s
Dec 18 01:36:41 raspberrypi bash[758000]: Batch 19 done, sleeping for 90s
Dec 18 01:38:11 raspberrypi bash[758000]: Running dailystats
Dec 18 01:38:11 raspberrypi bash[758000]: Gathering daily stats
Dec 18 01:38:22 raspberrypi bash[758000]: All done! 🎉
```
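For context, the batching in those logs corresponds roughly to a loop like this (an illustrative sketch, not the actual command; the batch size, instance list, and `check_instance()` helper are assumptions):

```python
# Illustrative sketch of the batch loop implied by the log output above.
# Batch size, the instance list, and check_instance() are assumptions.
import asyncio


async def check_instance(url: str) -> None:
    ...  # placeholder for the real per-instance status check


async def run_batches(instances: list[str], batch_size: int = 50) -> None:
    for batch_no, start in enumerate(range(0, len(instances), batch_size)):
        batch = instances[start : start + batch_size]
        await asyncio.gather(*(check_instance(url) for url in batch))
        print(f"Batch {batch_no} done, sleeping for 90s")
        await asyncio.sleep(90)
```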
### How do you use Sentry?

Sentry Saas (sentry.io)

### Version

1.31.0
### Steps to Reproduce
### Expected Result

No alerts when the job finishes without errors, as seen in the logs.
### Actual Result

The monitor sometimes fails on the job that takes more than 1 hour.
Even though the monitor after it (`daily-sync-stats`) worked OK. And interestingly, the `daily-sync` monitor, which wraps the whole job function, also passed without problems.

I initially only had the `daily-sync` monitor configured, since I don't really need every step monitored, but when I only had a single monitor it failed consistently every day. With multiple monitors the failures are more intermittent.

I'm also not sure if it's relevant, but the job function is part of an always-running Django command that looks like this:
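Roughly, it is a long-running management command that wraps the daily work in the `daily-sync` monitor. The sketch below is illustrative only: the command name, the scheduling logic, and the way the sub-commands are invoked are all assumptions rather than the actual code.

```python
# Illustrative sketch only: command name, scheduling, and the way the daily
# tasks are invoked are assumptions, not the actual code from this issue.
import time

from django.core.management import call_command
from django.core.management.base import BaseCommand
from sentry_sdk.crons import monitor


class Command(BaseCommand):
    help = "Always-running scheduler that triggers the daily job"

    def handle(self, *args, **options):
        while True:
            self.daily_sync()
            time.sleep(24 * 60 * 60)  # crude stand-in for the real scheduling logic

    @monitor(monitor_slug="daily-sync")
    def daily_sync(self):
        # Each step is a separate management command with its own monitor.
        call_command("optimizer")
        call_command("statuser")
        call_command("dailystats")
```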