getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Monitor intermittent failures on long-running tasks #2422

Open anze3db opened 1 year ago

anze3db commented 1 year ago

How do you use Sentry?

Sentry SaaS (sentry.io)

Version

1.31.0

Steps to Reproduce

  1. Have a job that runs for over 1 hour.
  2. Set up a Cron Monitor with a max runtime threshold of 300 minutes ([Screenshot 2023-10-08 at 09 29 01]).
  3. Use the monitor in the code:
    from django.core import management
    from sentry_sdk.crons import monitor

    @monitor(monitor_slug="daily-sync")
    def job(self):
        self.console.print("Starting daily job")
        self.console.print("Running crawler")
        with monitor(monitor_slug="daily-sync-crawler"):
            management.call_command("crawler", skip_inactive_for=3, pre_filter=True)
        self.console.print("Running indexer")
        with monitor(monitor_slug="daily-sync-indexer"):
            management.call_command("indexer")
        self.console.print("Running optimizer")
        with monitor(monitor_slug="daily-sync-optimizer"):
            management.call_command("optimizer")
        self.console.print("Running statuser")
        with monitor(monitor_slug="daily-sync-statuser"):
            management.call_command("statuser")
        self.console.print("Running dailystats")
        with monitor(monitor_slug="daily-sync-stats"):
            management.call_command("dailystats")
        self.console.print("All done! 🎉")

Expected Result

No alerts when the job finishes without errors, as seen in the logs:

Oct 08 02:00:00 raspberrypi bash[405]: Starting daily job
Oct 08 02:00:00 raspberrypi bash[405]: Running crawler
Oct 08 02:07:00 raspberrypi bash[405]: Running optimizer
Oct 08 02:21:31 raspberrypi bash[405]: Running statuser
Oct 08 03:29:49 raspberrypi bash[405]: Running dailystats
Oct 08 03:29:49 raspberrypi bash[405]: Gathering daily stats
Oct 08 03:29:50 raspberrypi bash[405]: All done! 🎉

Actual Result

The monitor sometimes fails on the job that takes more than 1 hour:

[Screenshot 2023-10-08 at 09 33 14]

Even though the monitor that runs after it (daily-sync-stats) passed:

[Screenshot 2023-10-08 at 09 35 11]

And interestingly, the daily-sync monitor, which wraps the whole job function, also passed without problems.

I initially only had the daily-sync monitor configured since I don't really need every step monitored, but when I only had a single monitor it failed consistently every day. With multiple monitors the failures are more intermittent.

I'm also not sure if it's relevant, but the job function is part of an always-running Django command that looks like this:

    import time

    import schedule

    def handle(self, *args, run_now=False, **options):
        self.console.print("Starting scheduler 🕐")
        # Schedule self.job to run every day at 01:00
        schedule.every().day.at("01:00").do(self.job)

        # Poll for pending jobs once per second, forever
        while True:
            schedule.run_pending()
            time.sleep(1)
antonpirker commented 1 year ago

Hey @anze3db! Thanks for reporting this. This should indeed not happen. The fact that daily-sync passes is expected: monitors are independent, and there is no concept of child monitors, so a monitor's status is never derived from the status of the monitors nested inside it.
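
To illustrate the point in code: in a nested setup like the one above, each slug gets its own pair of check-ins, so the outer monitor can close as ok even when an inner one is later judged to have timed out. A minimal sketch (slugs taken from this thread):

    from django.core import management
    from sentry_sdk.crons import monitor

    @monitor(monitor_slug="daily-sync")  # outer monitor: sends its own check-ins
    def job(self):
        # inner monitor: sends a completely independent pair of check-ins
        with monitor(monitor_slug="daily-sync-statuser"):
            management.call_command("statuser")
        # The outer "daily-sync" check-in closes as ok here, regardless of how
        # Sentry later judges the inner monitor's check-in (e.g. timed out).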

What are the grace period and max runtime settings of daily-sync and the other sub-monitors?

anze3db commented 1 year ago

> The fact that daily-sync passes is expected: monitors are independent, and there is no concept of child monitors, so a monitor's status is never derived from the status of the monitors nested inside it.

Yes, I agree. I just thought I'd mention that the wrapper monitor passed. I think this should rule out any errors in my function.

> What are the grace period and max runtime settings of daily-sync and the other sub-monitors?

All the monitors have a grace period of 60 minutes and a max runtime of 300 minutes:

[Screenshot 2023-10-09 at 09 23 49]

I can try tightening these values and report back if the issue still reproduces.
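
For reference, newer versions of the SDK can also declare these thresholds in code by passing a monitor_config to the monitor decorator, which upserts the monitor's settings on check-in. A minimal sketch, assuming a recent SDK version and the values from the screenshots above (the timezone is a guess):

    from sentry_sdk.crons import monitor

    daily_sync_config = {
        "schedule": {"type": "crontab", "value": "0 1 * * *"},  # daily at 01:00
        "checkin_margin": 60,  # grace period, in minutes
        "max_runtime": 300,    # max runtime, in minutes
        "timezone": "UTC",     # assumption: set to the server's actual timezone
    }

    @monitor(monitor_slug="daily-sync", monitor_config=daily_sync_config)
    def job(self):
        ...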

anze3db commented 1 year ago

Just an update that the job timed out today with the following threshold settings:

[Screenshot 2023-10-10 at 14 24 54]

[Screenshot 2023-10-10 at 14 30 36]

Let me know if there is anything I can do to help reproduce/resolve this.

antonpirker commented 1 year ago

Thanks for the update. Hm, indeed very strange. At the moment I don't even have a hunch as to why this is happening. All your daily tasks are Django management commands and look the same. Is there anything in the statuser task that is fundamentally different from the other tasks? Does it use async code, for example, or anything else you can think of?

anze3db commented 1 year ago

The only difference is that statuser runs the longest, at more than an hour, while the other tasks all take less than 20 minutes.

Statuser does use async, but in exactly the same way as the crawler command. I use @async_to_sync in both cases to get around the fact that Django commands are sync-only:

    from asgiref.sync import async_to_sync

    # Main entry point for the Django command:
    def handle(
        self,
        *args,
        offset=0,
        instances=None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
        **options,
    ):
        self.main(
            offset=offset,
            instances=instances,
            skip_inactive_for=skip_inactive_for,
            pre_filter=pre_filter,
        )

    # The async method that does all the work:
    @async_to_sync
    async def main(
        self,
        offset: int,
        instances: str | None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
    ):
         ...

I can try rewriting the code without async to see whether async is the cause of this problem.

anze3db commented 1 year ago

Even without any async code in statuser, the monitor still failed today. 😔

antonpirker commented 1 year ago

Dang. And the monitor is marked as "Timed out", right? One thing you could check is whether the monitor's status, a bit over an hour into processing, shows as in progress or timed out. Sentry should only set it to timed out after the full 5 hours, so if it is marked timed out before then, something strange is going on.
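
To capture exact timings, the check-ins could also be sent manually instead of via the monitor wrapper, so the check-in id and timestamps get logged alongside the job's own output. A minimal sketch using the SDK's capture_checkin API (slug taken from this thread; the surrounding run_statuser method is hypothetical):

    from django.core import management
    from sentry_sdk.crons import capture_checkin, MonitorStatus

    def run_statuser(self):
        # Open the check-in explicitly and keep its id so we can close it later.
        check_in_id = capture_checkin(
            monitor_slug="daily-sync-statuser",
            status=MonitorStatus.IN_PROGRESS,
        )
        self.console.print(f"Opened check-in {check_in_id}")
        try:
            management.call_command("statuser")
        except Exception:
            # Close the same check-in as failed, then re-raise.
            capture_checkin(
                monitor_slug="daily-sync-statuser",
                check_in_id=check_in_id,
                status=MonitorStatus.ERROR,
            )
            raise
        # Close the same check-in as successful.
        capture_checkin(
            monitor_slug="daily-sync-statuser",
            check_in_id=check_in_id,
            status=MonitorStatus.OK,
        )
        self.console.print(f"Closed check-in {check_in_id}")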

anze3db commented 1 year ago

> And the monitor is marked as "Timed out", right?

Correct.

> One thing you could check is whether the monitor's status, a bit over an hour into processing, shows as in progress or timed out.

I've just done that. It is in the in progress state while it's processing, and it stays in the in progress state even after the job finishes:

[Screenshot 2023-10-12 at 13 22 26]

antonpirker commented 1 year ago

So it stays "in progress" for the whole 5 hours and then is set to "failed"? Is this correct?

anze3db commented 1 year ago

Yes, correct.

antonpirker commented 1 year ago

OK, thanks for confirming. I still don't know what is going wrong. I have asked the team handling cron check-ins in our backend whether they have any ideas...

antonpirker commented 11 months ago

Just wanted to check in: if you try with the newest version of the SDK, does this still happen?

anze3db commented 11 months ago

Yesterday I re-enabled the monitor that was timing out before, and today it looks like it passed:

[Screenshot 2023-12-16 at 10 40 32]

Looks like the issue has been resolved, so I'm closing this, but I'll reopen if I see any more failures. Thank you for taking care of this!

anze3db commented 11 months ago

Oh no, I am reopening the issue because it has just happened again:

[Screenshot 2023-12-18 at 13 19 04]

I see from my logs that there were no errors and that the statuser job finished after about 30 minutes:

Dec 18 01:05:08 raspberrypi bash[758000]: Running optimizer
Dec 18 01:05:08 raspberrypi bash[758000]: Running statuser
Dec 18 01:05:15 raspberrypi bash[758000]: Batch 0 done, sleeping for 90s
Dec 18 01:06:55 raspberrypi bash[758000]: Batch 1 done, sleeping for 90s
Dec 18 01:08:31 raspberrypi bash[758000]: Batch 2 done, sleeping for 90s
Dec 18 01:10:47 raspberrypi bash[758000]: Batch 3 done, sleeping for 90s
Dec 18 01:12:26 raspberrypi bash[758000]: Batch 4 done, sleeping for 90s
Dec 18 01:14:08 raspberrypi bash[758000]: Batch 5 done, sleeping for 90s
Dec 18 01:15:45 raspberrypi bash[758000]: Batch 6 done, sleeping for 90s
Dec 18 01:17:22 raspberrypi bash[758000]: Batch 7 done, sleeping for 90s
Dec 18 01:18:58 raspberrypi bash[758000]: Batch 8 done, sleeping for 90s
Dec 18 01:20:35 raspberrypi bash[758000]: Batch 9 done, sleeping for 90s
Dec 18 01:22:12 raspberrypi bash[758000]: Batch 10 done, sleeping for 90s
Dec 18 01:23:48 raspberrypi bash[758000]: Batch 11 done, sleeping for 90s
Dec 18 01:25:25 raspberrypi bash[758000]: Batch 12 done, sleeping for 90s
Dec 18 01:27:02 raspberrypi bash[758000]: Batch 13 done, sleeping for 90s
Dec 18 01:28:39 raspberrypi bash[758000]: Batch 14 done, sleeping for 90s
Dec 18 01:30:16 raspberrypi bash[758000]: Batch 15 done, sleeping for 90s
Dec 18 01:31:53 raspberrypi bash[758000]: Batch 16 done, sleeping for 90s
Dec 18 01:33:30 raspberrypi bash[758000]: Batch 17 done, sleeping for 90s
Dec 18 01:35:07 raspberrypi bash[758000]: Batch 18 done, sleeping for 90s
Dec 18 01:36:41 raspberrypi bash[758000]: Batch 19 done, sleeping for 90s
Dec 18 01:38:11 raspberrypi bash[758000]: Running dailystats
Dec 18 01:38:11 raspberrypi bash[758000]: Gathering daily stats
Dec 18 01:38:22 raspberrypi bash[758000]: All done! 🎉