PostHog / posthog

🦔 PostHog provides open-source product analytics, session recording, feature flagging and A/B testing that you can self-host.
https://posthog.com

Improve scheduled tasks reliability #12077

Closed yakkomajuri closed 1 year ago

yakkomajuri commented 1 year ago

@hazzadous and I are tackling the robustness of async handlers this sprint. That includes jobs (see #11784) and scheduled tasks.

When it comes to scheduled tasks, @hazzadous and I had a chat and came to a few conclusions.

First, we will do #12076. This should already cover a key problem we have right now (#11982).

However, beyond that, we can significantly improve reliability of this service with little effort by using the Graphile worker.

To get a better sense of the problem, @hazzadous wrote up a great description:

> Scheduler
>
> Drivers
>
> The scheduler runs tasks periodically: every day, hour, minute. It is obvious when we fail to run the every-day and every-hour tasks, and this does happen with the current setup.
>
> Requirements
>
> 1. Run tasks every minute, hour, day without missing runs. E.g. it doesn't matter if an every-day task runs 10 minutes late, or an every-hour task runs 5 minutes late, just that it has run.
>
> Constraints
>
> 1. We can't guarantee that a process will be up at the time a job is meant to run. Crashes will happen.
> 2. Even if a task does manage to start, it may fail to complete successfully.
> 3. Dependencies have capacity constraints. If we end up, say, running all "run every minute" tasks from the last day, we need to ensure that we do not cause outages.
> 4. Processes take some time to spin up, possibly, say, 5 minutes.
>
> Design
>
> Given that we can't rely on any process being up at the specific time that a cron task is meant to run, once a process is eventually up we need to be able to deduce whether a task should have been dispatched while the process was down. As such, we need to persist the last run of each job, much as anacron saves the time of the last run.
>
> Given that processes take some time to spin up, it seems wise to increase the availability of the scheduler. To do so I would suggest we go with running multiple runners. For the availability of the database in which the previous run is stored, just PG will probably be ok (tm).
>
> To protect dependencies, I suggest that we limit the number of "missed" task runs to 1, e.g. if we happen to have not run 5 "runEveryMinute" tasks due to the process being down, then we just run 1 task associated with this.
>
> For failures, we can implement retry logic (exponential backoff or similar).
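To make the design above concrete, here is a minimal TypeScript sketch of the two pieces of logic it describes: deciding how many runs to dispatch after downtime (capped at 1) and computing an exponential backoff delay for retries. Every name here (`decideCatchUp`, `retryDelayMs`, `INTERVAL_MS`, the default base and maximum delays) is hypothetical and chosen for illustration; this is not the existing scheduler code, and it assumes the last run time has already been read from wherever it is persisted.

```typescript
// Illustrative sketch only: all names and default values are assumptions.
type Interval = 'minute' | 'hour' | 'day'

const INTERVAL_MS: Record<Interval, number> = {
    minute: 60 * 1000,
    hour: 60 * 60 * 1000,
    day: 24 * 60 * 60 * 1000,
}

interface CatchUpDecision {
    missedRuns: number // full intervals elapsed since the last persisted run
    runsToDispatch: number // capped at 1 to protect downstream dependencies
}

// Given the persisted time of the last run, work out whether a run is overdue.
// However many runs were missed while the process was down, dispatch at most one.
function decideCatchUp(lastRunAtMs: number, nowMs: number, interval: Interval): CatchUpDecision {
    const elapsedMs = Math.max(0, nowMs - lastRunAtMs)
    const missedRuns = Math.floor(elapsedMs / INTERVAL_MS[interval])
    return { missedRuns, runsToDispatch: Math.min(missedRuns, 1) }
}

// Exponential backoff: the delay doubles with each attempt (1-based), up to maxMs.
function retryDelayMs(attempt: number, baseMs = 1000, maxMs = 5 * 60 * 1000): number {
    return Math.min(maxMs, baseMs * 2 ** (attempt - 1))
}
```

For example, if the process was down for a little over five minutes, `decideCatchUp(lastRun, now, 'minute')` reports 5 missed runs but asks for only 1 to be dispatched, which matches the "limit missed runs to 1" rule in the description.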

To make our scheduled tasks service more reliable, we'll be using Graphile Worker's crontab functionality.

This lets us reuse a service that we already have in place, and that has worked well, to get better performance and robustness here.
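As a rough idea of what this could look like, the sketch below builds a crontab string in the format Graphile Worker reads (`schedule task ?options`), using its `fill` (backfill window) and `max` (maximum attempts) options. The task identifiers `runEveryHour` and `runEveryDay`, the `ScheduledTask` type, the `toCrontabLine` helper and the specific option values are assumptions for illustration, not what we will necessarily ship. Short fill windows are one way to approximate the "at most one missed run" rule; the exact backfill behaviour should be checked against the Graphile Worker docs.

```typescript
// Illustrative sketch only: task names and option values are assumptions.
interface ScheduledTask {
    schedule: string // five-field cron expression
    task: string // task identifier registered with the worker
    fill?: string // backfill window for runs missed while no worker was up, e.g. '1h'
    max?: number // maximum number of attempts for each job
}

// Render one crontab line: "<schedule> <task> ?<opt>=<value>&..."
function toCrontabLine({ schedule, task, fill, max }: ScheduledTask): string {
    const opts: string[] = []
    if (fill !== undefined) opts.push(`fill=${fill}`)
    if (max !== undefined) opts.push(`max=${max}`)
    const base = `${schedule} ${task}`
    return opts.length > 0 ? `${base} ?${opts.join('&')}` : base
}

const scheduledTasks: ScheduledTask[] = [
    { schedule: '* * * * *', task: 'runEveryMinute', max: 3 }, // no backfill for minutely runs
    { schedule: '0 * * * *', task: 'runEveryHour', fill: '1h', max: 3 },
    { schedule: '0 0 * * *', task: 'runEveryDay', fill: '1d', max: 3 },
]

// The resulting string could be supplied to the worker as its crontab.
const crontab = scheduledTasks.map(toCrontabLine).join('\n')
```

Since the known cron items and their last execution are tracked in Postgres by the worker, several runners can share the same crontab, which is what addresses the availability concern in the description.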

Using Graphile Worker's distributed crontab functionality, we'll be able to:

yakkomajuri commented 1 year ago

done