Automattic / Cron-Control

A fresh take on running WordPress's cron system, allowing parallel processing
https://wordpress.org/plugins/wp-cron-control/
GNU General Public License v2.0

Task: Re-examine the event locks implementation #233

Open WPprodigy opened 2 years ago

WPprodigy commented 2 years ago

1) Prevent OOM requests from perpetually consuming a lock slot.

Problem example: There are 10 lock slots for concurrent event running. Event A OOMs and the process is killed higher up the chain than at the request level. The plugin never gets to free up the lock, so the site-level lock is now perpetually stuck with just 9 slots. This slowly gets worse if OOMing events keep running every once in a while. Eventually, perhaps, all 10 slots would be deadlocked, at which point the cache key would finally expire since it is no longer being updated. But that can take a while, and cron becomes very backed up in the process.
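For illustration, here is a minimal sketch of that failure mode with a single counter-style lock. The key, group, limit, and run_event() helper are made up for the example, not the plugin's actual implementation:

$count = (int) wp_cache_get( 'site_lock', 'cron-control' );
if ( $count >= 10 ) {
  return; // all slots appear busy
}
wp_cache_incr( 'site_lock', 1, 'cron-control' ); // claim a slot
run_event( $event );                             // if this OOMs, the process dies here...
wp_cache_decr( 'site_lock', 1, 'cron-control' ); // ...so the slot is never given back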

Possible solution: Mentioned here, instead of the lock being one cache key with incr/decr, each lock slot could be its own key with its own timestamp. This way we can free up locks individually when problems happen.
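A rough sketch of what that could look like, leaning on wp_cache_add()/wp_cache_delete(); the slot count, TTL, cache group, and function names are all illustrative:

// Claim the first free slot; each slot is its own cache key with its own timestamp.
function acquire_lock_slot( $slots = 10, $ttl = 15 * MINUTE_IN_SECONDS ) {
  for ( $i = 0; $i < $slots; $i++ ) {
    // wp_cache_add() only succeeds if the key doesn't exist yet, so a slot is
    // claimed atomically and expires on its own after $ttl if it is never freed.
    // The stored timestamp also lets a cleanup routine delete stale slots individually.
    if ( wp_cache_add( "lock_slot_{$i}", time(), 'cron-control-locks', $ttl ) ) {
      return $i;
    }
  }
  return false; // every slot is busy (or stuck, but only until its own TTL lapses)
}

function release_lock_slot( $slot ) {
  wp_cache_delete( "lock_slot_{$slot}", 'cron-control-locks' );
}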

Another possible solution: Use a DB table for locks instead - https://github.com/Automattic/Cron-Control/issues/84

Another possible solution: Have the Go runner catch a kill signal and send a follow-up cron-control request to clean up the event lock in those cases. The downside is that it's an extra request and only solves this for the Go runner, but it is at least a way to be notified when this happens (rather than relying on a cache timeout that may or may not accurately represent how long an event can successfully run).

2) Increase lock defaults

3) Drop locks completely?

Following the chain of thought from the point above, maybe we just drop the site-level concurrency locks completely. Event-level concurrency would then default to "unlimited", but we would keep a simplified implementation for events where concurrency is specifically disallowed.

WPprodigy commented 2 years ago

An additional problem worth noting / solving:

4) The same exact event can be run multiple times concurrently.

At the moment we complete/reschedule an event right before running it. This is what core WP does, but the problem is that core WP doesn't do concurrent event running. As a result, the same exact event can be run concurrently multiple times. Two examples:

Example 1: Say an event has a 60-second recurring schedule but can run for up to 2 minutes. Cron Control will pick up the event, reschedule it for +60 seconds, and start processing it. A minute later the event is due again, but the first run is still going. Cron Control will pick it up and run it as well. Now we have the same event running twice.

Example 2: The application repeatedly schedules a single event to run if it's not already scheduled with something like this:

if ( ! wp_next_scheduled( self::CRON_JOB_NAME, [ $process_id ] ) ) {
  wp_schedule_single_event( $now, self::CRON_JOB_NAME, [ $process_id ] );
}

At the moment, wp_next_scheduled() returns false even while the event is still processing, so the application ends up scheduling a duplicate alongside the run that is already in flight.

Solution: Set events to the running status when they are actively running. This will both prevent list-due-batch from returning events that are still processing, and will help prevent events from being registered when their twin is still running.

WPprodigy commented 2 years ago

Did a lot of testing and thinking about all the above. And I think we can solve pretty much all of the problems by setting events to the running status during processing. Here is what I'm thinking:

Mark events as "running" during event processing

Pretty straightforward idea. When an event is picked up to be run, we can do a query like this: UPDATE table SET status = 'running' WHERE ID = 123 AND status = 'pending'.
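A minimal sketch of that claim with $wpdb; the status = 'pending' condition is what makes it atomic, so only one worker can move a given event into 'running'. The $table and $event_id names here are placeholders:

global $wpdb;

$claimed = $wpdb->query(
  $wpdb->prepare(
    "UPDATE {$table} SET status = 'running' WHERE ID = %d AND status = 'pending'",
    $event_id
  )
);

if ( 1 !== $claimed ) {
  return; // another worker already claimed (or completed) this event
}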

Event/action-level concurrency locks could be simplified to either allowed or disallowed concurrency.

Instead of whitelisting actions with a number of allowed concurrent runs, just have a boolean to signify "allowed_concurrency" or "disallowed_concurrency". We can keep backwards compatibility by checking the current whitelist filter: if the number is greater than 1, the action has "allowed_concurrency". This keeps the promise that there is no same-action event concurrency by default, but events marked as "concurrent safe" no longer need the lock.
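A sketch of that backwards-compatibility check; the filter name and array shape ( action => max concurrent runs ) are assumptions about how the existing whitelist is exposed:

// Hypothetical helper: maps the old numeric whitelist onto a simple boolean.
function action_allows_concurrency( $action ) {
  $whitelist = apply_filters( 'a8c_cron_control_concurrent_event_whitelist', [] );

  // Backwards compatibility: any limit above 1 now simply means "concurrency allowed".
  return isset( $whitelist[ $action ] ) && (int) $whitelist[ $action ] > 1;
}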

Site-level concurrency can also be simplified (or removed)

With events marked as running when they are processing, we'll just need to query for events that have that status. It will give us the count of events currently processing. So if count( $currently_running_events ) >= SITE_CONCURRENCY_LIMIT, we can just abort.
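Roughly, assuming the same placeholder $table and a SITE_CONCURRENCY_LIMIT constant:

global $wpdb;

// Count events currently marked as running across the site.
$currently_running = (int) $wpdb->get_var(
  "SELECT COUNT(*) FROM {$table} WHERE status = 'running'"
);

if ( $currently_running >= SITE_CONCURRENCY_LIMIT ) {
  return; // the site is already at its concurrency limit, so skip this batch
}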

Ultimately though, I still think we should remove the site-level lock entirely. The runner is already responsible for how many runWorkers are available.