Shopify / maintenance_tasks

A Rails engine for queueing and managing data migrations.
MIT License
1.05k stars 76 forks source link

Ensure the cursor updates and the task resumes from the point of termination when Sidekiq stops the task midway. #1047

Closed hatsu38 closed 4 months ago

hatsu38 commented 4 months ago

Summary

I am running MaintenanceTasks with Sidekiq. But Sidekiq sometimes crashes unexpectedly.

This issue aims to ensure that tasks can resume from the cursor position at the time of termination, even when Sidekiq or other processes are abruptly terminated with SIGKILL.

Details

Here is an example of the data stored in the maintenance_tasks_runs table:

MaintenanceTasks::Run.last
=> #<MaintenanceTasks::Run:0x000000013e45ed88
 id: 9,
 task_name: "Maintenance::ForDiveIntoCodeTask",
 started_at: Sun, 07 Jul 2024 23:54:53.000000000 JST +09:00,
 ended_at: nil,
 time_running: 233.019,
 tick_count: 5,
 tick_total: 148,
 job_id: "17d2859b-f3f0-4a8b-8e20-6c94292cd11a",
 cursor: 12,
 status: "running",
 error_class: nil,
 error_message: nil,
 backtrace: nil,
 created_at: Sun, 07 Jul 2024 23:54:26.317306000 JST +09:00,
 updated_at: Mon, 08 Jul 2024 00:02:37.584857000 JST +09:00,
 arguments: {"dryrun"=>"1"},
 lock_version: 9,
 metadata: nil>

Current Behavior

The cursor is used to indicate the position from which a task should be resumed after being interrupted. However, the current implementation relies on the shutdown callback to update the cursor.

If Sidekiq is forcibly terminated or an exception occurs, this callback does not execute, and the cursor remains outdated. Consequently, the task resumes from an old cursor position instead of the latest position.

Requested Change

I thought I could solve this problem by executing @run.cursor = cursor_position within the each_iteration method.

etiennebarrie commented 4 months ago

This is expected, as mentioned in the README:

Idempotency of Task#process: it should be safe to run process multiple times for the same element of the collection. Read more in this Sidekiq best practice. It’s important if the Task errors and you run it again, because the same element that caused the Task to give an error may well be processed again. It especially matters in the situation described above, when the iteration duration exceeds the timeout: if the job is re-enqueued, multiple elements may be processed again. https://github.com/Shopify/maintenance_tasks/blob/v2.7.1/README.md#considerations-when-writing-tasks

But really you should investigate why your Sidekiq processes are forcibly terminated. Typically you want to send SIGTERM and have a grace period longer than the time to process a single element. That way when the element is processed, the job notices the worker is shutting down, and stops iterating, recording the cursor, and re-enqueues itself. When a worker picks up the job, the task is resumed exactly where it stopped.