Migration tasks cannot restart when platform nodes crash

iccaprar commented 10 months ago

We have encountered the following situation when running commerce-db-sync on CCv2:

a migration task was in progress
platform node crashed and was restarted by the CCv2 orchestrator
on platform restart, CronJobManager#getRunningOrRestartedCronJobsForNode sets the migration cronjob to ABORTED
we tried to restart the cronjob from Backoffice (but a trigger ends up in the same situation)
the cronjob starts but never does anything, just waits forever for a sync task to complete

On further investigation, we found out that the incremental job implementation is checking if there is an existing running migration and if yes, it just starts waiting for that to finish. But that running migration is not actually running, as the thread running it was long gone with the node restart.

When we run some DB checks, this is the data we find:

SELECT * FROM MIGRATIONTOOLKIT_TABLECOPYSTATUS

migrationId startAt endAt   lastUpdate  total   completed   failed  status
b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7    2023-11-14 12:00:48.0833333     2023-11-14 12:00:50.616 1   0   0   RUNNING

SELECT * FROM MIGRATIONTOOLKIT_TABLECOPYTASKS

targetnodeId    migrationId pipelinename    sourcetablename targettablename columnmap   duration    sourcerowcount  targetrowcount  failure error   published   truncated   lastupdate  avgwriterrowthroughput  avgreaderrowthroughput  copymethod  keycolumns  durationinseconds
13  b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7    sapbwentry->SAPBWENTRY  sapbwentry  SAPBWENTRY  {}      9183    0   0       0   0   2023-11-14 12:00:50.616 0.00    0.00            0.00

We have to go and manually change the status in these tables to error:

UPDATE MIGRATIONTOOLKIT_TABLECOPYTASKS SET failure=1 WHERE migrationid='b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7';

UPDATE MIGRATIONTOOLKIT_TABLECOPYSTATUS set status='ABORTED' where migrationid='b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7';

Then when we start the cronjob, it does not find a running task, it creates a new on and works correctly.

iccaprar commented 10 months ago

We implemented a local fix for this, which we can submit to the repo:

we listen for AfterCronJobCrashAbortEvent that is sent by the CronJobManager
when we get this event, we look for running tasks and mark them as aborted like above

Since we have a single DB migration possible, it works for us.

An even better approach would be to add on the MigrationCronJob a new attribute with the migrationId, so when we get the event that the cronjob was aborted, we can cancel only the needed migration task.

lnowakowski commented 9 months ago

We've recently fixed some newly discovered issues regarding resuming of failed migration (#14), main issue was due to logic fetching pending failed copy tasks, with condition of cluster node ID, which was resulting invalid or even empty results, especially after node restart (if there are no fixed cluster ID assigned).

If you already have something implemented to handle cronjob restart case, please either create a PR here, or point me to your fork (if possible) where you added such change.

Adding migration ID to cronjob model data might be obviously quite useful in multiple cases, starting with proper abort/failure handling. We could also attach migration report log to cronjob execution log, or at least reference from job, to report download location via Backoffice UI component or something similar.

SAP / sap-commerce-db-sync

Migration tasks cannot restart when platform nodes crash #13