Closed: iccaprar closed this issue 4 months ago.
We implemented a local fix for this, which we can submit to the repo: a listener for the `AfterCronJobCrashAbortEvent` that is sent by the `CronJobManager`, which aborts the running migration tasks. Since only a single DB migration can run at a time, this works for us.
An even better approach would be to add a new attribute with the migrationId on the `MigrationCronJob`, so that when we get the event that the cronjob was aborted, we can cancel only the needed migration task. A sketch of both variants follows below.
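A minimal sketch of such a listener, following the standard platform `AbstractEventListener` pattern and assuming the event class lives in the usual platform events package; the `MigrationAbortService` helper and the `migrationId` attribute are assumptions for illustration, not existing commerce-db-sync API:

```java
import de.hybris.platform.servicelayer.event.events.AfterCronJobCrashAbortEvent;
import de.hybris.platform.servicelayer.event.impl.AbstractEventListener;

public class MigrationCrashAbortListener extends AbstractEventListener<AfterCronJobCrashAbortEvent> {

    /** Hypothetical helper that cancels migration copy tasks. */
    interface MigrationAbortService {
        void abortAllRunningMigrations();
        void abortMigration(String migrationId);
    }

    private MigrationAbortService migrationAbortService;

    @Override
    protected void onEvent(final AfterCronJobCrashAbortEvent event) {
        // Variant 1 (our local fix): only one migration can run at a time,
        // so abort whatever migration tasks are still marked as running.
        migrationAbortService.abortAllRunningMigrations();

        // Variant 2 (suggested improvement): if MigrationCronJob carried a
        // migrationId attribute, we could resolve the aborted cronjob from
        // the event and cancel only that migration's tasks, e.g.:
        //   migrationAbortService.abortMigration(abortedCronJob.getMigrationId());
    }

    public void setMigrationAbortService(final MigrationAbortService service) {
        this.migrationAbortService = service;
    }
}
```

The listener would be registered as a bean in the extension's Spring context, like any other platform event listener.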
We've recently fixed some newly discovered issues regarding resuming a failed migration (#14). The main issue was in the logic that fetches pending failed copy tasks with a condition on the cluster node ID, which was producing invalid or even empty results, especially after a node restart (if no fixed cluster IDs are assigned).
If you already have something implemented to handle the cronjob restart case, please either create a PR here, or point me to your fork (if possible) where you added such a change.
Adding the migration ID to the cronjob model data would obviously be quite useful in multiple cases, starting with proper abort/failure handling. We could also attach the migration report log to the cronjob execution log, or at least reference the report download location from the job via a Backoffice UI component or something similar.
We have encountered the following situation when running commerce-db-sync on CCv2:
After a node restart, `CronJobManager#getRunningOrRestartedCronJobsForNode` sets the migration cronjob to ABORTED.

On further investigation, we found out that the incremental job implementation checks whether there is an existing running migration and, if yes, just starts waiting for it to finish. But that migration is not actually running, as the thread executing it was long gone with the node restart.
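For illustration, the wait behaves roughly like this (a sketch with hypothetical names, not the actual commerce-db-sync source): the status row still claims the migration is running, so nothing ever terminates the loop.

```java
public class StaleMigrationWait {

    /** Hypothetical accessor for the migration status table. */
    interface MigrationStatusRepository {
        String findRunningMigrationId();
        String getStatus(String migrationId);
    }

    void waitForRunningMigration(final MigrationStatusRepository repo) throws InterruptedException {
        final String migrationId = repo.findRunningMigrationId(); // returns the stale row
        while ("running".equals(repo.getStatus(migrationId))) {
            // The worker thread died with the node, so nothing will ever
            // flip this status to finished/error: the loop never terminates.
            Thread.sleep(5_000);
        }
    }
}
```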
When we run some DB checks, we find migration entries still marked as running, even though no migration thread exists anymore.
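The stale rows can be surfaced with a check along these lines (a sketch; the `MIGRATIONTOOLKIT_*` table and column names are assumptions about the commerce-db-sync schema):

```java
import java.util.List;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class MigrationStatusCheck {

    /** Lists migrations still marked as running after the node restart.
     *  Table and column names are assumptions about the schema. */
    public List<Map<String, Object>> findStaleRunningMigrations(final DataSource dataSource) {
        final JdbcTemplate jdbc = new JdbcTemplate(dataSource);
        return jdbc.queryForList(
                "SELECT * FROM MIGRATIONTOOLKIT_TABLECOPYSTATUS WHERE status = 'running'");
    }
}
```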
We have to go in and manually change the status in these tables to an error value before a new migration can start.
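The cleanup we run is along these lines (again a sketch; table, column, and status names are assumptions about the schema):

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class MigrationStatusCleanup {

    /** Flips stale "running" rows to an error status so a fresh migration
     *  can start. Table, column, and status names are assumptions. */
    public void markStaleMigrationsAsError(final DataSource dataSource) {
        final JdbcTemplate jdbc = new JdbcTemplate(dataSource);
        jdbc.update("UPDATE MIGRATIONTOOLKIT_TABLECOPYSTATUS SET status = 'error' WHERE status = 'running'");
        jdbc.update("UPDATE MIGRATIONTOOLKIT_TABLECOPYTASKS SET status = 'error' WHERE status = 'running'");
    }
}
```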
Then, when we start the cronjob, it does not find a running task; it creates a new one and works correctly.