GetDKAN / dkan

DKAN Open Data Portal
https://dkan.readthedocs.io/en/latest/index.html
GNU General Public License v2.0

Converting runs from old harvest_ID_runs to harvest_runs fails if duplicate time stamps #4287

Open swirtSJW opened 2 months ago

swirtSJW commented 2 months ago

Current Behavior

When updating to DKAN 2.19 from an earlier version, if rows in two different harvest_ID_runs tables happen to have the same timestamp, an SQL error is thrown because the timestamp is treated as the unique identifier in the new harvest_runs table. This is unlikely, since the collision window is only one second, but it is possible to encounter in the wild.
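A site can check for collisions before updating. This is only a sketch: the legacy table names below are hypothetical placeholders for your site's actual harvest_ID_runs tables, and the timestamp column is assumed to be named id, matching the error log below.

```sql
-- Sketch only: substitute your site's real harvest_<plan>_runs tables.
-- The timestamp column is assumed to be named `id`.
SELECT id, COUNT(*) AS n
FROM (
  SELECT id FROM harvest_home_health__data_runs
  UNION ALL
  SELECT id FROM harvest_some_other_plan_runs
) AS all_runs
GROUP BY id
HAVING n > 1;
```

Any row returned here would trigger the duplicate-entry error during harvest_update_8008.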

> >  [notice] Converting runs for home_health__data
> >  [error]  Drupal\Core\Database\IntegrityConstraintViolationException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1599570670' for key 'PRIMARY': INSERT INTO "harvest_runs" ("id", "harvest_plan_id", "data", "extract_status") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3); Array
> > (
> >     [:db_insert_placeholder_0] => 1599570670
> >     [:db_insert_placeholder_1] => home_health__data
> >     [:db_insert_placeholder_2] => {"plan":"{\"identifier\":\"home_health__data\",\"extract\":{\"type\":\"\\\\Drupal\\\\pqdc\\\\Harvest\\\\ETL\\\\Extract\\\\DataJson\",\"uri\":\"file:\\\/\\\/\\\/mnt\\\/tmp\\\/data.json\"},\"transforms\":[],\"load\":{\"type\":\"\\\\Drupal\\\\harvest\\\\Load\\\\Dataset\"}}","status":[],"errors":{"extract":"Error decoding JSON."}}
> >     [:db_insert_placeholder_3] => FAILURE
> > )
> >  in Drupal\mysql\Driver\Database\mysql\ExceptionHandler->handleExecutionException() (line 45 of /var/www/html/docroot/core/modules/mysql/src/Driver/Database/mysql/ExceptionHandler.php). 
> >  [error]  SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1599570670' for key 'PRIMARY': INSERT INTO "harvest_runs" ("id", "harvest_plan_id", "data", "extract_status") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3); Array
> > (
> >     [:db_insert_placeholder_0] => 1599570670
> >     [:db_insert_placeholder_1] => home_health__data
> >     [:db_insert_placeholder_2] => {"plan":"{\"identifier\":\"home_health__data\",\"extract\":{\"type\":\"\\\\Drupal\\\\pqdc\\\\Harvest\\\\ETL\\\\Extract\\\\DataJson\",\"uri\":\"file:\\\/\\\/\\\/mnt\\\/tmp\\\/data.json\"},\"transforms\":[],\"load\":{\"type\":\"\\\\Drupal\\\\harvest\\\\Load\\\\Dataset\"}}","status":[],"errors":{"extract":"Error decoding JSON."}}
> >     [:db_insert_placeholder_3] => FAILURE
> > )
> >  
> >  [error]  Update failed: harvest_update_8008 
> >  [notice] Update started: metastore_update_8009
> >  [notice] Updated 0 dictionaries. If you have overridden DKAN's core schemas,
> >     you must update your site's data dictionary schema after this update. Copy
> >     modules/contrib/dkan/schema/collections/data-dictionary.json over you local
> >     site version before attempting to read or write any data dictionaries.
> >  [notice] Update completed: metastore_update_8009
> >  [notice] Update started: metastore_admin_update_8012
> >  [notice] Update completed: metastore_admin_update_8012
>  [error]  Update aborted by: harvest_update_8008 
>  [error]  Finished performing updates. 

Expected Behavior

The migration of data from one table to another should happen without error.

Steps To Reproduce

  1. Have a data setup where two different harvest_ID_runs tables contain rows with the same timestamp.
  2. Run update.php, drush updb, or drush dkan:harvest:update.
  3. See errors.

Relevant log output (optional)

No response

Anything else?

This may be too unlikely a scenario to justify adding a try/catch block to HarvestUtility::convertRunTable(), but I will at least provide a drush sqlc command to undo any duplicated IDs.

Discussion in CA Slack

swirtSJW commented 2 months ago

Is there any issue with just incrementing the timestamp by 1 second (hoping that would make it unique)? Or is that timestamp used as a unique ID elsewhere in the system?

swirtSJW commented 2 months ago

It turns out that bumping the timestamp would likely disconnect the harvest run from everything that references it. This also means it cannot be addressed with a try/catch.

The new hope is that some variation of this might work:

HarvestRunRepository::loadEntity() treats the id AND the harvest_plan_id as a combined key when looking up the harvest run entity.

So if the id were not the primary key for the table on its own, there would be no issue with the id needing to be unique. The only risk would be if two harvest runs from the same plan took place in the same second.
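As a sketch of that idea (column names taken from the error log above; in practice this would go through a Drupal entity schema update rather than hand-written DDL):

```sql
-- Sketch: make the primary key composite so the timestamp only has to
-- be unique per harvest plan, not globally.
ALTER TABLE harvest_runs
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, harvest_plan_id);
```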

paul-m commented 3 weeks ago

I think the easiest solution to implement would be to add another column for the actual ID (maybe a UUID) and use that as the unique key. Leave everything else in place, and provide an update path to the new entity schema for both the old-style harvest_id_run tables and the newer entity tables.
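In raw DDL terms that suggestion might look like the sketch below. The column name uuid and the index name are hypothetical, and a real fix would be an entity schema change plus an update hook rather than manual SQL.

```sql
-- Sketch: add a surrogate UUID column as the real primary key and keep
-- the timestamp (`id`) as an ordinary indexed column.
ALTER TABLE harvest_runs
  DROP PRIMARY KEY,
  ADD COLUMN uuid VARCHAR(128) NOT NULL,
  ADD PRIMARY KEY (uuid),
  ADD INDEX harvest_runs__id (id);
```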