GetDKAN / dkan

DKAN Open Data Portal
https://dkan.readthedocs.io/en/latest/index.html
GNU General Public License v2.0

Converting runs from old harvest_ID_runs to harvest_runs fails if duplicate time stamps #4287

Open swirtSJW opened 2 months ago

swirtSJW commented 2 months ago

Current Behavior

When updating to DKAN 2.19 from an earlier version, if rows in two different harvest_ID_runs tables happen to have the same timestamp, an SQL error is thrown because the timestamp is treated as the unique identifier in the new harvest_runs table. This is unlikely, since the collision window is only one second, but it is possible to encounter in the wild.
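A site can check for collisions before updating. This is only a sketch: the legacy table names below are hypothetical placeholders for your site's actual harvest_ID_runs tables, and the timestamp column is assumed to be named id, matching the error log below.

```sql
-- Sketch only: substitute your site's real harvest_<plan>_runs tables.
-- The timestamp column is assumed to be named `id`.
SELECT id, COUNT(*) AS n
FROM (
  SELECT id FROM harvest_home_health__data_runs
  UNION ALL
  SELECT id FROM harvest_some_other_plan_runs
) AS all_runs
GROUP BY id
HAVING n > 1;
```

Any row returned here would trigger the duplicate-entry error during harvest_update_8008.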

> >  [notice] Converting runs for home_health__data
> >  [error]  Drupal\Core\Database\IntegrityConstraintViolationException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1599570670' for key 'PRIMARY': INSERT INTO "harvest_runs" ("id", "harvest_plan_id", "data", "extract_status") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3); Array
> > (
> >     [:db_insert_placeholder_0] => 1599570670
> >     [:db_insert_placeholder_1] => home_health__data
> >     [:db_insert_placeholder_2] => {"plan":"{\"identifier\":\"home_health__data\",\"extract\":{\"type\":\"\\\\Drupal\\\\pqdc\\\\Harvest\\\\ETL\\\\Extract\\\\DataJson\",\"uri\":\"file:\\\/\\\/\\\/mnt\\\/tmp\\\/data.json\"},\"transforms\":[],\"load\":{\"type\":\"\\\\Drupal\\\\harvest\\\\Load\\\\Dataset\"}}","status":[],"errors":{"extract":"Error decoding JSON."}}
> >     [:db_insert_placeholder_3] => FAILURE
> > )
> >  in Drupal\mysql\Driver\Database\mysql\ExceptionHandler->handleExecutionException() (line 45 of /var/www/html/docroot/core/modules/mysql/src/Driver/Database/mysql/ExceptionHandler.php). 
> >  [error]  SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1599570670' for key 'PRIMARY': INSERT INTO "harvest_runs" ("id", "harvest_plan_id", "data", "extract_status") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3); Array
> > (
> >     [:db_insert_placeholder_0] => 1599570670
> >     [:db_insert_placeholder_1] => home_health__data
> >     [:db_insert_placeholder_2] => {"plan":"{\"identifier\":\"home_health__data\",\"extract\":{\"type\":\"\\\\Drupal\\\\pqdc\\\\Harvest\\\\ETL\\\\Extract\\\\DataJson\",\"uri\":\"file:\\\/\\\/\\\/mnt\\\/tmp\\\/data.json\"},\"transforms\":[],\"load\":{\"type\":\"\\\\Drupal\\\\harvest\\\\Load\\\\Dataset\"}}","status":[],"errors":{"extract":"Error decoding JSON."}}
> >     [:db_insert_placeholder_3] => FAILURE
> > )
> >  
> >  [error]  Update failed: harvest_update_8008 
> >  [notice] Update started: metastore_update_8009
> >  [notice] Updated 0 dictionaries. If you have overridden DKAN's core schemas,
> >     you must update your site's data dictionary schema after this update. Copy
> >     modules/contrib/dkan/schema/collections/data-dictionary.json over you local
> >     site version before attempting to read or write any data dictionaries.
> >  [notice] Update completed: metastore_update_8009
> >  [notice] Update started: metastore_admin_update_8012
> >  [notice] Update completed: metastore_admin_update_8012
>  [error]  Update aborted by: harvest_update_8008 
>  [error]  Finished performing updates. 

Expected Behavior

The migration of data from one table to another should happen without error.

Steps To Reproduce

  1. Have a data setup where two different harvest_ID_runs tables contain rows with the same timestamp.
  2. Run update.php, drush updb, or drush dkan:harvest:update.
  3. See errors.

Relevant log output (optional)

No response

Anything else?

This may be too unlikely a scenario to justify adding a try/catch block to HarvestUtility::convertRunTable(), but I will at least provide a drush sqlc command to undo any duplicated IDs.

Discussion in CA Slack

swirtSJW commented 2 months ago

Is there any issue with just incrementing the timestamp by 1 second (hoping that would make it unique)? Or is that timestamp used as a unique ID elsewhere in the system?

swirtSJW commented 2 months ago

It turns out that bumping the timestamp would likely disconnect the harvest run from everything that references it. This also means it cannot be addressed with a try/catch.

The new hope is that some variation of this might work:

HarvestRunRepository::loadEntity() treats the id AND the harvest_plan_id as a combined key when looking up the harvest run entity.

So if the id were not the primary key for the table on its own, there would be no issue with the id needing to be unique. The only risk would be if two harvest runs from the same plan took place in the same second.
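As a sketch of that idea (column names taken from the error log above; in practice this would go through a Drupal entity schema update rather than hand-written DDL):

```sql
-- Sketch: make the primary key composite so the timestamp only has to
-- be unique per harvest plan, not globally.
ALTER TABLE harvest_runs
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, harvest_plan_id);
```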

paul-m commented 3 weeks ago

I think the easiest solution to implement would be to add another column for the actual ID (maybe a UUID) and use that as the unique key. Leave everything else in place, and provide an update path to the new entity schema for both the old-style harvest_id_run tables and the newer entity tables.
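In raw DDL terms that suggestion might look like the sketch below. The column name uuid and the index name are hypothetical, and a real fix would be an entity schema change plus an update hook rather than manual SQL.

```sql
-- Sketch: add a surrogate UUID column as the real primary key and keep
-- the timestamp (`id`) as an ordinary indexed column.
ALTER TABLE harvest_runs
  DROP PRIMARY KEY,
  ADD COLUMN uuid VARCHAR(128) NOT NULL,
  ADD PRIMARY KEY (uuid),
  ADD INDEX harvest_runs__id (id);
```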