GenomicDataInfrastructure / gdi-userportal-ckanext-fairdatapoint


[gdi - ckanext-fairdatapoint - SPIKE] What is the update strategy applied by the harvesters? #19

Closed brunopacheco1 closed 5 months ago

brunopacheco1 commented 6 months ago

🎯 What? (Story Description)

We need to understand how the harvester handles duplication and updates.

💡 Why? (Justification)

So we don't have surprises or unexpected behaviour.

🔨 Tasks (Breakdown)

✅ Acceptance Criteria

➕ Additional Information

No response

a-nayden commented 6 months ago

During the gather stage, the harvester requests all available resource GUIDs from the source. A GUID is generated as catalog=<link to fdp catalog>;dataset=<link to fdp dataset> for a dataset and as catalog=<link to fdp catalog> for a catalog. The harvester then queries the CKAN database for the GUIDs already harvested from the same source. (This means datasets will be considered different if you configure a harvester source, delete it, and then re-configure it.) The query is:
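As an illustration of the GUID format described above, here is a minimal sketch. The helper name build_guid and the example URLs are hypothetical and are not taken from the harvester code; only the catalog=...;dataset=... layout comes from the description.

```python
from typing import Optional


def build_guid(catalog_url: str, dataset_url: Optional[str] = None) -> str:
    """Build a harvest GUID in the format described in this comment.

    Hypothetical helper: the real harvester may assemble the string differently.
    """
    guid = f"catalog={catalog_url}"
    if dataset_url is not None:
        # Datasets carry both the catalog link and their own link.
        guid += f";dataset={dataset_url}"
    return guid
```

Because the GUID is built from the FDP links, the same dataset reached through the same catalog always yields the same GUID, which is what lets the harvester match it against earlier runs.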

SELECT harvest_object.guid AS harvest_object_guid, harvest_object.package_id AS harvest_object_package_id 
FROM harvest_object 
WHERE harvest_object.current = true AND harvest_object.harvest_source_id = %(harvest_source_id_1)s

where %(harvest_source_id_1)s is the harvester source id of the current job.

Based on these two lists of GUIDs, each harvest object is assigned the status "delete", "new" or "change".

To delete a dataset

On the gather stage, the ids of the datasets to delete are defined as delete = guids_in_db - guids_in_harvest, where guids_in_db are the ids from the harvest_object table for the given source (the result of the query above) and guids_in_harvest are the ids just received from the source.
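The comparison of the two GUID lists can be sketched with plain set operations. Only the "delete" formula is stated in this thread; the formulas for "new" and "change" below are an assumption about the natural counterparts, and the function name classify_guids is hypothetical.

```python
from typing import Dict, Iterable, Set


def classify_guids(
    guids_in_db: Iterable[str], guids_in_harvest: Iterable[str]
) -> Dict[str, Set[str]]:
    """Split GUIDs into delete / new / change buckets (illustrative only)."""
    in_db = set(guids_in_db)
    in_harvest = set(guids_in_harvest)
    return {
        # Stated in the thread: known locally but no longer offered by the source.
        "delete": in_db - in_harvest,
        # Assumption: offered by the source but not yet known locally.
        "new": in_harvest - in_db,
        # Assumption: present on both sides, so a candidate for an update.
        "change": in_db & in_harvest,
    }
```

For example, with guids_in_db = {"a", "b"} and guids_in_harvest = {"b", "c"}, "a" would be deleted, "c" would be new and "b" would be treated as a change.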

On the gather stage, the harvester first sets 'current': False on the datasets to delete, so they persist in the database but are not shown. The harvester then actually deletes those datasets during the import stage by calling toolkit.get_action('package_delete')(context, {ID: harvest_object.package_id}).
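The two-step deletion can be pictured with a small in-memory simulation. This is a simplified sketch and not the harvester's code: the HarvestObject dataclass, the function names, and the package_delete callback are hypothetical stand-ins for the CKAN model and toolkit.get_action('package_delete'), and ID = "id" is an assumption about the constant used in the call above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set

ID = "id"  # Assumption: the constant used as the data_dict key.


@dataclass
class HarvestObject:
    """Simplified stand-in for a row of the harvest_object table."""
    guid: str
    package_id: str
    current: bool = True
    status: str = ""


def gather_deletions(
    objects_in_db: Iterable[HarvestObject], guids_to_delete: Set[str]
) -> List[HarvestObject]:
    """Gather stage: hide the objects and flag them for deletion."""
    marked = []
    for obj in objects_in_db:
        if obj.guid in guids_to_delete:
            obj.current = False  # Still in the database, but no longer shown.
            obj.status = "delete"
            marked.append(obj)
    return marked


def import_deletions(
    marked: Iterable[HarvestObject],
    package_delete: Callable[[Dict[str, str]], None],
) -> None:
    """Import stage: actually delete the datasets flagged during gather."""
    for obj in marked:
        if obj.status == "delete":
            package_delete({ID: obj.package_id})
```

In a real CKAN instance the callback would be the package_delete action called with a context; here any function that accepts a dict works, which keeps the sketch runnable without CKAN.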

Caveat:

a-nayden commented 6 months ago

Also, as per the documentation, it is possible to avoid updating certain fields: https://github.com/ckan/ckanext-harvest?tab=readme-ov-file#avoid-overwriting-certain-fields-optional