[gdi - ckanext-fairdatapoint - SPIKE] What is the update strategy applied by the harvesters?

brunopacheco1 commented 6 months ago

🎯 What? (Story Description)

We need to understand how the harvester handles duplication and updates.

💡 Why? (Justification)

So we don't have surprises or unexpected behaviour.

🔨 Tasks (Breakdown)

Understand how the harvester handles duplication and what update strategy it applies.
Document in gdi-userportal-docs.
Create stories for fixing or improving, if needed.

✅ Acceptance Criteria

Are the findings documented?
Are there new stories to work on gaps or bugs?

➕ Additional Information

No response

a-nayden commented 6 months ago

During the gather stage, the harvester requests all available resource guids from the source. Guids are generated as catalog=<link to fdp catalog>;dataset=<link to fdp dataset> for a dataset and catalog=<link to fdp catalog> for a catalog. It then queries the CKAN database for guids already harvested from the same source (this means that datasets will be considered different if you configure a harvester source, delete it and then re-configure it). The query is:

SELECT harvest_object.guid AS harvest_object_guid, harvest_object.package_id AS harvest_object_package_id 
FROM harvest_object 
WHERE harvest_object.current = true AND harvest_object.harvest_source_id = %(harvest_source_id_1)s

where %(harvest_source_id_1)s is the harvester source id of the current job.
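To make the effect of the source filter concrete, here is a minimal sketch that runs an equivalent query against an in-memory SQLite table. The table is a toy stand-in with only the four columns the query touches (the real harvest_object table has more), the sample rows and source ids are made up, and `current = 1` replaces `current = true` for SQLite portability. The helper name guids_for_source is hypothetical.

```python
import sqlite3

# Toy stand-in for CKAN's harvest_object table (assumption: only the
# columns used by the query above; the real table has more).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE harvest_object ("
    "guid TEXT, package_id TEXT, current INTEGER, harvest_source_id TEXT)"
)
conn.executemany(
    "INSERT INTO harvest_object VALUES (?, ?, ?, ?)",
    [
        ("catalog=c1", "pkg-1", 1, "source-a"),
        ("catalog=c1;dataset=d1", "pkg-2", 1, "source-a"),
        ("catalog=c1;dataset=d0", "pkg-3", 0, "source-a"),  # no longer current
        ("catalog=c1;dataset=d1", "pkg-4", 1, "source-b"),  # same guid, other source
    ],
)


def guids_for_source(conn, harvest_source_id):
    """Return {guid: package_id} for the current objects of one harvest source."""
    rows = conn.execute(
        "SELECT guid, package_id FROM harvest_object "
        "WHERE current = 1 AND harvest_source_id = :source_id",
        {"source_id": harvest_source_id},
    ).fetchall()
    return {guid: package_id for guid, package_id in rows}


guid_to_package_id = guids_for_source(conn, "source-a")
```

In this toy data the same guid exists under source-a and source-b with different package ids, which is why deleting and re-creating a harvest source leads to the datasets being treated as different.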

Based on these two lists of guids, each harvest object is assigned the status "delete", "new" or "change".
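The status assignment can be sketched with plain set operations. Only the "delete" rule (guids in the database but not in the harvest) is spelled out in this comment; the "new" and "change" rules below are the natural complements and should be read as an assumption, not as a copy of the harvester code. The function name classify_guids is hypothetical.

```python
def classify_guids(guids_in_harvest, guids_in_db):
    """Split guids into the three statuses by comparing source and database.

    guids_in_harvest: guids currently offered by the source
    guids_in_db: guids of current harvest objects for the same source
    """
    guids_in_harvest = set(guids_in_harvest)
    guids_in_db = set(guids_in_db)
    return {
        # Assumption: offered by the source but not yet known to CKAN.
        "new": guids_in_harvest - guids_in_db,
        # From this thread: known to CKAN but no longer offered by the source.
        "delete": guids_in_db - guids_in_harvest,
        # Assumption: present on both sides, so re-imported as an update.
        "change": guids_in_harvest & guids_in_db,
    }


statuses = classify_guids(
    guids_in_harvest={"catalog=c1", "catalog=c1;dataset=d2"},
    guids_in_db={"catalog=c1", "catalog=c1;dataset=d1"},
)
```

With these made-up inputs, the catalog is a "change", dataset d2 is "new" and dataset d1 is a "delete".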

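To delete a dataset: during the gather stage, the ids of the datasets to delete are computed as delete = guids_in_db - guids_in_harvest, where guids_in_db is the set of guids from the harvester table for the given source (the result of the query above) and guids_in_harvest is the set of guids just retrieved from the source.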

During the gather stage, the harvester first sets 'current': False on the harvest objects of the datasets to delete, so they persist in the database but are not shown. The harvester then actually deletes those datasets during the import stage by calling toolkit.get_action('package_delete')(context, {'id': harvest_object.package_id}).

Caveats:

If something goes wrong during the actual deletion in the import stage, the dataset stays in the database indefinitely, marked as no longer current.
A potential issue: if you move a dataset from one catalogue to another in FDP (by updating the DCTERMS.isPartOf reference at the dataset level), it will be considered a new dataset, because the CKAN guid, unlike the FDP one, also includes the catalogue id. The CKAN harvested resource guid for a catalog is catalog=<FDP link to catalog>, where the link to catalog is an FDP catalogue reference URL, e.g. https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0. The guid of a child dataset is catalog=<FDP link to catalog>;dataset=<FDP link to dataset>, where the link to dataset is an FDP reference URL. For example, for the dataset https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25 the CKAN harvester guid will be catalog=https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0;dataset=https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25
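The catalogue-move caveat can be illustrated with a small sketch that builds guids in the format described in this comment and compares the guid sets before and after a move. The helper names catalog_guid and dataset_guid and the second catalogue URL are hypothetical; the set difference used for "delete" is the rule given in this thread, and the mirrored difference for "create" is an assumption.

```python
def catalog_guid(catalog_url):
    """Guid of a catalog, in the format described in this thread."""
    return f"catalog={catalog_url}"


def dataset_guid(catalog_url, dataset_url):
    """Guid of a dataset: the parent catalog URL is part of the identity."""
    return f"catalog={catalog_url};dataset={dataset_url}"


BASE = "https://health-ri.sandbox.semlab-leiden.nl"
catalog_a = f"{BASE}/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0"
catalog_b = f"{BASE}/catalog/hypothetical-second-catalog"  # made-up URL
dataset = f"{BASE}/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25"

# Guids known to CKAN from the previous harvest run (dataset under catalog A).
guids_in_db = {
    catalog_guid(catalog_a),
    catalog_guid(catalog_b),
    dataset_guid(catalog_a, dataset),
}
# Guids offered by FDP after the dataset was moved to catalog B.
guids_in_harvest = {
    catalog_guid(catalog_a),
    catalog_guid(catalog_b),
    dataset_guid(catalog_b, dataset),
}

to_delete = guids_in_db - guids_in_harvest  # old guid under catalog A
to_create = guids_in_harvest - guids_in_db  # same dataset, new guid under B
```

Although the FDP dataset URL is unchanged, the old guid lands in to_delete and the new one in to_create, so under these assumptions the move shows up in CKAN as a deletion plus a fresh dataset rather than an update.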

a-nayden commented 6 months ago

Also, as per the ckanext-harvest documentation, it is possible to avoid updating certain fields: https://github.com/ckan/ckanext-harvest?tab=readme-ov-file#avoid-overwriting-certain-fields-optional

GenomicDataInfrastructure / gdi-userportal-ckanext-fairdatapoint