Closed brunopacheco1 closed 5 months ago
During the gather stage the harvester requests all the available resource guids from a source. Guids are generated as catalog=<link to fdp catalog>;dataset=<link to fdp dataset> for a dataset and catalog=<link to fdp catalog> for a catalog.
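A minimal sketch of the guid format described above. The helper name `build_guid` and the `fdp.example.org` URLs are placeholders for illustration and are not taken from the harvester's code; only the output format comes from this note.

```python
from typing import Optional


def build_guid(catalog_url: str, dataset_url: Optional[str] = None) -> str:
    """Compose a harvester guid from FDP reference URLs (hypothetical helper)."""
    guid = f"catalog={catalog_url}"
    if dataset_url is not None:
        # Dataset guids carry the parent catalog first, then the dataset link.
        guid += f";dataset={dataset_url}"
    return guid


# Placeholder URLs, not real FDP endpoints.
catalog_guid = build_guid("https://fdp.example.org/catalog/1")
dataset_guid = build_guid(
    "https://fdp.example.org/catalog/1",
    "https://fdp.example.org/dataset/42",
)
print(catalog_guid)
print(dataset_guid)
```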
Then it queries the CKAN database for guids harvested from the same source (this means datasets will be considered different if you configure a harvester source, delete it, and then re-configure it).
The query is
SELECT harvest_object.guid AS harvest_object_guid, harvest_object.package_id AS harvest_object_package_id
FROM harvest_object
WHERE harvest_object.current = true AND harvest_object.harvest_source_id = %(harvest_source_id_1)s
where %(harvest_source_id_1)s is the harvester source id of the current job.
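The effect of this query can be tried out with a small stand-in. The sketch below uses an in-memory SQLite database instead of CKAN's actual database, a `harvest_object` table reduced to the four columns the query touches, and a hypothetical helper `guids_in_db`; all sample rows are made up.

```python
import sqlite3

# Same shape as the query above; "current" is quoted and compared to 1
# because SQLite (used here only as a stand-in) stores booleans as integers.
QUERY = """
SELECT harvest_object.guid AS harvest_object_guid,
       harvest_object.package_id AS harvest_object_package_id
FROM harvest_object
WHERE harvest_object."current" = 1
  AND harvest_object.harvest_source_id = :harvest_source_id
"""


def guids_in_db(conn, harvest_source_id):
    """Return {guid: package_id} for the current objects of one source."""
    rows = conn.execute(QUERY, {"harvest_source_id": harvest_source_id})
    return {guid: package_id for guid, package_id in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE harvest_object '
    '(guid TEXT, package_id TEXT, "current" INTEGER, harvest_source_id TEXT)'
)
conn.executemany(
    "INSERT INTO harvest_object VALUES (?, ?, ?, ?)",
    [
        ("catalog=c1;dataset=d1", "pkg-1", 1, "source-a"),
        ("catalog=c1;dataset=d2", "pkg-2", 0, "source-a"),  # no longer current
        ("catalog=c1;dataset=d1", "pkg-3", 1, "source-b"),  # other source id
    ],
)
current = guids_in_db(conn, "source-a")
print(current)  # {'catalog=c1;dataset=d1': 'pkg-1'}
```

The third row illustrates the remark about re-configured sources: the same guid under a different harvester source id is invisible to the job for `source-a`.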
Based on these two lists of guids, a harvest object is assigned the status "delete", "new" or "change".
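The status assignment can be sketched with plain set operations. Only the `delete` rule is spelled out in this note; the `new` and `change` rules in the sketch are assumed by symmetry, and the function name `classify_guids` is hypothetical.

```python
def classify_guids(guids_in_harvest, guids_in_db):
    """Split guids into the three statuses using set arithmetic."""
    guids_in_harvest = set(guids_in_harvest)
    guids_in_db = set(guids_in_db)
    return {
        # Assumption: offered by the source but not yet known to CKAN.
        "new": guids_in_harvest - guids_in_db,
        # As stated in this note: known to CKAN but gone from the source.
        "delete": guids_in_db - guids_in_harvest,
        # Assumption: present on both sides, so re-imported as an update.
        "change": guids_in_db & guids_in_harvest,
    }


statuses = classify_guids(
    guids_in_harvest=["catalog=c1", "catalog=c1;dataset=d1", "catalog=c1;dataset=d3"],
    guids_in_db=["catalog=c1", "catalog=c1;dataset=d1", "catalog=c1;dataset=d2"],
)
for status, guids in statuses.items():
    print(status, sorted(guids))
```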
To delete a dataset
On the gather stage, the ids of datasets to delete are defined as
delete = guids_in_db - guids_in_harvest
where guids_in_db are the ids from the harvester table for a given source (the result of the query above) and guids_in_harvest are the guids received from the source.
On the gather stage the harvester first sets the status of the datasets to delete to 'current': False, so they persist in the database but are not shown. Then, during the import stage, the harvester actually deletes those datasets by calling toolkit.get_action('package_delete')(context, {ID: harvest_object.package_id}).
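The two-step deletion can be simulated without a CKAN instance. In the sketch below the `HarvestObject` dataclass, the two stage functions and the `packages` dictionary are simplified stand-ins, not the harvester's real classes; `delete_package` plays the role of the `package_delete` action, and `ID` is assumed to be a constant equal to `"id"`.

```python
from dataclasses import dataclass


@dataclass
class HarvestObject:
    guid: str
    package_id: str
    current: bool = True
    status: str = ""


def gather_stage(objects, guids_to_delete):
    """Flag objects for deletion; their packages still exist afterwards."""
    flagged = []
    for obj in objects:
        if obj.guid in guids_to_delete:
            obj.current = False
            obj.status = "delete"
            flagged.append(obj)
    return flagged


def import_stage(flagged, delete_package):
    """Delete the package behind every object flagged during gathering."""
    for obj in flagged:
        if obj.status == "delete":
            delete_package({"id": obj.package_id})


# Stand-in for the CKAN package store and the package_delete action.
packages = {"pkg-1": "dataset one", "pkg-2": "dataset two"}
objects = [
    HarvestObject("catalog=c1;dataset=d1", "pkg-1"),
    HarvestObject("catalog=c1;dataset=d2", "pkg-2"),
]

flagged = gather_stage(objects, {"catalog=c1;dataset=d2"})
after_gather = sorted(packages)  # package still present, object not current

import_stage(flagged, lambda data_dict: packages.pop(data_dict["id"]))
after_import = sorted(packages)  # package removed only now

print(after_gather, after_import)
```

Between the two stages the object is already marked as not current while its package still exists, which is the intermediate state described above.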
Caveat: if the import stage is not completed, a dataset stays marked as no longer current forever.
If the catalog a dataset belongs to changes (the DCTERMS.isPartOf reference on the dataset level), then it will be considered a new dataset, because the CKAN guid, unlike the FDP one, includes a catalogue id as well. The CKAN harvested resource guid for a catalog will be catalog=<FDP link to catalog>, where the link to catalog is an FDP catalogue reference URL, e.g. https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0
A guid of a child dataset will be catalog=<FDP link to catalog>;dataset=<FDP link to dataset>, where the link to dataset is an FDP reference URL. E.g. for the dataset https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25 the CKAN harvester guid will be catalog=https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0;dataset=https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25
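Because the catalog URL is part of the guid, the same dataset URL under a different catalog yields a different guid. A minimal sketch of that effect follows; `parse_guid` is a hypothetical helper, the second catalog URL is a made-up placeholder, and the parsing assumes the URLs themselves contain no `;`.

```python
def parse_guid(guid):
    """Split a harvester guid into (catalog URL, dataset URL or None)."""
    parts = dict(item.split("=", 1) for item in guid.split(";"))
    return parts["catalog"], parts.get("dataset")


catalog = "https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0"
dataset = "https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25"
other_catalog = "https://fdp.example.org/catalog/other"  # hypothetical

old_guid = f"catalog={catalog};dataset={dataset}"
new_guid = f"catalog={other_catalog};dataset={dataset}"

# Same FDP dataset, but the guids differ, so guid-based comparison
# treats the old one as missing and the new one as unseen.
same_dataset = parse_guid(old_guid)[1] == parse_guid(new_guid)[1]
same_guid = old_guid == new_guid
print(same_dataset, same_guid)  # True False
```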
Also, as per the documentation, it is possible to avoid updating certain fields: https://github.com/ckan/ckanext-harvest?tab=readme-ov-file#avoid-overwriting-certain-fields-optional
🎯 What? (Story Description)
We need to understand how the harvester handles duplication and updates.
💡 Why? (Justification)
So we don't have surprises or unexpected behaviour.
🔨 Tasks (Breakdown)
gdi-userportal-docs
✅ Acceptance Criteria
➕ Additional Information
No response