GSA / data.gov

Main repository for the data.gov service
https://data.gov

Add more fields to harvester 2.0 CKAN API payload to maintain metadata links and collection relationship #4847

Closed FuhuXia closed 1 week ago

FuhuXia commented 2 months ago

Since harvester 2.0, packages are added to CKAN via API calls rather than through ckanext-harvest's harvesting activity. We need to reconsider how to maintain the metadata links and collection relationships that the legacy ckanext-harvest and other related extensions provided.

Metadata link

This is a block that displays harvest object and harvest source info for each dataset. For it to show up in catalog-next, the API call payload needs to include these three keys and their values in the extras field:

harvest_object_id
harvest_source_id
harvest_source_title

With this approach, no change is needed on the CKAN (catalog-next) side to keep the metadata link block and the harvest-source-related facet search. When a user clicks to view the original harvest object metadata or the harvest source details, we can redirect to the harvester 2.0 Flask app.
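As a rough illustration of the payload described above, here is a minimal sketch that builds a package dict with the three keys in the extras field. CKAN represents extras as a list of key/value dicts; the helper names, the dataset fields, and all sample values are hypothetical and not taken from the harvester 2.0 code.

```python
import json


def build_harvest_extras(harvest_object_id, harvest_source_id, harvest_source_title):
    """Return the three harvest keys in CKAN's extras format (list of key/value dicts)."""
    return [
        {"key": "harvest_object_id", "value": harvest_object_id},
        {"key": "harvest_source_id", "value": harvest_source_id},
        {"key": "harvest_source_title", "value": harvest_source_title},
    ]


def build_package_payload(name, title, owner_org, extras):
    """Assemble a minimal package dict; a real payload would carry many more fields."""
    return {
        "name": name,
        "title": title,
        "owner_org": owner_org,
        "extras": extras,
    }


# All values below are made-up placeholders.
payload = build_package_payload(
    name="example-dataset",
    title="Example Dataset",
    owner_org="example-org",
    extras=build_harvest_extras(
        harvest_object_id="record-0001",
        harvest_source_id="source-0001",
        harvest_source_title="Example Harvest Source",
    ),
)

print(json.dumps(payload, indent=2))
```

The resulting dict is what would be sent as the JSON body of the package create or update API call; how the harvester actually issues that call is not shown here.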

Collection

There has never been a complete solution for handling DCAT collection relationships in CKAN. For example, harvest errors may mean that multiple attempts are needed to finish harvesting a datajson with collections. During the initial harvest, the parent check is enforced before a child dataset can be harvested, but in any subsequent reharvest the parent dataset can be deleted, leaving all previously harvested child datasets orphaned.

My suggestions for collection_package_id:

1. Do not use the parent's CKAN id as the collection id. Use the combination of harvest-source-id+identifier instead (more on this in item 3). This way children can be harvested regardless of whether the parent dataset is present.
2. The parent dataset is not aware of its parenthood. We detect a dataset's parenthood with a Solr query when its detail page is loaded. This means there won't be a collection icon on the dataset listing page; it only shows up on the detail page. This behavior is roughly in sync with DCAT, where the parent record is not aware of its parenthood: when all child datasets are gone, the parent record is just a regular record.
3. Use the combination of harvest-source-id+identifier as the collection id, rather than harvest-source-name+identifier or a hash of it, so that the collection id is permanent and searchable. We can split the id into harvest-source-id and identifier and locate the dataset in CKAN search. We can't use the identifier alone, since it is only guaranteed to be unique within a harvest source.
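The composite collection id suggested above could be sketched as follows. The `|` separator, the function names, and the exact query syntax are assumptions made for illustration; the discussion does not specify how the two parts would be joined or how the Solr query would be written.

```python
from typing import Tuple

# Hypothetical delimiter; assumed never to occur in a harvest source id.
SEPARATOR = "|"


def make_collection_id(harvest_source_id: str, identifier: str) -> str:
    """Join harvest source id and dataset identifier into one collection id."""
    if SEPARATOR in harvest_source_id:
        raise ValueError("harvest source id must not contain the separator")
    return f"{harvest_source_id}{SEPARATOR}{identifier}"


def split_collection_id(collection_id: str) -> Tuple[str, str]:
    """Split a collection id back into (harvest_source_id, identifier).

    Splitting on the first separator keeps identifiers that themselves
    contain the separator intact.
    """
    source_id, sep, identifier = collection_id.partition(SEPARATOR)
    if not sep:
        raise ValueError("not a composite collection id")
    return source_id, identifier


def children_query(harvest_source_id: str, parent_identifier: str) -> str:
    """Build a Solr-style filter matching the children of a given parent."""
    collection_id = make_collection_id(harvest_source_id, parent_identifier)
    # Escape backslashes and double quotes so the value is a valid quoted phrase.
    escaped = collection_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'collection_package_id:"{escaped}"'


cid = make_collection_id("source-0001", "https://example.gov/data/parent")
print(cid)
print(split_collection_id(cid))
print(children_query("source-0001", "https://example.gov/data/parent"))
```

Because the id is built only from the harvest source id and the identifier, a child can carry it whether or not the parent exists in CKAN yet, and a detail page can check for children by running the filter and testing for a non-empty result.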

10/21/2024 Update: Based on the team discussion, we will not set and pass the ids from the harvesting process. Instead, all information will be handled on the CKAN side, as it is already available there. (Details in https://github.com/GSA/data.gov/issues/4969)

FuhuXia commented 2 months ago

For the Metadata link field names, we can go with what is defined in ticket https://github.com/GSA/data.gov/issues/4856:

record_id
harvest_source_id
harvset_source_name

Jin-Sun-tts commented 2 weeks ago

Included the three keys and their values in the extras field:

record_id => harvest_object_id
harvest_source_id => harvest_source_id
harvset_source_name => harvest_source_title

The following metadata source block now shows up, displaying harvest object and harvest source info: [Image]

jbrown-xentity commented 2 weeks ago

For context, here are the notes from our planning session on how to handle collections: https://docs.google.com/document/d/1xaWeIOaqgL1Qo6kmWm_S7QwOcoD4i19kw4IwxNLpGa4/edit