NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Parser Fix]: distribution.contentUrl for Zenodo #129

Open gtsueng opened 3 months ago

gtsueng commented 3 months ago

Issue Name

distribution.contentUrl for Zenodo

Issue Description

The Zenodo parser does not currently appear to populate the distribution field. Based on a quick review of 10 records on the Zenodo site, Zenodo uses the following URL format to provide access to the files available for download:

Although this link corresponds to the "Download all" button on the Zenodo site rather than to each individual file download, it can still be parsed into the 'distribution.contentUrl' field.

Issue Example

Example Zenodo record on prod: https://data.niaid.nih.gov/resources?id=ZENODO_6983398
Same record in Zenodo: https://zenodo.org/records/6983398
File download URL from the record in Zenodo: https://zenodo.org/api/records/6983398/files-archive
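
For illustration, here is a minimal Python sketch of how a parser could derive this value from a record id. The function names, the handling of a `ZENODO_` prefix (as seen in the prod id above), and the shape of the `distribution` object are assumptions made for this example, not the crawler's actual code.

```python
import re

ZENODO_API_BASE = "https://zenodo.org/api/records"


def files_archive_url(identifier):
    """Build the "download all" URL for a Zenodo record id.

    Accepts a bare id ("6983398" or 6983398) or a prefixed id such as
    "ZENODO_6983398". The prefix handling is an assumption for this sketch.
    """
    match = re.fullmatch(r"(?:ZENODO_)?(\d+)", str(identifier).strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a numeric Zenodo record id: {identifier!r}")
    return f"{ZENODO_API_BASE}/{match.group(1)}/files-archive"


def with_distribution(record):
    """Return a copy of `record` with distribution.contentUrl filled in.

    `record` is a hypothetical dict with an "identifier" key; the
    DataDownload wrapper is assumed, not taken from the parser.
    """
    updated = dict(record)
    updated["distribution"] = [{
        "@type": "DataDownload",
        "contentUrl": files_archive_url(record["identifier"]),
    }]
    return updated
```

Calling `files_archive_url("ZENODO_6983398")` yields the file download URL listed in the example above. Note that this only constructs the URL; it does not check that the link resolves.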

gtsueng commented 2 months ago

@jal347 can you double-check the url you used in the correction?

The data on staging uses the URL format https://zenodo.org/record/{identifier}/files-archive, which is not correct.

The download URLs actually point to the Zenodo API: https://zenodo.org/api/records/{identifier}/files-archive
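
As a rough sketch of the correction, the snippet below rewrites the page-style format seen on staging into the API format. The function name and the regular expression are assumptions for illustration; URLs that do not match the incorrect staging pattern are returned unchanged.

```python
import re

# Matches only the incorrect page-style format reported on staging.
_STAGING_STYLE = re.compile(r"^https?://zenodo\.org/record/(\d+)/files-archive/?$")


def to_api_files_archive(url):
    """Rewrite a staging-style files-archive URL to the API form.

    Any URL that does not match the staging pattern is returned as-is.
    """
    match = _STAGING_STYLE.match(url.strip())
    if match is None:
        return url
    return f"https://zenodo.org/api/records/{match.group(1)}/files-archive"
```

For example, `to_api_files_archive("https://zenodo.org/record/6983398/files-archive")` returns `https://zenodo.org/api/records/6983398/files-archive`, while a URL already in the API format passes through untouched.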

gtsueng commented 1 month ago

@jal347 I found a few issues when looking at the data in Staging:

The URL to access the record does not work for many Zenodo records:

The distribution.contentUrl is broken for many records even though the base URL is correct. This is linked to the issue above and depends on whether a Zenodo id is a canonical id or a versioned record id.

Cause of issue:

Potential solution:

Other observations: