fatiando / pooch

A friend to fetch your data files

DOI registry assumes md5 hashing algorithm #435

Open ionathan opened 3 weeks ago

ionathan commented 3 weeks ago

Description of the problem:

While trying to load a registry from a DOI on dataverse.nl, I realized that they use SHA1. In pooch, the hash algorithm is hard-coded to md5.

Full code that generated the error

import pooch
example = pooch.create(

Full error message

KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 example.load_registry_from_doi()

File /usr/local/lib/python3.11/site-packages/pooch/core.py:704, in Pooch.load_registry_from_doi(self)
    701 repository = doi_to_repository(doi)
    703 # Call registry population for this repository
--> 704 return repository.populate_registry(self)

File /usr/local/lib/python3.11/site-packages/pooch/downloaders.py:1162, in DataverseRepository.populate_registry(self, pooch)
   1151 """
   1152 Populate the registry using the data repository's API
   1157     The pooch instance that the registry will be added to.
   1158 """
   1160 for filedata in self.api_response.json()["data"]["latestVersion"]["files"]:
   1161     pooch.registry[filedata["dataFile"]["filename"]] = (
-> 1162         f"md5:{filedata['dataFile']['md5']}"
   1163     )

KeyError: 'md5'

dokempf commented 3 weeks ago

When I wrote this, I was not aware that Dataverse supports different checksum algorithms. I agree this should be fixed, but to do it properly, we should first get the full picture of how Dataverse handles checksums.

dokempf commented 1 day ago

Apparently, Dataverse can be configured to work with one of four hashing algorithms: MD5, SHA-1, SHA-256, or SHA-512 (Source). There is an API route to check which one is in use, but it is only intended for uploads and gives no guarantee about which checksums are present on existing data. I therefore think our best bet is to iterate through a hard-coded list of keys until we find one that is present in the API response.
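
A minimal sketch of that key-iteration idea, not the actual patch. Only the "md5" key is confirmed by the traceback above; the key names "sha1", "sha256" and "sha512" are assumptions about the API response, and `CANDIDATE_KEYS`, `find_checksum` and the standalone `populate_registry` are hypothetical helpers that work on plain dicts rather than on a Pooch instance.

```python
# Candidate keys, tried in this order. Only "md5" is confirmed by the
# traceback; the other key names are assumptions about the API response.
CANDIDATE_KEYS = ("md5", "sha1", "sha256", "sha512")


def find_checksum(datafile):
    """Return "<algorithm>:<hexdigest>" for the first candidate key present."""
    for key in CANDIDATE_KEYS:
        if key in datafile:
            return f"{key}:{datafile[key]}"
    # Fail with a clear message instead of a bare KeyError.
    raise ValueError(
        f"No supported checksum found for file '{datafile.get('filename')}'."
    )


def populate_registry(registry, files):
    """Fill a dict registry from a list of Dataverse-style file entries."""
    for filedata in files:
        datafile = filedata["dataFile"]
        registry[datafile["filename"]] = find_checksum(datafile)
    return registry


# Made-up entries for illustration only; the hash values are placeholders.
files = [
    {"dataFile": {"filename": "a.csv", "sha1": "0123abcd"}},
    {"dataFile": {"filename": "b.csv", "md5": "4567ef89"}},
]
registry = populate_registry({}, files)
```

With this approach, a file that only carries a SHA-1 checksum would end up in the registry with a `sha1:` prefix instead of raising `KeyError: 'md5'`, and a file with none of the listed keys would produce an explicit error.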