fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

DOI registry assumes md5 hashing algorithm #435

Open ionathan opened 3 weeks ago

ionathan commented 3 weeks ago

Description of the problem:

While trying to load a registry from a DOI hosted on dataverse.nl, I realized that this repository reports SHA-1 checksums. Pooch, however, hard-codes the hash algorithm as md5 when it populates the registry.

Full code that generated the error

import pooch
example = pooch.create(
    path=pooch.os_cache("example"),
    base_url="doi:10.34894/5SOKTV",
)
example.load_registry_from_doi()

Full error message

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 example.load_registry_from_doi()

File /usr/local/lib/python3.11/site-packages/pooch/core.py:704, in Pooch.load_registry_from_doi(self)
    701 repository = doi_to_repository(doi)
    703 # Call registry population for this repository
--> 704 return repository.populate_registry(self)

File /usr/local/lib/python3.11/site-packages/pooch/downloaders.py:1162, in DataverseRepository.populate_registry(self, pooch)
   1151 """
   1152 Populate the registry using the data repository's API
   1153 
   (...)
   1157     The pooch instance that the registry will be added to.
   1158 """
   1160 for filedata in self.api_response.json()["data"]["latestVersion"]["files"]:
   1161     pooch.registry[filedata["dataFile"]["filename"]] = (
-> 1162         f"md5:{filedata['dataFile']['md5']}"
   1163     )

KeyError: 'md5'

dokempf commented 3 weeks ago

When I wrote this, I was not aware that Dataverse supports different checksum algorithms. I agree this should be fixed, but to do it properly we should first get the full picture of how Dataverse handles checksums.

dokempf commented 1 day ago

Apparently, Dataverse can be configured to work with one of four hashing algorithms: MD5, SHA-1, SHA-256, and SHA-512 (Source). There is an API route to check which one is in use, but it is only intended for uploads and gives no guarantee about which checksums are present on existing data. I therefore think our best bet is to iterate through a hard-coded list of keys until we find one that is present in the API response.
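
A rough sketch of that key-iteration idea, to make the proposal concrete. Only the `md5` key is confirmed by the traceback above; the other key names (`sha1`, `sha256`, `sha512`), the `CANDIDATE_HASH_KEYS` constant, and the `registry_entry` helper are hypothetical and would need to be checked against real Dataverse API responses before going into pooch.

```python
# Keys to look for in a file's "dataFile" entry, tried in order.
# ASSUMPTION: only "md5" is known from the traceback; the remaining key
# names are placeholders until verified against actual API responses.
CANDIDATE_HASH_KEYS = ("md5", "sha1", "sha256", "sha512")


def registry_entry(datafile):
    """
    Build a "<algorithm>:<checksum>" registry value for one file.

    ``datafile`` is the dict found under ``filedata["dataFile"]`` in the
    API response. The first candidate key present in it is used.
    (Hypothetical helper, not part of pooch.)
    """
    for key in CANDIDATE_HASH_KEYS:
        if key in datafile:
            return f"{key}:{datafile[key]}"
    raise ValueError(
        f"No known checksum key found for file '{datafile.get('filename')}'. "
        f"Tried: {', '.join(CANDIDATE_HASH_KEYS)}."
    )


# Example with a made-up entry that only carries a SHA-1 checksum
entry = registry_entry({"filename": "data.csv", "sha1": "0a1b2c"})
print(entry)
```

Inside `populate_registry`, the loop body would then assign `registry_entry(filedata["dataFile"])` instead of formatting the `md5` value directly. Raising an explicit error when no key matches seems preferable to a bare `KeyError`, since it tells the user which keys were tried.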