Unable to download files in Dataverse repositories when files don't have PIDs

jggautier commented 1 year ago

Description of the problem:

Hi! I was happy to hear that pooch has support for downloading files in repositories that use Dataverse. I played around with the commands a bit today.

The pooch.create and DOIDownloader functions work great when the files I want to download in Dataverse repositories have persistent identifiers. It looks like both functions assume that files in Dataverse repositories will have PIDs.

But many repositories using Dataverse don't register PIDs for their files (see a conversation in the Dataverse Google Group about which repositories do and don't). We can't assume that all files in Dataverse repositories will have PIDs, and it's likely that more repositories using Dataverse will "turn off" PID registration for their files.

So when I try to download a file that doesn't have a PID, I get an error.

Full code that generated the error

Using pooch.create:

data = pooch.create(
    base_url='doi:10.7910/DVN/DCDKZQ',
    path=pooch.os_cache('testing'))
data.load_registry_from_doi()
datafile = data.fetch(
    fname='Indata_2022.10.02_19.44.51.zip')

Using DOIDownloader:

downloader = DOIDownloader()
url = 'doi:10.7910/DVN/DCDKZQ/Indata_2022.10.02_19.44.51.zip'
downloader(
    url=url, 
    output_file='/testing/Indata_2022.10.02_19.44.51.zip',
    pooch=None)

Full error message

Traceback (most recent call last):
  File "/testing_pooch.py", line 28, in <module>
    downloader(
  File "/usr/local/lib/python3.9/site-packages/pooch/downloaders.py", line 611, in __call__
    downloader(download_url, output_file, pooch)
  File "/usr/local/lib/python3.9/site-packages/pooch/downloaders.py", line 207, in __call__
    response.raise_for_status()
  File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=
[Finished in 2.9s with exit code 1]

System information

Operating system: Mac OS 13.2.1
Python installation (Anaconda, system, ETS): System (I think)
Version of Python: 3.9.12
Version of this package: v1.7.0

jggautier commented 1 year ago

While not all files in Dataverse repositories have PIDs, they always have database IDs, so the Dataverse API endpoint for accessing files can use the file's database ID instead.

I think the download_url function at https://github.com/fatiando/pooch/blob/a965902d26015453ac00269597a23b83d85db644/pooch/downloaders.py#L1022 tries to get a file's PID to create a download url for the file. And when the file doesn't have a PID, it creates a URL, e.g. https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=, that returns that 404 HTTPError.

The same API endpoint can also use the file's database ID, e.g. https://dataverse.harvard.edu/file.xhtml?fileId=6570505, where 6570505 is the database ID of the file "Indata_2022.10.02_19.44.51.zip".

So could that download_url always look for the file's database ID, instead, and use that to create the download_url?

santisoler commented 1 year ago

Hi @jggautier! Thanks for opening this issue.

I'll open a PR to fix this bug. I see your point and I do think we should add support for files that don't have a persistentID.

I'm going to download the file using the ID only if the persistentID is empty. This way we will still be using the persistentID for any file that provides it, and fallback to the file ID for these corner cases.

cc @dokempf

jggautier commented 1 year ago

That sounds great, thanks!

Just in case it's helpful, I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs. That is, when I checked in October, 49 of 70 repositories had files with no persistentIDs. And the Dataverse community has been talking about turning off file PID registration (it's on by default, and many repositories have had to turn it off due to cost and performance issue), so I think it's likely that more files will be published without file PIDs than with them.

dokempf commented 1 year ago

Thanks for the report @jggautier - I was implementing this based on my own DataVerse experience, which seems to be with an instance that has PIDs on files. I was not aware of the fact.

@santisoler I am also available for this if your too constrained - just ask. Do you think having the test dataset on an instance without PIDs would be a valueable addition to the testsuite? Maybe @jggautier could provide access to one.

santisoler commented 1 year ago

Thanks @dokempf for jumping in. I've already open an PR to fix this (#355), but it still needs some tests. Feel free to take over that PR.

Do you think having the test dataset on an instance without PIDs would be a valueable addition to the testsuite?

For sure, that would be excellent.

I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs.

Sorry, I misunderstood it. Thanks for the clarification.

I see that probably persistentIDs will become obsolete in the near future. But since the documentation of the Dataverse API offers the persistentID as the first option and mentions the id as an alternative, I would be inclined to keep the former as the first approach and fallback to the latter if the persistent id is empty (the changes in #355 reflect this). I think it's a conservative decision that lowers the chance of breaking backward compatibility, doesn't introduce a performance hit (it just needs to check if an string is empty) and actually solves this bug.

Let me know if there's any detail I'm not seeing. I'm actually quite new to Dataverse. Actually, I found out about it through @dokempf contributions to Pooch (so big thanks for that).

jggautier commented 1 year ago

Your point about what the API documentation implies makes sense to me. Thanks so much for the insight. I'd guess that documentation was written when the ability to register persistentIDs for files was added to Dataverse, and it was assumed that most files would be getting persistentIDs. It wasn't anticipated that repositories would need to turn off file persistentIDs. @pdurbin who's also at Dataverse and works on Dataverse APIs a lot would know more. :)

It's possible that in the future, different types of users, like depositors and curators, will have more control over which of their datasets in a repository do and don't get file persistentIDs. (And the documentation will be updated.)

You're approach sounds fine to me, although I'm no developer! :)

Thanks again!

leouieda commented 1 year ago

The only downside I can see to @santisoler's proposal is that for a majority of repositories we'll be hitting the API twice to get an ID for download. That will have a performance penalty since it will depend if the network and server speeds. Probably nothing major unless someone is downloading thousands of times.

dokempf commented 1 year ago

The only downside I can see to @santisoler's proposal is that for a majority of repositories we'll be hitting the API twice to get an ID for download. That will have a performance penalty since it will depend if the network and server speeds. Probably nothing major unless someone is downloading thousands of times.

We are actually hitting the API exactly once and cache the result of that. All the file information (PID or not) is in that one API response.

leouieda commented 1 year ago

We are actually hitting the API exactly once and cache the result of that. All the file information (PID or not) is in that one API response.

Ah, so both the database ID and PID are in that same response? Then forget what I said 🙂

pdurbin commented 1 year ago

Hi! I just left a review: https://github.com/fatiando/pooch/pull/355#pullrequestreview-1346501936

fatiando / pooch

Unable to download files in Dataverse repositories when files don't have PIDs #354