Closed jggautier closed 9 months ago
While not all files in Dataverse repositories have PIDs, they always have database IDs, so the Dataverse API endpoint for accessing files can use the file's database ID instead.
I think the download_url function at https://github.com/fatiando/pooch/blob/a965902d26015453ac00269597a23b83d85db644/pooch/downloaders.py#L1022 tries to get a file's PID to create a download URL for the file. When the file doesn't have a PID, it creates a URL with an empty persistentId value, e.g. https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId= (note that nothing follows the final "="), and that URL returns the 404 HTTPError.
The same API endpoint can also use the file's database ID. For example, 6570505 is the database ID of the file "Indata_2022.10.02_19.44.51.zip" (see https://dataverse.harvard.edu/file.xhtml?fileId=6570505).
So could the download_url function always look for the file's database ID instead, and use that to create the download URL?
Hi @jggautier! Thanks for opening this issue.
I'll open a PR to fix this bug. I see your point and I do think we should add support for files that don't have a persistentID.
I'm going to download the file using the database ID only if the persistentID is empty. This way we will still use the persistentID for any file that provides it, and fall back to the file ID for these corner cases.
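To make the proposed fallback concrete, here is a minimal sketch. It is not the code from the PR: `build_download_url` is a hypothetical helper, and it assumes each file entry is a dict with an `id` key (the database ID) and an optional `persistentId` key that is empty when no file PID was registered. The PID in the usage example is a made-up placeholder.

```python
def build_download_url(base_url: str, data_file: dict) -> str:
    """Build a Dataverse file-access URL for one file entry.

    Assumption: ``data_file`` has an ``id`` key (database ID) and an
    optional ``persistentId`` key that is empty or missing when the
    repository does not register PIDs for files.
    """
    pid = data_file.get("persistentId") or ""
    if pid:
        # Preferred path: the file has a persistent identifier.
        return f"{base_url}/api/access/datafile/:persistentId?persistentId={pid}"
    # Fallback path: use the database ID, which every file has.
    return f"{base_url}/api/access/datafile/{data_file['id']}"


# Usage: a file without a PID (database ID taken from this issue) and
# a file with a placeholder PID.
no_pid = {"id": 6570505, "persistentId": ""}
with_pid = {"id": 1, "persistentId": "doi:10.0000/EXAMPLE/FILE1"}
url_no_pid = build_download_url("https://dataverse.harvard.edu", no_pid)
url_with_pid = build_download_url("https://dataverse.harvard.edu", with_pid)
```

The only extra work compared with always using the PID is a check for an empty string, so files that do have a PID are handled as before.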
cc @dokempf
That sounds great, thanks!
Just in case it's helpful, I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs. When I checked in October, 49 of 70 repositories had files with no persistentIDs. And the Dataverse community has been talking about turning off file PID registration by default (it's currently on by default, and many repositories have had to turn it off due to cost and performance issues), so I think it's likely that more files will be published without file PIDs than with them.
Thanks for the report @jggautier - I implemented this based on my own Dataverse experience, which seems to be with an instance that has PIDs on files. I was not aware that many instances don't.
@santisoler I am also available for this if you're too constrained - just ask. Do you think having the test dataset on an instance without PIDs would be a valuable addition to the test suite? Maybe @jggautier could provide access to one.
Thanks @dokempf for jumping in. I've already opened a PR to fix this (#355), but it still needs some tests. Feel free to take over that PR.
Do you think having the test dataset on an instance without PIDs would be a valuable addition to the test suite?
For sure, that would be excellent.
I'd like to clarify that I don't think it's a corner case that files in Dataverse repositories won't have persistentIDs.
Sorry, I misunderstood it. Thanks for the clarification.
I see that persistentIDs will probably become obsolete in the near future. But since the Dataverse API documentation offers the persistentID as the first option and mentions the ID as an alternative, I would be inclined to keep the former as the first approach and fall back to the latter if the persistentID is empty (the changes in #355 reflect this). I think it's a conservative decision that lowers the chance of breaking backward compatibility, doesn't introduce a performance hit (it just needs to check whether a string is empty), and actually solves this bug.
Let me know if there's any detail I'm not seeing. I'm quite new to Dataverse. In fact, I found out about it through @dokempf's contributions to Pooch (so big thanks for that).
Your point about what the API documentation implies makes sense to me. Thanks so much for the insight. I'd guess that documentation was written when the ability to register persistentIDs for files was added to Dataverse, and it was assumed that most files would be getting persistentIDs. It wasn't anticipated that repositories would need to turn off file persistentIDs. @pdurbin who's also at Dataverse and works on Dataverse APIs a lot would know more. :)
It's possible that in the future, different types of users, like depositors and curators, will have more control over which of their datasets in a repository do and don't get file persistentIDs. (And the documentation will be updated.)
Your approach sounds fine to me, although I'm no developer! :)
Thanks again!
The only downside I can see to @santisoler's proposal is that for a majority of repositories we'll be hitting the API twice to get an ID for download. That will have a performance penalty, since it will depend on the network and server speeds. Probably nothing major unless someone is downloading thousands of times.
We are actually hitting the API exactly once and cache the result of that. All the file information (PID or not) is in that one API response.
Ah, so both the database ID and PID are in that same response? Then forget what I said 🙂
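For readers following along, the "one request, cached" idea can be sketched roughly as below. This is an illustration, not Pooch's implementation: `DatasetFiles` and `fake_fetch` are hypothetical names, and the nested `data` / `latestVersion` / `files` / `dataFile` layout is an assumption about the shape of the dataset metadata response. The second file entry and its PID are made-up placeholders.

```python
class DatasetFiles:
    """Look up file identifiers from a single, cached metadata response."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that performs the one API request
        self._response = None    # filled in on first access

    @property
    def response(self):
        # Fetch lazily and only once; later lookups reuse the cached dict.
        if self._response is None:
            self._response = self._fetch()
        return self._response

    def identifiers(self, file_name):
        """Return (database_id, persistent_id) for ``file_name``."""
        for entry in self.response["data"]["latestVersion"]["files"]:
            data_file = entry["dataFile"]
            if data_file["filename"] == file_name:
                return data_file["id"], data_file.get("persistentId", "")
        raise ValueError(f"File '{file_name}' not found in the dataset.")


calls = []  # records how many times the fake API is hit


def fake_fetch():
    # Stand-in for the real HTTP request (assumed response layout).
    calls.append(1)
    return {"data": {"latestVersion": {"files": [
        {"dataFile": {"id": 6570505, "persistentId": "",
                      "filename": "Indata_2022.10.02_19.44.51.zip"}},
        {"dataFile": {"id": 42, "persistentId": "doi:10.0000/EXAMPLE/FILE2",
                      "filename": "example.csv"}},
    ]}}}


files = DatasetFiles(fake_fetch)
first = files.identifiers("Indata_2022.10.02_19.44.51.zip")
second = files.identifiers("example.csv")
```

Because both identifiers come from the same response, choosing between the PID and the database ID requires no additional request.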
Hi! I just left a review: https://github.com/fatiando/pooch/pull/355#pullrequestreview-1346501936
Description of the problem:
Hi! I was happy to hear that pooch has support for downloading files in repositories that use Dataverse. I played around with the commands a bit today.
The pooch.create and DOIDownloader functions work great when the files I want to download in Dataverse repositories have persistent identifiers. It looks like both functions assume that files in Dataverse repositories will have PIDs.
But many repositories using Dataverse don't register PIDs for their files (see a conversation in the Dataverse Google Group about which repositories do and don't). We can't assume that all files in Dataverse repositories will have PIDs, and it's likely that more repositories using Dataverse will "turn off" PID registration for their files.
So when I try to download a file that doesn't have a PID, I get an error.
Full code that generated the error
Using pooch.create:
Using DOIDownloader:
Full error message
System information