Closed MarkWieczorek closed 4 years ago
I am using the latest version on pypi (please note that pooch.__version__ is not defined and that pooch.version does not return a version string).
Ah right, that was something I changed and completely forgot to document. pooch.version is a module and pooch.version.full_version should be the string. We should still define __version__, probably for compatibility.
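For anyone who needs a version string in the meantime, here is a minimal sketch of a lookup that tries both places mentioned above. The helper name `get_version_string` and the stand-in object are hypothetical and only mimic the layout described in this thread; they are not part of Pooch.

```python
from types import SimpleNamespace


def get_version_string(module):
    """Return a version string for `module`, or None if none is found."""
    # Conventional attribute, if the package defines it.
    version = getattr(module, "__version__", None)
    if isinstance(version, str):
        return version
    # Fallback described above: `version` is a submodule exposing `full_version`.
    full_version = getattr(getattr(module, "version", None), "full_version", None)
    if isinstance(full_version, str):
        return full_version
    return None


# Stand-in object mimicking the layout described in this thread (not the real package).
fake_pooch = SimpleNamespace(version=SimpleNamespace(full_version="0.0.0-example"))
print(get_version_string(fake_pooch))
```

With the real package, the call would presumably be `get_version_string(pooch)` after `import pooch`.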
Though I have no idea what the problem is, I am guessing that it might have something to do with the odd file name with lots of control characters in it.
That shouldn't be a problem since Python 3 uses unicode by default and the characters seem to be escaped properly.
The 403 error code means that the server understood the request but refused to do it because of client parameters (in this case, requests). My guess is that the journal doesn't allow automated downloads of the data, which I've heard about before.
I tried it with curl -L "https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1029%2F2018JE005854&file=jgre21147-sup-0003-Table_SI-S01.txt" -o meh.txt
and got back a page saying that cookies must be enabled for login. So it's probably the journal not allowing non-browser based downloads. Which is ridiculous and why no one should ever put their data in journal supplements.
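Since the server can answer with a login page instead of the data, it may help to check what was actually downloaded. The following is only a rough heuristic sketch; the function name `looks_like_html` and the sample payloads are made up for illustration.

```python
def looks_like_html(payload: bytes) -> bool:
    """Heuristically detect an HTML page where a plain data file was expected."""
    # Only inspect the start of the payload, ignoring leading whitespace and case.
    head = payload[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


# Made-up payloads: a login-style page and a small data table.
login_page = b"<!DOCTYPE html>\n<html><body>Cookies must be enabled</body></html>"
data_table = b"# lat lon value\n10.0 20.0 1.5\n"
print(looks_like_html(login_page), looks_like_html(data_table))
```

A check like this only catches the obvious case. Comparing the file against a known hash is the more reliable safeguard.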
I wonder if there is something we can do about this. Maybe fool the server somehow? But this is getting a bit out of my league.
@andersy005 @hugovk @remram44 do any of you know anything about this (or maybe know someone who does)?
I just contacted the journal to let them know about this. I'll let you know what they say. (For info, requests has a similar problem as curl.)
Personally, I think that it is crazy that we have to pay to publish in a journal, and then the journal refuses to archive our datasets. It's even worse in a scenario like this one, where the archived data are not easily accessible...
I think you might be hitting some kind of CloudFlare or Wiley protection. I am also hitting the 403, but not every time. Sometimes it will first redirect to ...&cookieSet=1. I think you might want to copy the file somewhere else and link to it, as this seems to have been set up by Wiley specifically to prevent automated hits.
@remram44 yeah I was getting that until I told curl to follow redirects. I remember seeing some rants on Twitter about publishers forbidding scraping, which I'm guessing is what we're running into.
@MarkWieczorek yeah, it's very frustrating. Let us know what they write back. But I'm not surprised if they just say "no".
Sometimes setting a user-agent in the request header can make things work:
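A minimal sketch of that idea, using only the Python standard library rather than requests so that it has no dependencies. The user-agent string and the helper names `build_request` and `fetch` are hypothetical, and there is no guarantee that this gets past the protection discussed in this thread.

```python
import urllib.request

# Hypothetical browser-like identifier; any string can be used here.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) example-downloader/0.1"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Create a GET request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=EXAMPLE_USER_AGENT, timeout=30):
    """Download `url` and return the raw bytes.

    A 403 response still raises urllib.error.HTTPError.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With requests, the same header dictionary can be passed as `requests.get(url, headers={"User-Agent": ...})`.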
Please see PR https://github.com/fatiando/pooch/pull/179.
And here is the response from the American Geophysical Union:
According to Wiley and AGU guidelines, Supporting Information is not regarded as a proper repository for datasets, for exactly the kind of issues you experience. That’s exactly why we request our authors to upload their datasets on Zenodo or other FAIR-enabling repositories.
Good to have them admitting that their own system is useless :) Might be worth reaching out to authors about posting to Zenodo.
As I was only interested in 2 small datasets, I just uploaded them myself...
Closing this since it's not something we can likely solve in Pooch
I am trying to download a text file that is part of the online supplemental materials for a recent JGR article, but I am receiving an error
HTTPError: 403 Client Error: Forbidden for url:
The file exists, as you can verify by copying the url into a browser.
Though I have no idea what the problem is, I am guessing that it might have something to do with the odd file name with lots of control characters in it.
I am using the latest version on pypi (please note that pooch.__version__ is not defined and that pooch.version does not return a version string).