fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

HTTPError: 403 Client Error: Forbidden for url: #178

Closed MarkWieczorek closed 4 years ago

MarkWieczorek commented 4 years ago

I am trying to download a text file that is part of the online supplemental materials for a recent JGR article, but I am receiving the error "HTTPError: 403 Client Error: Forbidden for url:". The file exists, as you can verify by copying the URL into a browser.

Though I have no idea what the problem is, I am guessing that it might have something to do with the odd file name with lots of control characters in it.

I am using the latest version on pypi (please note that pooch.__version__ is not defined and that pooch.version does not return a version string).

In [1]: from pooch import retrieve
In [2]: from pooch import HTTPDownloader
In [3]: fname = retrieve(
   ...:     url="https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1029%2F2018JE005854&f
   ...: ile=jgre21147-sup-0003-Table_SI-S01.txt",  # noqa: E501
   ...:     known_hash="sha256:a2c89bb3af70cd76654f6ab6b4e0844f972055970b593ec29153d59ecc78180c",  # noqa: E501
   ...:     downloader=HTTPDownloader(progressbar=True),
   ...: )
Downloading data from 'https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1029%2F2018JE005854&file=jgre21147-sup-0003-Table_SI-S01.txt' to file '/Users/lunokhod/Library/Caches/pooch/0394b1ebfc19775b033e2e61fafffb1e-downloadSupplement'.
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-4-452acbd9d492> in <module>
      2     url="https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1029%2F2018JE005854&file=jgre21147-sup-0003-Table_SI-S01.txt",  # noqa: E501
      3     known_hash="sha256:a2c89bb3af70cd76654f6ab6b4e0844f972055970b593ec29153d59ecc78180c",  # noqa: E501
----> 4     downloader=HTTPDownloader(progressbar=True),
      5 )

/usr/local/lib/python3.7/site-packages/pooch/core.py in retrieve(url, known_hash, fname, path, processor, downloader)
    216             downloader = choose_downloader(url)
    217
--> 218         stream_download(url, full_path, known_hash, downloader, pooch=None)
    219
    220         if known_hash is None:

/usr/local/lib/python3.7/site-packages/pooch/core.py in stream_download(url, fname, known_hash, downloader, pooch)
    745     # before overwriting the original.
    746     with temporary_file(path=str(fname.parent)) as tmp:
--> 747         downloader(url, tmp, pooch)
    748         hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    749         shutil.move(tmp, str(fname))

/usr/local/lib/python3.7/site-packages/pooch/downloaders.py in __call__(self, url, output_file, pooch)
    166         try:
    167             response = requests.get(url, **kwargs)
--> 168             response.raise_for_status()
    169             content = response.iter_content(chunk_size=self.chunk_size)
    170             if self.progressbar:

/usr/local/lib/python3.7/site-packages/requests/models.py in raise_for_status(self)
    939
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942
    943     def close(self):

HTTPError: 403 Client Error: Forbidden for url: https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1029%2F2018JE005854&file=jgre21147-sup-0003-Table_SI-S01.txt
leouieda commented 4 years ago

I am using the latest version on pypi (please note that pooch.__version__ is not defined and that pooch.version does not return a version string).

Ah right, that was something I changed and completely forgot to document. pooch.version is a module and pooch.version.full_version should be the string. We should still define __version__ probably for compatibility.

Though I have no idea what the problem is, I am guessing that it might have something to do with the odd file name with lots of control characters in it.

That shouldn't be a problem since Python 3 uses Unicode by default and the characters seem to be escaped properly.

The 403 error code means that the server understood the request but refused to fulfill it because of something about the client (in this case, requests). My guess is that the journal doesn't allow automated downloads of the data, which I've heard of happening before.

I tried it with curl -L "https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1029%2F2018JE005854&file=jgre21147-sup-0003-Table_SI-S01.txt" -o meh.txt and got back a page saying that cookies must be enabled for login. So it's probably the journal not allowing non-browser-based downloads. Which is ridiculous, and why no one should ever put their data in journal supplements.

leouieda commented 4 years ago

I wonder if there is something we can do about this. Maybe fool the server somehow? But this is getting a bit out of my league.

@andersy005 @hugovk @remram44 do any of you know anything about this (or maybe know someone who does)?

MarkWieczorek commented 4 years ago

I just contacted the journal to let them know about this. I'll let you know what they say. (For info, requests has the same problem as curl.)

Personally, I think that it is crazy that we have to pay to publish in a journal, and then the journal refuses to archive our datasets. It's even worse in scenarios like this one, where the archived data are not easily accessible...

remram44 commented 4 years ago

I think you might be hitting some kind of CloudFlare or Wiley protection. I am also hitting the 403, but not every time. Sometimes it will first redirect to ...&cookieSet=1. I think you might want to copy the file somewhere else and link to it, as this seems to have been set up by Wiley specifically to prevent automated hits.

leouieda commented 4 years ago

@remram44 yeah, I was getting that until I told curl to follow redirects. I remember seeing some rants on Twitter about publishers forbidding scraping, which I'm guessing is what we're running into.

@MarkWieczorek yeah, it's very frustrating. Let us know what they write back. But I'm not surprised if they just say "no".

hugovk commented 4 years ago

Sometimes setting a user-agent in the request header can make things work:
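For example (a sketch, not a verified workaround: the User-Agent string is an assumption, and Wiley's server may still refuse; the same headers dict could also be passed to pooch's HTTPDownloader, which forwards extra keyword arguments to requests.get):

```python
import requests

# A browser-like User-Agent (an assumption; any modern browser string would do).
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0"
    )
}

url = (
    "https://agupubs.onlinelibrary.wiley.com/action/downloadSupplement"
    "?doi=10.1029%2F2018JE005854&file=jgre21147-sup-0003-Table_SI-S01.txt"
)

# Build (but do not send) the request, just to show where the header goes:
prepared = requests.Request("GET", url, headers=browser_headers).prepare()
print(prepared.headers["User-Agent"])
```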

hugovk commented 4 years ago

I am using the latest version on pypi (please note that pooch.__version__ is not defined and that pooch.version does not return a version string).

Ah right, that was something I changed and completely forgot to document. pooch.version is a module and pooch.version.full_version should be the string. We should still define __version__ probably for compatibility.

Please see PR https://github.com/fatiando/pooch/pull/179.

MarkWieczorek commented 4 years ago

And here is the response from the American Geophysical Union:

According to Wiley and AGU guidelines, Supporting Information is not regarded as a proper repository for datasets, for exactly the kind of issues you experience. That's exactly why we request our authors to upload their datasets on Zenodo or other FAIR-enabling repositories.

leouieda commented 4 years ago

Good to have them admitting that their own system is useless :) Might be worth reaching out to authors about posting to Zenodo.

MarkWieczorek commented 4 years ago

As I was only interested in 2 small datasets, I just uploaded them myself...

leouieda commented 4 years ago

Closing this, since it's not something we can likely solve in Pooch.