datopian / datahub-qa

:package: Bugs, issues and suggestions for datahub.io
https://datahub.io/

HTTP Error 403: Forbidden for resources #81

Closed Mikanebu closed 4 years ago

Mikanebu commented 6 years ago

We get a 403 Forbidden response from Cloudflare when opening a resource URL with urllib.

from urllib.request import urlopen
urlopen('https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv')

Traceback (most recent call last):
  ...
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

How to reproduce

Expected behaviour

No forbidden message

zelima commented 6 years ago

This happens because Cloudflare treats urllib requests as bot traffic and blocks them. For now, we have turned off the Browser Integrity Check on Cloudflare, so the 403 should no longer be an issue.
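
For context, a minimal sketch of how urllib identifies itself by default. The version suffix depends on the local Python install, and whether Cloudflare keys on this header alone is an assumption based on the explanation above, not something confirmed here.

```python
import urllib.request

# A fresh opener carries the headers urllib adds to every request by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default User-agent has the form "Python-urllib/<version>", which is
# easy for bot-detection rules to recognise.
print(default_headers["User-agent"])
```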

AcckiyGerman commented 6 years ago

TESTED: working:

>>> from urllib.request import urlopen
>>> urlopen('https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv')
<http.client.HTTPResponse object at 0x7f91fcb58be0>
>>> response = urlopen('https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv')
>>> response.read()
b'Name,Code\r\nAfghanistan,AF\r\n\xc3\x85land Islands,AX\r\nAlbania,AL\r\nAlgeria,DZ\r\nAmerican Samoa,AS\r\nAndorra,AD\r\nAngola,AO\r\nAnguilla,AI\r\nAntarctica,AQ\r\nAntigua and Barbuda,AG\r\nArgentina,AR\r\nArmenia,AM\r\nAruba,AW\r\nAustralia,AU\r\nAustria,AT\r\nAzerbaijan,AZ\r\nBahamas,BS\r\nBahrain,BH\r\nBangladesh,BD\r\nBarbados,BB\r\nBelarus,BY\r\nBelgium,BE\r\nB...

FIXED by changing the Cloudflare settings.

francbartoli commented 4 years ago

I'm facing this issue again:

In [5]: from urllib.request import urlopen

In [6]: urlopen('https://pkgstore.datahub.io/core/country-codes/country-codes_csv/data/3b9fd39bdadd7edd7f7dcee708f47e1b/country-codes_csv.csv')
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-6-24c058eb4da4> in <module>
----> 1 urlopen('https://pkgstore.datahub.io/core/country-codes/country-codes_csv/data/3b9fd39bdadd7edd7f7dcee708f47e1b/country-codes_csv.csv')

~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532
    533         return response

~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in http_response(self, request, response)
    638         # request was successfully received, understood, and accepted.
    639         if not (200 <= code < 300):
--> 640             response = self.parent.error(
    641                 'http', request, response, code, msg, hdrs)
    642

~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570
    571 # XXX probably also want an abstract factory that knows when it makes

~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    500         for handler in handlers:
    501             func = getattr(handler, meth_name)
--> 502             result = func(*args)
    503             if result is not None:
    504                 return result

~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

@AcckiyGerman Can you reopen it?

jbelisario commented 4 years ago

Hi. Facing the same issue as well. @AcckiyGerman is it possible to reopen it?

rufuspollock commented 4 years ago

@jbelisario can you report the exact error you are encountering?

wtkranz commented 4 years ago

I concur

Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
Type "copyright", "credits" or "license" for more information.

IPython 5.5.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import datapackage

In [2]: import pandas

In [3]: package = datapackage.Package('https://datahub.io/JohnSnowLabs/population-figures-by-country/datapackage.json')

In [4]: pandas.read_csv(package.resources[4].descriptor['path'])
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-4-4a930361f83a> in <module>()
----> 1 pandas.read_csv(package.resources[4].descriptor['path'])

/home/kranz/.local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

/home/kranz/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    429     # See https://github.com/python/mypy/issues/1297
    430     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 431         filepath_or_buffer, encoding, compression
    432     )
    433     kwds["compression"] = compression

/home/kranz/.local/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    170 
    171     if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
--> 172         req = urlopen(filepath_or_buffer)
    173         content_encoding = req.headers.get("Content-Encoding", None)
    174         if content_encoding == "gzip":

/home/kranz/.local/lib/python3.6/site-packages/pandas/io/common.py in urlopen(*args, **kwargs)
    139     import urllib.request
    140 
--> 141     return urllib.request.urlopen(*args, **kwargs)
    142 
    143 

/usr/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

/usr/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    530         for processor in self.process_response.get(protocol, []):
    531             meth = getattr(processor, meth_name)
--> 532             response = meth(req, response)
    533 
    534         return response

/usr/lib/python3.6/urllib/request.py in http_response(self, request, response)
    640         if not (200 <= code < 300):
    641             response = self.parent.error(
--> 642                 'http', request, response, code, msg, hdrs)
    643 
    644         return response

/usr/lib/python3.6/urllib/request.py in error(self, proto, *args)
    568         if http_err:
    569             args = (dict, 'default', 'http_error_default') + orig_args
--> 570             return self._call_chain(*args)
    571 
    572 # XXX probably also want an abstract factory that knows when it makes

/usr/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    502         for handler in handlers:
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:
    506                 return result

/usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

rufuspollock commented 4 years ago

@sglavoie can you investigate this and see if you can find a fix?

wtkranz commented 4 years ago

It works if I spoof the user agent:

import urllib.request

req = urllib.request.Request(package.resources[4].descriptor['path'], headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
population = pandas.read_csv(urllib.request.urlopen(req))

So it really looks like the old problem.
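
A variation on the same workaround, as a sketch only: installing a global opener with a custom User-Agent, so that code calling urllib.request.urlopen internally (as pandas does in the traceback above) picks it up without wrapping each request. The helper name install_user_agent and the User-Agent string are illustrative and not part of datahub, datapackage, or pandas.

```python
import urllib.request


def install_user_agent(user_agent):
    """Make later urllib.request.urlopen() calls send this User-Agent."""
    opener = urllib.request.build_opener()
    # Replace the default "Python-urllib/<version>" header.
    opener.addheaders = [("User-Agent", user_agent)]
    # install_opener changes process-wide state: every urlopen() call
    # made after this point goes through this opener.
    urllib.request.install_opener(opener)
    return opener


# Hypothetical User-Agent value, chosen only for illustration.
opener = install_user_agent("Mozilla/5.0 (compatible; example-client)")
```

With this in place, a plain pandas.read_csv(package.resources[4].descriptor['path']) should go through the installed opener, assuming a pandas version that fetches URLs via urllib.request.urlopen as shown in the traceback.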

sglavoie commented 4 years ago

@rufuspollock, I can confirm the issue is still ongoing and that the workaround from @wtkranz works like a charm.

@zelima, can we make sure once more that the "Browser Integrity Check" is turned off on Cloudflare, please?

sglavoie commented 4 years ago

FIXED: Closing as this is now working as expected. The "Browser Integrity Check" setting has been turned off as this was the original solution. If this is not what is desired this time, please re-open the issue so we can have a closer look again.

If the solution is deemed too drastic, it can be moderated by creating page rules.

Result as of now (with previously non-working code):

>>> from urllib.request import urlopen
>>>
>>> response = urlopen('https://pkgstore.datahub.io/core/country-codes/country-codes_csv/data/3b9fd39bdadd7edd7f7dcee708f47e1b/country-codes_csv.csv')
>>> response.status
200
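
For anyone who wants to re-check this later, a small sketch of a helper that returns the status code instead of raising, so a 403 and a 200 can be compared side by side. The function name fetch_status, the optional user_agent parameter, and the 10-second timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent=None):
    """Return the HTTP status code for url instead of raising on 4xx/5xx."""
    # Only override the User-Agent when one is given; otherwise urllib
    # sends its default "Python-urllib/<version>" header.
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # HTTPError carries the status code, e.g. 403 when the request is blocked.
        return error.code
```

Calling fetch_status on one of the pkgstore.datahub.io URLs above, once without and once with a browser-like user_agent, shows whether the block depends on the User-Agent header.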