Closed Mikanebu closed 4 years ago
This is happening because Cloudflare treats requests from urllib as bot traffic and blocks them. For now, we have turned off the Browser Integrity Check on Cloudflare, so the 403 should no longer be an issue.
TESTED: working:
>>> from urllib.request import urlopen
>>> urlopen('https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv')
<http.client.HTTPResponse object at 0x7f91fcb58be0>
>>> response = urlopen('https://pkgstore.datahub.io/core/country-list/data_csv/data/d7c9d7cfb42cb69f4422dec222dbbaa8/data_csv.csv')
>>> response.read()
b'Name,Code\r\nAfghanistan,AF\r\n\xc3\x85land Islands,AX\r\nAlbania,AL\r\nAlgeria,DZ\r\nAmerican Samoa,AS\r\nAndorra,AD\r\nAngola,AO\r\nAnguilla,AI\r\nAntarctica,AQ\r\nAntigua and Barbuda,AG\r\nArgentina,AR\r\nArmenia,AM\r\nAruba,AW\r\nAustralia,AU\r\nAustria,AT\r\nAzerbaijan,AZ\r\nBahamas,BS\r\nBahrain,BH\r\nBangladesh,BD\r\nBarbados,BB\r\nBelarus,BY\r\nBelgium,BE\r\nB...
FIXED by changing the Cloudflare settings.
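For context, here is a minimal sketch of what urllib announces itself as when no headers are supplied. It only inspects the default opener and makes no network request. The helper name `default_user_agent` is made up for illustration, and the assumption that Cloudflare's check keys on this header is an inference from the behaviour described in this thread, not something confirmed by Cloudflare.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent header urllib sends when none is supplied."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples added to every request
    # that does not already carry a header of the same name.
    return dict(opener.addheaders).get("User-agent")


if __name__ == "__main__":
    # Prints something of the form Python-urllib/<version>
    print(default_user_agent())
```

A value starting with `Python-urllib/` is easy for a bot filter to recognise, which would be consistent with the 403 responses reported here.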
I'm facing this issue again:
In [5]: from urllib.request import urlopen
In [6]: urlopen('https://pkgstore.datahub.io/core/country-codes/country-codes_csv/data/3b9fd39bdadd7edd7f7dcee708f47e1b/country-codes_csv.csv')
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-6-24c058eb4da4> in <module>
----> 1 urlopen('https://pkgstore.datahub.io/core/country-codes/country-codes_csv/data/3b9fd39bdadd7edd7f7dcee708f47e1b/country-codes_csv.csv')
~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
220 else:
221 opener = _opener
--> 222 return opener.open(url, data, timeout)
223
224 def install_opener(opener):
~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in open(self, fullurl, data, timeout)
529 for processor in self.process_response.get(protocol, []):
530 meth = getattr(processor, meth_name)
--> 531 response = meth(req, response)
532
533 return response
~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in http_response(self, request, response)
638 # request was successfully received, understood, and accepted.
639 if not (200 <= code < 300):
--> 640 response = self.parent.error(
641 'http', request, response, code, msg, hdrs)
642
~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in error(self, proto, *args)
567 if http_err:
568 args = (dict, 'default', 'http_error_default') + orig_args
--> 569 return self._call_chain(*args)
570
571 # XXX probably also want an abstract factory that knows when it makes
~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
500 for handler in handlers:
501 func = getattr(handler, meth_name)
--> 502 result = func(*args)
503 if result is not None:
504 return result
~/.pyenv/versions/3.8.0/lib/python3.8/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
647 class HTTPDefaultErrorHandler(BaseHandler):
648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp)
650
651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
@AcckiyGerman Can you reopen it?
Hi. Facing the same issue as well. @AcckiyGerman is it possible to reopen it?
@jbelisario can you report the exact error you are encountering?
I concur
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
Type "copyright", "credits" or "license" for more information.
IPython 5.5.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import datapackage
In [2]: import pandas
In [3]: package = datapackage.Package('https://datahub.io/JohnSnowLabs/population-figures-by-country/datapackage.json')
In [4]: pandas.read_csv(package.resources[4].descriptor['path'])
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-4-4a930361f83a> in <module>()
----> 1 pandas.read_csv(package.resources[4].descriptor['path'])
/home/kranz/.local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name
/home/kranz/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
429 # See https://github.com/python/mypy/issues/1297
430 fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 431 filepath_or_buffer, encoding, compression
432 )
433 kwds["compression"] = compression
/home/kranz/.local/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
170
171 if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
--> 172 req = urlopen(filepath_or_buffer)
173 content_encoding = req.headers.get("Content-Encoding", None)
174 if content_encoding == "gzip":
/home/kranz/.local/lib/python3.6/site-packages/pandas/io/common.py in urlopen(*args, **kwargs)
139 import urllib.request
140
--> 141 return urllib.request.urlopen(*args, **kwargs)
142
143
/usr/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
221 else:
222 opener = _opener
--> 223 return opener.open(url, data, timeout)
224
225 def install_opener(opener):
/usr/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
530 for processor in self.process_response.get(protocol, []):
531 meth = getattr(processor, meth_name)
--> 532 response = meth(req, response)
533
534 return response
/usr/lib/python3.6/urllib/request.py in http_response(self, request, response)
640 if not (200 <= code < 300):
641 response = self.parent.error(
--> 642 'http', request, response, code, msg, hdrs)
643
644 return response
/usr/lib/python3.6/urllib/request.py in error(self, proto, *args)
568 if http_err:
569 args = (dict, 'default', 'http_error_default') + orig_args
--> 570 return self._call_chain(*args)
571
572 # XXX probably also want an abstract factory that knows when it makes
/usr/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
502 for handler in handlers:
503 func = getattr(handler, meth_name)
--> 504 result = func(*args)
505 if result is not None:
506 return result
/usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
648 class HTTPDefaultErrorHandler(BaseHandler):
649 def http_error_default(self, req, fp, code, msg, hdrs):
--> 650 raise HTTPError(req.full_url, code, msg, hdrs, fp)
651
652 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
@sglavoie can you investigate this and see if you can find a fix?
It works if I spoof the user agent:
import urllib.request
# Send a browser-like User-Agent so the request is not rejected as a bot
req = urllib.request.Request(package.resources[4].descriptor['path'], headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
population = pandas.read_csv(urllib.request.urlopen(req))
So it really looks like the old problem.
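A variation on the same workaround, as a rough sketch: instead of wrapping each URL in a `Request`, install a process-wide opener with a custom User-Agent. The function name `install_user_agent` and the default string `example-client/1.0` are hypothetical; which User-Agent values the server actually accepts is not something this thread establishes, so pass whatever string works for you.

```python
import urllib.request


def install_user_agent(user_agent="example-client/1.0"):
    """Make every later urllib.request.urlopen() call send user_agent.

    The default value is a placeholder, not a string known to pass
    Cloudflare's checks.
    """
    opener = urllib.request.build_opener()
    # Replace the default ('User-agent', 'Python-urllib/x.y') entry.
    opener.addheaders = [("User-Agent", user_agent)]
    # install_opener() sets the opener used by the module-level urlopen().
    urllib.request.install_opener(opener)
    return opener
```

Since the traceback above shows pandas calling `urllib.request.urlopen` directly, a plain `pandas.read_csv(url)` should pick up the installed opener in that pandas version. Note that this changes behaviour for the whole process, which may or may not be what you want.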
@rufuspollock, I can confirm the issue is still ongoing and that the workaround from @wtkranz works like a charm.
@zelima, can we make sure once more that the "Browser Integrity Check" is turned off on Cloudflare, please?
FIXED: Closing as this is now working as expected. The "Browser Integrity Check" setting has been turned off again, since that was the original solution. If this is not the desired outcome this time, please re-open the issue so we can take a closer look.
If the solution is deemed too drastic, it can be scoped more narrowly by creating page rules.
Result as of now (with previously non-working code):
>>> from urllib.request import urlopen
>>>
>>> response = urlopen('https://pkgstore.datahub.io/core/country-codes/country-codes_csv/data/3b9fd39bdadd7edd7f7dcee708f47e1b/country-codes_csv.csv')
>>> response.status
200
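If the 403 comes back, a small probe can report the status without a full traceback. This is only a sketch: `probe_status` is a hypothetical helper, and the `opener` parameter exists purely so the function can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def probe_status(url, user_agent=None, opener=urlopen):
    """Return the HTTP status code for url, including error codes such as 403."""
    # Only override the User-Agent when the caller asks for it.
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = Request(url, headers=headers)
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except HTTPError as err:
        # urllib raises HTTPError for non-2xx responses; the code is on the exception.
        return err.code
```

Calling it once with and once without a `user_agent` for the same URL would show whether the response depends on that header.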
We are getting a 403 Forbidden response from Cloudflare when trying to open a URL with urllib.
How to reproduce
Use the urllib library to fetch resources
Expected behaviour
No forbidden message