EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License

Errors with pmlb.fetch_data() #140

Closed flyingsohigh2020 closed 3 years ago

flyingsohigh2020 commented 3 years ago

hello,

Just recently I encountered the following error when using pmlb.fetch_data() in python with a jupyter notebook. The python version is 3.7.4, and the pmlb version is 1.0.2a0 or 1.0.1.post3. Could you let us know what might be the problem? Thanks!

from pmlb import fetch_data

# Returns a pandas DataFrame
mushroom = fetch_data('mushroom')
mushroom.describe().transpose()


SSLCertVerificationError Traceback (most recent call last)

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw) 676 headers=headers, --> 677 chunked=chunked, 678 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw) 380 try: --> 381 self._validate_conn(conn) 382 except (SocketTimeout, BaseSSLError) as e:

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in _validate_conn(self, conn) 977 if not getattr(conn, "sock", None): # AppEngine might not have .sock --> 978 conn.connect() 979

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connection.py in connect(self) 370 server_hostname=server_hostname, --> 371 ssl_context=context, 372 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\util\ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data) 383 if HAS_SNI and server_hostname is not None: --> 384 return context.wrap_socket(sock, server_hostname=server_hostname) 385

~\AppData\Local\Continuum\anaconda3\lib\ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session) 422 context=self, --> 423 session=session 424 )

~\AppData\Local\Continuum\anaconda3\lib\ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session) 869 raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets") --> 870 self.do_handshake() 871 except (OSError, ValueError):

~\AppData\Local\Continuum\anaconda3\lib\ssl.py in do_handshake(self, block) 1138 self.settimeout(None) -> 1139 self._sslobj.do_handshake() 1140 finally:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last)

~\AppData\Roaming\Python\Python37\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies) 448 retries=self.max_retries, --> 449 timeout=timeout 450 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw) 726 retries = retries.increment( --> 727 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] 728 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\util\retry.py in increment(self, method, url, response, error, _pool, _stacktrace) 438 if new_retry.is_exhausted(): --> 439 raise MaxRetryError(_pool, url, error or ResponseError(cause)) 440

MaxRetryError: HTTPSConnectionPool(host='media.githubusercontent.com', port=443): Max retries exceeded with url: /media/EpistasisLab/pmlb/master/datasets/mushroom/mushroom.tsv.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))

During handling of the above exception, another exception occurred:

SSLError Traceback (most recent call last)

in 2 3 # Returns a pandas DataFrame ----> 4 mushroom = fetch_data('mushroom') 5 mushroom.describe().transpose()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pmlb\pmlb.py in fetch_data(dataset_name, return_X_y, local_cache_dir, dropna) 77 raise ValueError('Dataset not found in PMLB.') 78 dataset_url = get_dataset_url(GITHUB_URL, ---> 79 dataset_name, suffix) 80 dataset = pd.read_csv(dataset_url, sep='\t', compression='gzip') 81 else:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pmlb\pmlb.py in get_dataset_url(GITHUB_URL, dataset_name, suffix) 116 ) 117 --> 118 re = requests.get(dataset_url) 119 if re.status_code != 200: 120 raise ValueError('Dataset not found in PMLB.')

~\AppData\Roaming\Python\Python37\site-packages\requests\api.py in get(url, params, **kwargs) 74 75 kwargs.setdefault('allow_redirects', True) ---> 76 return request('get', url, params=params, **kwargs) 77 78

~\AppData\Roaming\Python\Python37\site-packages\requests\api.py in request(method, url, **kwargs) 59 # cases, and look like a memory leak in others. 60 with sessions.Session() as session: ---> 61 return session.request(method=method, url=url, **kwargs) 62 63

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json) 528 } 529 send_kwargs.update(settings) --> 530 resp = self.send(prep, **send_kwargs) 531 532 return resp

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in send(self, request, **kwargs) 663 # Redirect resolving generator. 664 gen = self.resolve_redirects(r, request, **kwargs) --> 665 history = [resp for resp in gen] 666 else: 667 history = []

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in (.0) 663 # Redirect resolving generator. 664 gen = self.resolve_redirects(r, request, **kwargs) --> 665 history = [resp for resp in gen] 666 else: 667 history = []

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in resolve_redirects(self, resp, req, stream, timeout, verify, cert, proxies, yield_requests, **adapter_kwargs) 243 proxies=proxies, 244 allow_redirects=False, --> 245 **adapter_kwargs 246 ) 247

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in send(self, request, **kwargs) 641 642 # Send the request --> 643 r = adapter.send(request, **kwargs) 644 645 # Total elapsed time of the request (approximately)

~\AppData\Roaming\Python\Python37\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies) 512 if isinstance(e.reason, _SSLError): 513 # This branch is for urllib3 v1.22 and later. --> 514 raise SSLError(e, request=request) 515 516 raise ConnectionError(e, request=request)

SSLError: HTTPSConnectionPool(host='media.githubusercontent.com', port=443): Max retries exceeded with url: /media/EpistasisLab/pmlb/master/datasets/mushroom/mushroom.tsv.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))

trangdata commented 3 years ago

Hi @flyingsohigh2020, I'm not sure why you received this error. I tried to replicate your code on this Google Colab notebook and everything seems to work fine. Perhaps this is a connection issue on your end?

flyingsohigh2020 commented 3 years ago

It worked before and only started to fail last week. Could you give some suggestions on how to fix it? Thanks

weixuanfu commented 3 years ago

Hi @flyingsohigh2020, I am not sure why it is not working this week on your end. Could you please share your OS version and the versions of all the pmlb dependencies here?

It could be a firewall or security setting on your computer (for example, anti-virus software) or on your organization's local network that blocks SSL certificate verification (host='media.githubusercontent.com', port=443). Could you please try fetch_data from pmlb on another computer on your local network to check whether the issue is related to the local network? If so, please contact your local network administrator. If not, you may need to check the security settings on your computer.
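One way to narrow this down is to probe the host directly, outside of pmlb and requests. The following is only a rough sketch using the Python standard library; the helper names `classify` and `probe_tls` are made up for illustration and are not part of pmlb. It uses the system's default trust store, which may differ from the CA bundle that requests uses, so treat the result as a hint rather than a definitive diagnosis.

```python
import socket
import ssl


def classify(exc):
    """Map the outcome of a TLS connection attempt to a rough cause.

    `exc` is the exception that was raised, or None if the handshake succeeded.
    """
    if exc is None:
        return "ok"
    # Check the most specific class first: SSLCertVerificationError is a
    # subclass of SSLError, which in turn is a subclass of OSError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "certificate verification failed"
    if isinstance(exc, ssl.SSLError):
        return "other TLS error"
    if isinstance(exc, OSError):
        return "network error"
    return "unknown"


def probe_tls(host, port=443, timeout=5.0):
    """Attempt a verified TLS handshake with host:port and classify the result."""
    context = ssl.create_default_context()  # verifies certificates by default
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return classify(None)
    except Exception as exc:
        return classify(exc)


# Hypothetical usage, run on each machine being compared:
# print(probe_tls("media.githubusercontent.com"))
```

If the probe reports a certificate verification failure on one network but succeeds on another, that would point toward a proxy or security product intercepting HTTPS traffic rather than toward pmlb itself.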

Tarek0 commented 3 years ago

I am also getting the same problem with the fetch_data() function. This was working perfectly before, so I don't think it's on my end. I have tried with two laptops (one Mac and one Linux) and there is no difference: I get the same 404 error.

macos version: 18.7.0
pmlb: 0.3
All dependencies:
absl-py==0.10.0 alabaster==0.7.12 alembic==1.4.1 anaconda-client==1.7.2 anaconda-navigator==1.9.7 anaconda-project==0.8.2 appnope==0.1.0 appscript==1.0.1 archspec==0.1.1 asn1crypto==0.24.0 astor==0.8.1 astroid==2.2.5 astropy==3.1.2 astunparse==1.6.3 atomicwrites==1.3.0 attrs==19.3.0 awscli==1.16.150 Babel==2.6.0 backcall==0.1.0 backports.os==0.1.1 backports.shutil-get-terminal-size==1.0.0 beautifulsoup4==4.8.2 bitarray==0.8.3 bkcharts==0.2 bleach==3.1.0 bokeh==1.0.4 boto==2.49.0 boto3==1.9.141 botocore==1.12.141 Bottleneck==1.2.1 branca==0.3.1 cachetools==4.1.1 certifi==2020.6.20 cffi==1.13.2 chardet==3.0.4 Click==7.0 cloudpickle==0.8.0 clyent==1.2.2 colorama==0.3.9 conda==4.8.4 conda-build==3.17.8 conda-package-handling==1.6.0 conda-verify==3.1.1 configparser==4.0.2 ConfigSpace==0.4.13 contextlib2==0.5.5 convertdate==2.2.0 cryptography==2.8 cycler==0.10.0 Cython==0.29.6 cytoolz==0.9.0.1 dask==1.1.4 databricks-cli==0.9.1 decorator==4.4.1 defusedxml==0.6.0 descartes==1.1.0 dill==0.3.1.1 distributed==1.26.0 docker==4.2.0 docutils==0.16 entrypoints==0.3 et-xmlfile==1.0.1 Faker==2.0.2 fastcache==1.0.2 feather-format==0.4.0 filelock==3.0.10 findspark==1.3.0 Flask==1.0.2 folium==0.10.0 future==0.18.2 gast==0.3.3 geojson==2.5.0 gevent==1.4.0 gitdb==4.0.2 GitPython==3.1.0 glob2==0.7 gmpy2==2.0.8 google-auth==1.21.1 google-auth-oauthlib==0.4.1 google-pasta==0.2.0 gorilla==0.3.0 greenlet==0.4.15 grpcio==1.32.0 gunicorn==20.0.4 h5py==2.10.0 heapdict==1.0.0 holidays==0.9.11 html5lib==1.0.1 idna==2.8 imageio==2.5.0 imagesize==1.1.0 importlib-metadata==1.5.0 inflect==4.1.0 ipykernel==5.1.4 ipython==7.12.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isort==4.3.16 itsdangerous==1.1.0 jaraco.itertools==5.0.0 JayDeBeApi==1.1.1 jdcal==1.4 jedi==0.16.0 Jinja2==2.11.1 jmespath==0.9.4 joblib==0.14.1 JPype1==0.7.0 json5==0.9.0 jsonschema==3.2.0 jupyter==1.0.0 jupyter-client==5.3.4 jupyter-console==6.1.0 jupyter-core==4.6.1 jupyterlab==1.2.6 jupyterlab-server==1.0.6 
Keras-Applications==1.0.8 Keras-Preprocessing==1.1.2 keyring==18.0.0 kiwisolver==1.0.1 lazy-object-proxy==1.3.1 liac-arff==2.4.0 libarchive-c==2.9 lief==0.9.0 llvmlite==0.28.0 locket==0.2.0 lockfile==0.12.2 lunardate==0.2.0 lxml==4.3.2 Mako==1.1.2 Markdown==3.2.2 MarkupSafe==1.1.1 matplotlib==3.2.2 mccabe==0.6.1 mistune==0.8.4 mkl-fft==1.1.0 mkl-random==1.1.0 mkl-service==2.3.0 mlflow==1.7.0 more-itertools==8.2.0 mpmath==1.1.0 msgpack==0.6.1 multipledispatch==0.6.0 multiprocess==0.70.9 navigator-updater==0.2.1 nbconvert==5.6.1 nbformat==5.0.4 networkx==2.2 nltk==3.4 nose==1.3.7 notebook==6.0.3 numba==0.43.1 numexpr==2.6.9 numpy==1.18.1 numpydoc==0.8.0 oauthlib==3.1.0 olefile==0.46 openpyxl==2.6.1 opt-einsum==3.3.0 ortools==7.5.7466 packaging==19.0 pandas==0.24.2 pandocfilters==1.4.2 parso==0.6.1 partd==0.3.10 path.py==11.5.0 pathlib2==2.3.3 pathos==0.2.3 patsy==0.5.1 pep8==1.7.1 pexpect==4.8.0 pickleshare==0.7.5 Pillow==6.2.1 pkginfo==1.5.0.1 pluggy==0.9.0 ply==3.11 pmlb==0.3 pox==0.2.7 ppft==1.6.6.1 prometheus-client==0.7.1 prometheus-flask-exporter==0.13.0 prompt-toolkit==3.0.3 protobuf==3.11.3 psutil==5.6.7 ptyprocess==0.6.0 py==1.8.0 py4j==0.10.7 pyarrow==0.13.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycodestyle==2.5.0 pycosat==0.6.3 pycparser==2.19 pycrypto==2.6.1 pycurl==7.43.0.2 pydot==1.4.1 pyflakes==2.1.1 Pygments==2.5.2 pylint==2.3.1 PyMeeus==0.3.6 pynisher==0.5.0 pyodbc==4.0.26 pyOpenSSL==19.1.0 pyparsing==2.3.1 pyrsistent==0.15.7 PySocks==1.7.1 pyspark==2.4.4 pystan==2.19.1.1 pytest==4.3.1 pytest-arraydiff==0.3 pytest-astropy==0.5.0 pytest-doctestplus==0.3.0 pytest-openfiles==0.3.2 pytest-remotedata==0.3.1 python-dateutil==2.8.1 python-editor==1.0.4 pytz==2019.3 PyWavelets==1.0.2 PyYAML==3.13 pyzmq==18.1.1 QtAwesome==0.5.7 qtconsole==4.4.3 QtPy==1.9.0 querystring-parser==1.2.4 requests==2.22.0 requests-oauthlib==1.3.0 rope==0.12.0 rsa==3.4.2 ruamel-yaml==0.15.71 s3transfer==0.2.0 scikit-image==0.14.2 scikit-learn==0.22.2.post1 scipy==1.4.1 seaborn==0.9.0 
Send2Trash==1.5.0 setuptools-git==1.2 simplegeneric==0.8.1 simplejson==3.17.0 singledispatch==3.4.0.3 six==1.14.0 smmap==3.0.1 snowballstemmer==1.2.1 sobol-seq==0.2.0 sortedcollections==1.1.2 sortedcontainers==2.1.0 soupsieve==1.9.4 Sphinx==1.8.5 sphinxcontrib-websupport==1.1.0 spyder==3.3.3 spyder-kernels==0.4.2 SQLAlchemy==1.3.1 sqlparse==0.3.1 statsmodels==0.9.0 sympy==1.3 tables==3.5.1 tabulate==0.8.6 tblib==1.3.2 tensorboard==2.3.0 tensorboard-plugin-wit==1.7.0 tensorflow==2.3.0 tensorflow-estimator==2.3.0 termcolor==1.1.0 terminado==0.8.3 testpath==0.4.4 text-unidecode==1.3 toolz==0.9.0 tornado==6.0.3 tqdm==4.42.1 traitlets==4.3.3 unicodecsv==0.14.1 urllib3==1.24.3 vincent==0.4.4 wcwidth==0.1.8 webencodings==0.5.1 websocket-client==0.57.0 Werkzeug==0.14.1 widgetsnbextension==3.5.1 wrapt==1.11.1 wurlitzer==1.0.2 xlrd==1.2.0 XlsxWriter==1.1.5 xlwings==0.15.4 xlwt==1.3.0 zict==0.1.4 zipp==2.1.0

lacava commented 3 years ago

Hi @Tarek0, please upgrade to 1.0 or newer. Version 1.0 introduced breaking changes, as noted in the README.
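For anyone unsure which version they have installed, here is a small sketch of a check. The helper names `needs_upgrade` and `installed_pmlb_version` are hypothetical, and the lookup relies on `importlib.metadata`, which is only in the standard library from Python 3.8 onward; on older interpreters the lookup simply returns None.

```python
def needs_upgrade(version, minimum_major=1):
    """Return True if the leading (major) version component is below minimum_major."""
    major = int(version.split(".")[0])
    return major < minimum_major


def installed_pmlb_version():
    """Return the installed pmlb version string, or None if it cannot be determined."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        return None
    try:
        return version("pmlb")
    except PackageNotFoundError:
        return None


# Example with the versions mentioned in this thread:
# needs_upgrade("0.3") is True, needs_upgrade("1.0.2a0") is False
```

If the check indicates an old release, `pip install --upgrade pmlb` is the usual way to upgrade.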

Tarek0 commented 3 years ago

Thanks, that worked like a charm. My bad, I should have checked the README for updates.

flyingsohigh2020 commented 3 years ago

Thanks for the tip. It works fine now. Weihua