WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org
Other
219 stars 78 forks source link

OJS scraper fails with SSLError on certain websites #1719

Open fnielsen opened 2 years ago

fnielsen commented 2 years ago

OJS scraper fails with SSLError on certain websites. For instance,

$ python -m scholia.scrape.ojs issue-url-to-quickstatements https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/issue/view/20

This can be reproduced with:

>>> requests.get("https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/issue/view/20")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 411, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/usr/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='infoteka.bg.ac.rs', port=443): Max retries exceeded with url: /ojs/index.php/Infoteka/issue/view/20 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='infoteka.bg.ac.rs', port=443): Max retries exceeded with url: /ojs/index.php/Infoteka/issue/view/20 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

A workaround is available: https://stackoverflow.com/questions/61631955/python-requests-ssl-error-during-requests apparently based on https://github.com/psf/requests/issues/4775

fnielsen commented 2 years ago

The workaround does not work:

From https://stackoverflow.com/questions/61631955/python-requests-ssl-error-during-requests

url = 'https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/issue/view/20'

import requests
import ssl
from urllib3 import poolmanager

class TLSAdapter(requests.adapters.HTTPAdapter):

    def init_poolmanager(self, connections, maxsize, block=False):
        """Create and initialize the urllib3 PoolManager."""
        ctx = ssl.create_default_context()
        ctx.set_ciphers('ALL') # DEFAULT@SECLEVEL=1')
        self.poolmanager = poolmanager.PoolManager(
                num_pools=connections,
                maxsize=maxsize,
                block=block,
                ssl_version=ssl.PROTOCOL_TLS,
                ssl_context=ctx)

session = requests.session()
session.mount('https://', TLSAdapter())
res = session.get(url)
print(res)

Still errs

fnielsen commented 2 years ago

Add verify=False neither works:

url = 'https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/issue/view/20'

import requests
res = requests.get(url, verify=False)
print(res)
fnielsen commented 2 years ago

https://stackoverflow.com/questions/70185534/ssl-sslerror-ssl-wrong-signature-type-wrong-signature-type-with-python-reque

fnielsen commented 2 years ago

More info: https://www.openssl.org/news/vulnerabilities.html#CVE-2020-1967 (Why is this not upgraded?)