biolds / sosse

Selenium Open Source Search Engine & crawler
GNU Affero General Public License v3.0

Ignore SSL Certificate Errors? #4

Open webitsanu opened 8 months ago

webitsanu commented 8 months ago

I was wondering if it might be possible to introduce a checkbox / option to ignore SSL certificate errors? I am trying to crawl a website that has a problem with its certificates, so the crawler is failing. It is nice to know the certificates are problematic, but it would be amazing if the crawl could continue.

Please find the errors below:

  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1096, in _validate_conn
    conn.connect()
  File "/venv/lib/python3.11/site-packages/urllib3/connection.py", line 642, in connect
    sock_and_verified = _ssl_wrap_socket_and_match_hostname(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/connection.py", line 782, in _ssl_wrap_socket_and_match_hostname
    ssl_sock = ssl_wrap_socket(
               ^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/util/ssl_.py", line 470, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/urllib3/util/ssl_.py", line 514, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ssl.py", line 517, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ssl.py", line 1075, in _create
    self.do_handshake()
  File "/usr/lib/python3.11/ssl.py", line 1346, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)

I know Chrome has an --ignore-certificate-errors flag, but I think there is a bit more to it. It seems that urllib3 and requests are also checking certificates.
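For illustration, here is a minimal sketch of what such an option might toggle. It is not taken from the SOSSE code base: the function names `build_ssl_context` and `chrome_arguments` and the `ignore_cert_errors` parameter are hypothetical. It uses only the Python standard library; with requests, the usual equivalent is passing `verify=False` to the request call.

```python
import ssl


def build_ssl_context(ignore_cert_errors: bool = False) -> ssl.SSLContext:
    """Return a client-side TLS context, optionally skipping verification.

    `ignore_cert_errors` is a hypothetical setting, not an existing SOSSE option.
    """
    context = ssl.create_default_context()
    if ignore_cert_errors:
        # check_hostname has to be switched off before verify_mode
        # can be set to CERT_NONE, otherwise ssl raises ValueError.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def chrome_arguments(ignore_cert_errors: bool = False) -> list[str]:
    """Return the extra Chrome command-line flags for the same setting."""
    return ["--ignore-certificate-errors"] if ignore_cert_errors else []
```

Both code paths need the setting, since the browser and the Python HTTP client validate certificates independently. Skipping verification removes protection against man-in-the-middle attacks, so it would probably be best kept as an opt-in, per-domain choice.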

By the way, fantastic work on the app. It works brilliantly.

biolds commented 8 months ago

Thanks for the feature request! Indeed, it seems like a good idea to have that; I'll look into it.

You may be seeing a urllib3 traceback even though you are crawling with Chrome because urllib3 is used to check the robots.txt of the site. If that's the case, you could ignore the robots.txt by updating the domain settings in http://x.x.x.x/admin/se/domainsetting/
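As a rough sketch of the robots.txt step described above, the following shows how a robots.txt fetch could skip certificate verification using only the standard library. This is an assumption-laden illustration, not how SOSSE performs the check: `parse_robots`, `fetch_robots`, and the `ignore_cert_errors` parameter are hypothetical names.

```python
import ssl
import urllib.request
import urllib.robotparser


def parse_robots(text: str) -> urllib.robotparser.RobotFileParser:
    """Parse the body of a robots.txt file into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def fetch_robots(url: str, ignore_cert_errors: bool = False,
                 timeout: float = 10.0) -> urllib.robotparser.RobotFileParser:
    """Download and parse robots.txt (hypothetical helper).

    With ignore_cert_errors=False, an invalid certificate makes urlopen
    raise urllib.error.URLError wrapping the SSL verification error.
    """
    context = ssl.create_default_context()
    if ignore_cert_errors:
        # Hostname checking must be disabled before verification is.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return parse_robots(body)
```

A call such as `fetch_robots("https://example.com/robots.txt", ignore_cert_errors=True).can_fetch("sosse", "https://example.com/page")` would then answer whether the page may be crawled, even when the site's certificate chain is incomplete. The URL and user agent here are placeholders.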