Add exception handling for the `--gather` option

fukusuket commented 1 year ago

Hello, thank you for maintaining the tool :)

When crawling URLs with the --gather option, there is no guarantee that every URL is reachable. Therefore, I changed it so that crawling continues with the other URLs even if a connection error occurs.

Changes Proposed

Added exception handling for connection errors when running with the --gather option. If a connection error occurs:
- print out Failed to connect to [url].
- continue crawling the rest of the URLs
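
The behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the actual diff of this PR: `gather_responses`, `fake_fetch`, and the `connection_errors` parameter are hypothetical names, and the fetch function is injected so the example runs without network access.

```python
from typing import Any, Callable, Iterable, List, Tuple, Type


def gather_responses(
    links: Iterable[str],
    fetch: Callable[[str], Any],
    connection_errors: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> Tuple[List[Tuple[str, Any]], List[str]]:
    """Fetch every link, skipping the ones that cannot be reached."""
    responses: List[Tuple[str, Any]] = []
    failed: List[str] = []
    for link in links:
        try:
            resp = fetch(link)
        except connection_errors:
            # Report the unreachable URL and move on instead of aborting the run.
            print(f"Failed to connect to [{link}].")
            failed.append(link)
            continue
        responses.append((link, resp))
    return responses, failed


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP call: treats every .onion host as unreachable."""
    if ".onion" in url:
        raise ConnectionError(f"cannot resolve {url}")
    return f"<html>{url}</html>"


responses, failed = gather_responses(
    ["https://thehiddenwiki.org", "http://example.onion/", "https://www.torproject.org"],
    fake_fetch,
)
```

With the requests library, the same loop would presumably be called with `fetch=requests.get` and `connection_errors=(requests.exceptions.ConnectionError,)`, since that is the exception type raised in the traceback below.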

Explanation of Changes

When running with the --gather option, if even one URL cannot be reached, the program exits with an exception, as shown below.

fukusuke@fukusukenoAir TorBot % poetry run python run.py --gather
Gathering data for https://thehiddenwiki.org
Processing... |                                | 3/110/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(
Processing... |#                               | 6/110Traceback (most recent call last):
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/Users/fukusuke/.pyenv/versions/3.10.8/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/Users/fukusuke/.pyenv/versions/3.10.8/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/fukusuke/.pyenv/versions/3.10.8/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/fukusuke/.pyenv/versions/3.10.8/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/fukusuke/.pyenv/versions/3.10.8/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/Users/fukusuke/.pyenv/versions/3.10.8/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x15ec34310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x15ec34310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/fukusuke/Scripts/Python/TorBot/run.py", line 8, in <module>
    torbot.perform_action()
  File "/Users/fukusuke/Scripts/Python/TorBot/torbot/main.py", line 86, in perform_action
    collect_data(args.url)
  File "/Users/fukusuke/Scripts/Python/TorBot/torbot/modules/collect_data.py", line 61, in collect_data
    resp = requests.get(link)
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x15ec34310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

Screenshots of new feature/change

With this PR, the URL that caused the connection error is printed and processing continues, as follows.

fukusuke@fukusukenoAir TorBot % poetry run python run.py --gather
Gathering data for https://thehiddenwiki.org
Processing... |                                | 3/110/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.9/lib/python3.9/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(
Processing... |#                               | 6/110
Failed to connect to [http://s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion/].
Processing... |##                              | 7/110
...
Processing... |#######################         | 82/110
Failed to connect to [http://bible4u2lvhacg4b3to2e2veqpwmrc2c3tjf2wuuqiz332vlwmr4xbad.onion/].
Processing... |########################        | 83/110
Failed to connect to [http://kx5thpx2olielkihfyo4jgjqfb7zx7wxr3sd4xzt26ochei4m6f7tayd.onion/].
Processing... |########################        | 84/110
Failed to connect to [http://nv3x2jozywh63fkohn5mwp2d73vasusjixn3im3ueof52fmbjsigw6ad.onion/].
Processing... |################################| 110/110
Data has been saved to /Users/fukusuke/Scripts/Python/TorBot/data/torbot_2023-01-16T19:53:48.130615.csv.

I would appreciate it if you could review. Regards.

PSNAppz commented 1 year ago

@fukusuket Thanks for noticing it and for the PR! @KingAkeem What do you think?

KingAkeem commented 1 year ago

This looks good to me, if you've tested it out and it's working then I'm fine with merging. If not, I'll check it out sometime this week.

fukusuket commented 1 year ago

@KingAkeem Thanks for the quick review :) I have tested as follows.

Log level info (the XMLParsedAsHTMLWarning was already present before this fix, so it is unrelated to this change):

fukusuke@fukusukenoAir TorBot % poetry run python run.py --gather
Gathering data for https://thehiddenwiki.org
Processing... |                                | 3/110/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(
Processing... |################################| 110/110
Data has been saved to /Users/fukusuke/Scripts/Python/TorBot/data/torbot_2023-01-18T00:44:46.687196.csv.
fukusuke@fukusukenoAir TorBot %

Log level debug:

fukusuke@fukusukenoAir TorBot % export LOG_LEVEL=debug
fukusuke@fukusukenoAir TorBot % poetry run python run.py --gather
Gathering data for https://thehiddenwiki.org
18-Jan-23 00:48:47 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:48 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
18-Jan-23 00:48:48 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:48 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
Processing... |                                | 1/11018-Jan-23 00:48:48 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:49 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
Processing... |                                | 2/11018-Jan-23 00:48:49 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:50 - DEBUG - https://thehiddenwiki.org:443 "GET /blog/ HTTP/1.1" 200 None
Processing... |                                | 3/11018-Jan-23 00:48:50 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:51 - DEBUG - https://thehiddenwiki.org:443 "GET /feed/ HTTP/1.1" 200 None
/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(
Processing... |#                               | 4/11018-Jan-23 00:48:51 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:52 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
Processing... |#                               | 5/11018-Jan-23 00:48:52 - DEBUG - Starting new HTTP connection (1): torproject.org:80
18-Jan-23 00:48:53 - DEBUG - http://torproject.org:80 "GET / HTTP/1.1" 301 299
18-Jan-23 00:48:53 - DEBUG - Starting new HTTPS connection (1): www.torproject.org:443
18-Jan-23 00:48:53 - DEBUG - https://www.torproject.org:443 "GET / HTTP/1.1" 200 4362
Processing... |#                               | 6/11018-Jan-23 00:48:53 - DEBUG - Starting new HTTP connection (1): s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion:80
18-Jan-23 00:48:53 - DEBUG - HTTPConnectionPool(host='s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1290514b0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
18-Jan-23 00:48:53 - DEBUG - Failed to connect to [http://s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion/].
Processing... |##                              | 7/11
...

KingAkeem commented 1 year ago

@fukusuket

The XMLParsedAsHTMLWarning comes from BeautifulSoup, the HTML parser that was used before gotor. The data collection script hasn't been switched over to gotor yet. Gotor may need to be extended to support some additional functionality, but I haven't investigated yet.

I haven't tested this, but it looks good to me and I don't think it'll break anything.
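
Until that replacement happens, the warning could in principle be filtered, as its own text suggests. The sketch below is not part of this PR and uses only the standard `warnings` module; the helper name `silence_xml_parsed_as_html_warning` is hypothetical, and the filter matches on the opening words of the message seen in the logs above rather than on the warning class.

```python
import re
import warnings

# Opening words of the warning text shown in the logs above.
XML_AS_HTML_PREFIX = "It looks like you're parsing an XML document using an HTML parser"


def silence_xml_parsed_as_html_warning() -> None:
    """Ignore warnings whose message starts with the XMLParsedAsHTMLWarning text."""
    # filterwarnings treats `message` as a regex matched against the start
    # of the warning message, so the prefix is escaped to match it literally.
    warnings.filterwarnings("ignore", message=re.escape(XML_AS_HTML_PREFIX))
```

Calling this helper once before parsing would hide only this message and leave other warnings visible. Whether hiding it is desirable is a separate question, since the warning also hints that an XML parser would be more reliable for feed documents.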

fukusuket commented 1 year ago

Thank you for the prompt review :)

DedSecInside / TorBot