Closed fukusuket closed 1 year ago
@fukusuket Thanks for noticing it and for the PR! @KingAkeem What do you think?
This looks good to me. If you've tested it and it's working, then I'm fine with merging. If not, I'll check it out sometime this week.
@KingAkeem Thanks for the quick review :) I have tested as follows.
Log level info (the XMLParsedAsHTMLWarning was already there before the fix, so it is unrelated to this change):
fukusuke@fukusukenoAir TorBot % poetry run python run.py --gather
Gathering data for https://thehiddenwiki.org
Processing... | | 3/110/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
warnings.warn(
Processing... |################################| 110/110
Data has been saved to /Users/fukusuke/Scripts/Python/TorBot/data/torbot_2023-01-18T00:44:46.687196.csv.
fukusuke@fukusukenoAir TorBot %
Log level debug:
fukusuke@fukusukenoAir TorBot % export LOG_LEVEL=debug
fukusuke@fukusukenoAir TorBot % poetry run python run.py --gather
Gathering data for https://thehiddenwiki.org
18-Jan-23 00:48:47 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:48 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
18-Jan-23 00:48:48 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:48 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
Processing... | | 1/11018-Jan-23 00:48:48 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:49 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
Processing... | | 2/11018-Jan-23 00:48:49 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:50 - DEBUG - https://thehiddenwiki.org:443 "GET /blog/ HTTP/1.1" 200 None
Processing... | | 3/11018-Jan-23 00:48:50 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:51 - DEBUG - https://thehiddenwiki.org:443 "GET /feed/ HTTP/1.1" 200 None
/Users/fukusuke/Library/Caches/pypoetry/virtualenvs/torbot-xIJhVHKw-py3.10/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
warnings.warn(
Processing... |# | 4/11018-Jan-23 00:48:51 - DEBUG - Starting new HTTPS connection (1): thehiddenwiki.org:443
18-Jan-23 00:48:52 - DEBUG - https://thehiddenwiki.org:443 "GET / HTTP/1.1" 200 None
Processing... |# | 5/11018-Jan-23 00:48:52 - DEBUG - Starting new HTTP connection (1): torproject.org:80
18-Jan-23 00:48:53 - DEBUG - http://torproject.org:80 "GET / HTTP/1.1" 301 299
18-Jan-23 00:48:53 - DEBUG - Starting new HTTPS connection (1): www.torproject.org:443
18-Jan-23 00:48:53 - DEBUG - https://www.torproject.org:443 "GET / HTTP/1.1" 200 4362
Processing... |# | 6/11018-Jan-23 00:48:53 - DEBUG - Starting new HTTP connection (1): s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion:80
18-Jan-23 00:48:53 - DEBUG - HTTPConnectionPool(host='s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1290514b0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
18-Jan-23 00:48:53 - DEBUG - Failed to connect to [http://s4k4ceiapwwgcm3mkb6e4diqecpo7kvdnfr5gg7sph7jjppqkvwwqtyd.onion/].
Processing... |## | 7/11
...
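The debug run above relies on the LOG_LEVEL environment variable. How TorBot actually reads it is not shown in this thread, so the following is only a sketch of one way to do it with the standard library. The name configure_logging is hypothetical, and the format strings are inferred from the timestamps in the output above.

```python
import logging
import os


def configure_logging(default: str = "info") -> int:
    """Set the root logger level from the LOG_LEVEL environment variable.

    Hypothetical helper: not taken from TorBot's source.
    """
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        # Unknown level names fall back to INFO instead of raising.
        level = logging.INFO
    logging.basicConfig(
        level=level,
        # Assumed format, chosen to resemble "18-Jan-23 00:48:47 - DEBUG - ..."
        format="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%d-%b-%y %H:%M:%S",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return level
```

With this sketch, `export LOG_LEVEL=debug` before the run would enable DEBUG messages, and an unset or unrecognised value would leave the level at INFO.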
@fukusuket
The XMLParsedAsHTMLWarning comes from BeautifulSoup, the HTML parser that was used before gotor. The data collection script hasn't been switched over to gotor yet. Gotor may need to be extended to support some additional functionality, but I haven't investigated yet.
I haven't tested this, but it looks good to me and I don't think it'll break anything.
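As the warning text itself notes, it can be filtered if the noise is unwanted. A minimal sketch using only the standard warnings module is below. The helper name silence_xml_as_html_warning is hypothetical and not part of TorBot, and it matches on the message text instead of importing the warning class from bs4.

```python
import warnings


def silence_xml_as_html_warning() -> None:
    """Ignore warnings whose text starts with BeautifulSoup's XML-as-HTML notice.

    Hypothetical helper. The `message` argument is a regular expression
    matched against the start of the warning text.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"It looks like you're parsing an XML document using an HTML parser",
    )
```

Calling this once before the crawl starts would hide only this notice and leave other warnings visible.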
Thank you for the prompt review :)
Hello, thank you for maintaining the tool :)
When crawling URLs with the --gather option, you cannot be sure that all URLs are reachable. Therefore, I changed it to continue crawling the other URLs even if there is a connection error.
Changes Proposed
Continue processing when a connection error occurs with the --gather option.
Output "Failed to connect to [url]." for each URL that could not be reached.
Explanation of Changes
When executing with the --gather option, if even one URL cannot be reached, the program exits with an exception as follows.
Screenshots of new feature/change
This PR outputs the URL that had the connection error and continues further processing as follows.
I would appreciate it if you could review. Regards.
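The behaviour described in this PR (log the unreachable URL, then keep going) can be sketched roughly as follows. This is not the actual diff: gather, fetch, and the logger name are hypothetical, and the fetch function is passed in so the loop can be tried without network access.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger("gather_sketch")  # hypothetical logger name


def gather(urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch every URL, skipping those that cannot be reached."""
    results: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError:
            # The built-in ConnectionError is a subclass of OSError, and so is
            # requests.exceptions.ConnectionError (via RequestException).
            logger.debug("Failed to connect to [%s].", url)
            continue
    return results
```

Catching the error inside the loop, instead of around it, is what lets one unreachable .onion address be skipped while the remaining URLs are still processed.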