CIRCL / AIL-framework

AIL framework - Analysis Information Leak framework. Project moved to https://github.com/ail-project
https://github.com/ail-project/ail-framework
GNU Affero General Public License v3.0

Web pages "down" when crawling (Crawler_AIL exception) #580

Closed: alexisteleco closed this issue 2 years ago

alexisteleco commented 2 years ago

(Not sure whether this is related to issue #490)

Context

I installed and configured AIL, AIL-Splash-Manager, Tor, I2P and Squid, but all crawled domains seem to be "down" when they really aren't. I use the Tor proxy provided by the Tor Project.

As far as I know, Tor is working as expected:

```shell
curl --socks5 localhost:9050 --socks5-hostname localhost:9050 -s https://check.torproject.org/ | cat | grep -m 1 Congratulations | xargs
# Congratulations. This browser is configured to use Tor.
```

The Squid proxy also works:

```shell
curl --proxy "http://172.17.0.1:3128" "http://google.it"
# <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
# <TITLE>301 Moved</TITLE></HEAD><BODY>
# <H1>301 Moved</H1>
# The document has moved
# <A HREF="http://www.google.it/">here</A>.
# </BODY></HTML>
```

Configurations can be seen here:

ail-splash-manager/config/proxies_profiles.cfg

```cfg
# Tor: torrc default proxy
# use The torproject proxy https://2019.www.torproject.org/docs/debian
# (up to date, solve issues with v3 onion addresses)
# proxy name
[default_tor]
# proxy host
host=172.17.0.1
# proxy port
port=9050
# proxy type
type=SOCKS5
# proxy description
description=tor default proxy
# crawler type (tor or i2p or web)
crawler_type=tor

# SQUID proxy
[web_proxy]
host=172.17.0.1
port=3128
type=HTTP
description=web proxy
crawler_type=web

# I2P proxy
[default_i2p]
host=127.0.0.1
port=4444
type=HTTP
description=i2p default proxy
crawler_type=i2p
```
ail-splash-manager/config/containers.cfg

```cfg
# proxy_name: proxy name (defined in proxies_profiles.cfg)
# port: single port or port range (ex: 8050 or 8050-8052)
# cpu: max number of cpu allocated
# memory: RAM (G) allocated
# maxrss: max unbound in-memory cache (Mb, Restart Splash when full)
# description: docker description
[default_splash_tor]
proxy_name=default_tor
port=8050-8052
cpu=1
memory=1
maxrss=2000
description=default splash tor
net=bridge

# Splash with SQUID proxy
[web_splash]
proxy_name=web_proxy
port=8060
cpu=1
memory=1
maxrss=2000
description=web splash (squid)
net=bridge

# Splash with I2P proxy
[default_splash_i2p]
# section name: splash name
proxy_name=default_i2p
port=8053-8055
cpu=1
memory=1
maxrss=2000
description=default splash i2p
net=host
```
/etc/squid/conf.d/ail.conf

```conf
#
# Squid configuration settings for Ail
#
acl localnet src 172.17.0.0/16 # Docker IP range
http_access allow localnet
```
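These profile files are plain INI files, so one quick way to sanity-check that a profile parses the way you expect is Python's standard `configparser`. A minimal sketch (it reuses the `[default_tor]` section from the file above; AIL's actual loader may read the file differently):

```python
import configparser

# Parse a proxies_profiles.cfg-style fragment (sketch only; AIL's real
# loader lives in ail-splash-manager and may apply extra validation).
config = configparser.ConfigParser()
config.read_string("""
[default_tor]
host=172.17.0.1
port=9050
type=SOCKS5
description=tor default proxy
crawler_type=tor
""")

profile = config["default_tor"]
# getint() catches a common misconfiguration: a non-numeric port value.
print(profile["host"], profile.getint("port"), profile["crawler_type"])
```

If a section or key is misspelled, `configparser` raises `KeyError` here instead of the crawler failing later with an opaque "down" status.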

Errors

With onion sites (using Tor)

On the Crawler_AIL screen. This happens when I run a web crawler through the web GUI:

```
------------------START CRAWLER------------------
crawler type: onion
-------------------------------------------------
url: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
domain: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
domain_url: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
Launching Crawler: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.onion
2022-04-19 15:17:43 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-19 15:17:43 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Linux-5.13.0-1021-aws-x86_64-with-glibc2.29
2022-04-19 15:17:43 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_PAGECOUNT': 1, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0'}
2022-04-19 15:17:43 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxxxxxxxxxxxx
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.closespider.CloseSpider', 'scrapy.extensions.logstats.LogStats']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-19 15:17:43 [scrapy.middleware] INFO: Enabled item pipelines: []
2022-04-19 15:17:43 [scrapy.core.engine] INFO: Spider opened
2022-04-19 15:17:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-19 15:17:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-19 15:17:43 [py.warnings] WARNING: /opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/request.py:42: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)
2022-04-19 15:18:10 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-04-19 15:18:10 [scrapy.core.scraper] ERROR: Spider error processing (referer: None)
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
    yield next(it)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/middleware.py", line 162, in process_spider_output
    for el in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/bin/torcrawler/TorSplashCrawler.py", line 251, in parse
    crawlers.add_domain_root_item(item_id, self.domain_type, self.domains[0], self.date_epoch, self.port)
  File "/opt/ail-framework/bin/lib/crawlers.py", line 808, in add_domain_root_item
    r_serv_onion.zadd('crawler_history_{}:{}:{}'.format(domain_type, domain, port), epoch_date, root_item)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/redis/commands/core.py", line 3920, in zadd
    for pair in mapping.items():
AttributeError: 'int' object has no attribute 'items'
2022-04-19 15:18:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 2194, 'downloader/request_count': 1, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 335359, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 26.840496, 'finish_reason': 'closespider_pagecount', 'finish_time': datetime.datetime(2022, 4, 19, 15, 18, 10, 312840), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 10, 'log_count/WARNING': 1, 'memusage/max': 84561920, 'memusage/startup': 84561920, 'response_received_count': 1, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'spider_exceptions/AttributeError': 1, 'splash/execute/request_count': 1, 'splash/execute/response_count/200': 1, 'start_time': datetime.datetime(2022, 4, 19, 15, 17, 43, 472344)}
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)
2022-04-19 15:18:10 [scrapy.core.engine] INFO: Error while scheduling new request
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/engine.py", line 187, in <lambda>
    d.addBoth(lambda _: self.slot.nextcall.schedule())
AttributeError: 'NoneType' object has no attribute 'nextcall'
```

Crawling regular sites (using Squid)

In the Crawler_AIL screen:

```
SPLASH Name: web_splash
Proxy Name: web_proxy
Crawler Type: web
deque(['web_splash', 'regular'])
splash url: 127.0.0.1:8060
------------------START CRAWLER------------------
crawler type: regular
-------------------------------------------------
url: http://google.co.jp
domain: google.co.jp
domain_url: http://google.co.jp
Launching Crawler: http://google.co.jp
2022-04-19 15:09:24 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-19 15:09:24 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.8.10 (default, Mar 15 2022, 12:22:08) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Linux-5.13.0-1021-aws-x86_64-with-glibc2.29
2022-04-19 15:09:24 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_PAGECOUNT': 1, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0'}
2022-04-19 15:09:24 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxxxxxxxxxxxx
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.closespider.CloseSpider', 'scrapy.extensions.logstats.LogStats']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-19 15:09:24 [scrapy.middleware] INFO: Enabled item pipelines: []
2022-04-19 15:09:24 [scrapy.core.engine] INFO: Spider opened
2022-04-19 15:09:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-19 15:09:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-19 15:09:24 [py.warnings] WARNING: /opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/request.py:42: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)
2022-04-19 15:09:36 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-04-19 15:09:36 [scrapy.core.scraper] ERROR: Spider error processing (referer: None)
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
    yield next(it)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy_splash/middleware.py", line 162, in process_spider_output
    for el in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/opt/ail-framework/bin/torcrawler/TorSplashCrawler.py", line 251, in parse
    crawlers.add_domain_root_item(item_id, self.domain_type, self.domains[0], self.date_epoch, self.port)
  File "/opt/ail-framework/bin/lib/crawlers.py", line 808, in add_domain_root_item
    r_serv_onion.zadd('crawler_history_{}:{}:{}'.format(domain_type, domain, port), epoch_date, root_item)
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/redis/commands/core.py", line 3920, in zadd
    for pair in mapping.items():
AttributeError: 'int' object has no attribute 'items'
2022-04-19 15:09:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 2144, 'downloader/request_count': 1, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 3220010, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 11.858315, 'finish_reason': 'closespider_pagecount', 'finish_time': datetime.datetime(2022, 4, 19, 15, 9, 36, 401577), 'log_count/DEBUG': 1, 'log_count/ERROR': 1, 'log_count/INFO': 10, 'log_count/WARNING': 1, 'memusage/max': 84520960, 'memusage/startup': 84520960, 'response_received_count': 1, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'spider_exceptions/AttributeError': 1, 'splash/execute/request_count': 1, 'splash/execute/response_count/200': 1, 'start_time': datetime.datetime(2022, 4, 19, 15, 9, 24, 543262)}
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)
2022-04-19 15:09:36 [scrapy.core.engine] INFO: Error while scheduling new request
Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/opt/ail-framework/AILENV/lib/python3.8/site-packages/scrapy/core/engine.py", line 187, in <lambda>
    d.addBoth(lambda _: self.slot.nextcall.schedule())
AttributeError: 'NoneType' object has no attribute 'nextcall'
```

How can I solve these errors, or at least continue debugging?

Thanks in advance

alexisteleco commented 2 years ago

It is solved!

The problem was a bug in Scrapy introduced after version 2.5.1; it should be fixed in version 2.6.2, as mentioned here.

As of right now (22 days after that comment), version 2.6.2 is not yet available via pip install scrapy==2.6.2, so until it is released the workaround is either to install directly from the branch or to downgrade to version 2.5.1. The downgrade procedure is shown below.

```shell
# Don't forget to activate the AIL virtual environment
cd ail-framework/AILENV/bin/ && source activate
pip install scrapy==2.5.1
# Just in case you had the previous one
pip install scrapy-splash==0.8.0
```
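After downgrading, it is worth confirming which versions actually ended up in the virtualenv before restarting the crawler. A small sketch using the standard library's importlib.metadata (available in Python 3.8+, which AIL uses here):

```python
from importlib import metadata

def installed_version(pkg):
    """Return the installed version string of pkg, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# Run this inside the AIL virtualenv; after the downgrade, scrapy
# should report 2.5.1 and scrapy-splash should report 0.8.0.
for pkg in ("scrapy", "scrapy-splash"):
    print(pkg, installed_version(pkg))
```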