CIRCL / AIL-framework

AIL framework - Analysis Information Leak framework. Project moved to https://github.com/ail-project
https://github.com/ail-project/ail-framework
GNU Affero General Public License v3.0

Splash Servers down Error #512

Closed: annetteshajan closed this issue 4 years ago

annetteshajan commented 4 years ago

Hello team, I've been struggling for a couple of days with running the crawlers. Every time I run `sudo ./bin/torcrawler/launch_splash_crawler.sh -f configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1`, it shows:

annetteshajan commented 4 years ago

I installed scrapinghub/splash outside the environment and ran it from Docker. With this, regular websites can be crawled, but I'm still facing an issue with onion URLs. Kindly requesting your help @Terrtia @mokaddem @adulau
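Before crawling, it can help to confirm the standalone Splash container is actually up and answering. A minimal check, assuming Splash listens on localhost:8050 as in the launch command above, using Splash's documented `_ping` endpoint:

```sh
# Sanity-check that Splash is reachable; assumes port 8050 on localhost.
curl http://localhost:8050/_ping
# A healthy instance replies with a small JSON body, e.g. {"status": "ok", ...}
```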

annetteshajan commented 4 years ago

When I crawl a particular onion site, this is the output from Crawler.py:

```
------------------START CRAWLER------------------
crawler type: onion

url: http://zzq7gpluliw6iq7l.onion/threadlist.php?
domain: zzq7gpluliw6iq7l.onion
domain_url: http://zzq7gpluliw6iq7l.onion

Launching Crawler: http://zzq7gpluliw6iq7l.onion
2020-06-09 16:59:57 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-09 16:59:57 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Apr 18 2020, 01:56:04) - [GCC 8.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-5.3.0-1022-azure-x86_64-with-Ubuntu-18.04-bionic
2020-06-09 16:59:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-06-09 16:59:57 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_PAGECOUNT': 50, 'DEPTH_LIMIT': 1, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.closespider.CloseSpider', 'scrapy.extensions.logstats.LogStats']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-06-09 16:59:57 [scrapy.core.engine] INFO: Spider opened
2020-06-09 16:59:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-09 16:59:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-09 16:59:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://zzq7gpluliw6iq7l.onion via http://0.0.0.0:8050/execute> (referer: None)
2020-06-09 16:59:58 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-09 16:59:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 1993, 'downloader/request_count': 1, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 182, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'elapsed_time_seconds': 0.148073, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2020, 6, 9, 16, 59, 58, 124088), 'log_count/DEBUG': 1, 'log_count/INFO': 10, 'memusage/max': 70549504, 'memusage/startup': 70549504, 'response_received_count': 1, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'splash/execute/request_count': 1, 'splash/execute/response_count/200': 1, 'start_time': datetime.datetime(2020, 6, 9, 16, 59, 57, 976015)}
2020-06-09 16:59:58 [scrapy.core.engine] INFO: Spider closed (finished)
```

(attached screenshot: network3)
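The crawl finishes with only 182 response bytes, which suggests Splash returned a small error body rather than the actual page. One way to test the onion fetch through Splash directly is its render.html endpoint; this sketch assumes Splash is on localhost:8050 and the proxy profile is named `default` (Splash applies a `default` profile automatically when present, but naming it explicitly makes the test unambiguous):

```sh
# Render the onion page through the 'default' proxy profile. If the profile
# was not mounted into the container, Splash connects directly and the
# .onion domain cannot resolve.
curl 'http://localhost:8050/render.html?url=http://zzq7gpluliw6iq7l.onion/&proxy=default&timeout=30'
```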

Terrtia commented 4 years ago

Hi @annetteshajan! You get this message because a screen is already running. The Docker_Splash screen is launched as root, so you need root privileges to see it: `sudo screen -ls`
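To see (and, if needed, clear) that session, a short sketch; the session name `Docker_Splash` comes from the thread, and the cleanup line is a hypothetical extra step, not something the launch script does for you:

```sh
# Screens started as root are invisible to a plain 'screen -ls':
sudo screen -ls
# If a stale Docker_Splash session is blocking a relaunch, terminate it:
sudo screen -S Docker_Splash -X quit
```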

> Every time I run `sudo ./bin/torcrawler/launch_splash_crawler.sh -f configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1`,

You need to specify the complete path of the proxy settings:

```sh
sudo ./bin/torcrawler/launch_splash_crawler.sh -f /home/<my_user>/<path_to_ail>/ail-framework/configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1
```
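Equivalently, when running from the root of the ail-framework checkout, the absolute path can be expanded with `$(pwd)`; a sketch of the same invocation, not a new option:

```sh
# Same command, with the proxy-profile path made absolute via $(pwd):
sudo ./bin/torcrawler/launch_splash_crawler.sh \
  -f "$(pwd)/configs/docker/splash_onion/etc/splash/proxy-profiles/" \
  -p 8050 -n 1
```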

> With this, regular websites can be crawled. I'm still facing an issue with onion URLs

If you pass the complete path of the proxy configuration, you should be able to crawl any onion address.

annetteshajan commented 4 years ago

I got it working by running the Splash server locally on the system rather than inside the AIL environment.

> If you pass the complete path of the proxy configuration, you should be able to crawl any onion address.

The scrapinghub/splash documentation says that default.ini needs to exist at /etc/splash/proxy-profiles/, so I just moved my file to that location because I couldn't figure out how to point Splash at a different proxy-profile directory.

```sh
sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/ scrapinghub/splash
```

I used this command. Now onion sites are getting crawled. I will try to do it with the AIL environment again.
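For reference, a minimal sketch of what that default.ini can contain, following the proxy-profile format from the Splash documentation; the host address (the usual docker0 bridge IP) and Tor's SOCKS port 9050 are assumptions to adjust for your setup:

```sh
# Hypothetical proxy profile routing Splash through a local Tor SOCKS proxy.
sudo mkdir -p /etc/splash/proxy-profiles
sudo tee /etc/splash/proxy-profiles/default.ini > /dev/null <<'EOF'
[proxy]
; Tor SOCKS proxy as seen from inside the container (assumed values)
host=172.17.0.1
port=9050
type=SOCKS5
EOF
```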

> You get this message because a screen is already running. The Docker_Splash screen is launched as root, so you need root privileges to see it: `sudo screen -ls`

You were right about this! Thanks for the help.

Hope my method helps anybody with the same issue.