Closed · annetteshajan closed this issue 4 years ago
I installed scrapinghub/splash outside the environment and ran it from Docker. With this, regular websites can be crawled, but I'm still facing an issue with onion URLs. Kindly requesting your help @Terrtia @mokaddem @adulau
When I crawl a particular onion site, this is the output from Crawler.py:
```
url: http://zzq7gpluliw6iq7l.onion/threadlist.php?
domain: zzq7gpluliw6iq7l.onion
domain_url: http://zzq7gpluliw6iq7l.onion
Launching Crawler: http://zzq7gpluliw6iq7l.onion
2020-06-09 16:59:57 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-09 16:59:57 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Apr 18 2020, 01:56:04) - [GCC 8.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-5.3.0-1022-azure-x86_64-with-Ubuntu-18.04-bionic
2020-06-09 16:59:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-06-09 16:59:57 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_PAGECOUNT': 50, 'DEPTH_LIMIT': 1, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.closespider.CloseSpider', 'scrapy.extensions.logstats.LogStats']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-06-09 16:59:57 [scrapy.core.engine] INFO: Spider opened
2020-06-09 16:59:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-09 16:59:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-09 16:59:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://zzq7gpluliw6iq7l.onion via http://0.0.0.0:8050/execute> (referer: None)
2020-06-09 16:59:58 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-09 16:59:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1993,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 182,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.148073,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 6, 9, 16, 59, 58, 124088),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'memusage/max': 70549504,
 'memusage/startup': 70549504,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/200': 1,
 'start_time': datetime.datetime(2020, 6, 9, 16, 59, 57, 976015)}
2020-06-09 16:59:58 [scrapy.core.engine] INFO: Spider closed (finished)
```

[screenshot attachment: network3]
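The telltale symptom in these stats is that Splash answered HTTP 200 but the whole response was only 182 bytes, far too small to be the actual page, so the spider found nothing to follow and closed immediately. A small illustrative sketch of how one might flag this condition from a Scrapy stats dict (the threshold and helper function are my own assumptions, not part of AIL or Scrapy):

```python
# Sketch: sanity-check a Scrapy stats dict like the one dumped above.
# A ~182-byte body from Splash usually means an error page, not real content.
stats = {
    "downloader/response_bytes": 182,
    "downloader/response_count": 1,
    "splash/execute/response_count/200": 1,
}

MIN_PAGE_BYTES = 1024  # heuristic threshold (an assumption, tune as needed)

def looks_like_empty_render(stats):
    """True when Splash answered 200 but the payload is suspiciously small."""
    ok = stats.get("splash/execute/response_count/200", 0) > 0
    tiny = stats.get("downloader/response_bytes", 0) < MIN_PAGE_BYTES
    return ok and tiny

print(looks_like_empty_render(stats))  # True for the stats above
```

When this fires, the usual culprit is Splash not reaching Tor, i.e. a missing or misconfigured proxy profile, which is exactly what the rest of the thread works through.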
Hi @annetteshajan !
You get this message because a screen session is already running. The Docker_Splash screen is launched as root, so you need root privileges to see it: `sudo screen -ls`
> Every time I run `sudo ./bin/torcrawler/launch_splash_crawler.sh -f configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1`,

You need to specify the complete path of the proxy settings:

```
sudo ./bin/torcrawler/launch_splash_crawler.sh -f /home/<my_user>/<path_to_ail>/ail-framework/configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1
```
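For reference, what lives in that `proxy-profiles/` directory is a small INI file in Splash's proxy-profile format. A minimal `default.ini` routing requests through Tor's SOCKS listener might look like the following (the host and port are assumptions about a default local Tor setup, adjust to yours):

```ini
[proxy]
; Tor's SOCKS listener; host/port are assumptions about the local setup
host=localhost
port=9050
type=SOCKS5
```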
> With this, regular websites can be crawled. I'm still facing an issue with onion URLs

If you pass the complete path of the proxy configuration, you should be able to crawl any onion address.
I got it working by running the Splash server locally on the system rather than inside the AIL environment.
> If you pass the complete path of the proxy configuration, you should be able to crawl any onion address.

The scrapinghub/splash documentation says that default.ini needs to exist at /etc/splash/proxy-profiles/, so I just moved my file to that location because I couldn't figure out how to point Splash at a different proxy-profile location.
```
sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/ scrapinghub/splash
```
I used this command, and now onion sites are getting crawled. I will try it with the AIL environment again.
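The `-v` mount above only needs to expose a host directory containing a `default.ini` to the container at `/etc/splash/proxy-profiles/`. As a hedged sketch (the `[proxy]` keys follow Splash's proxy-profile format; the Tor port 9050 and the temp-directory location are assumptions for illustration), such a profile can be generated and sanity-checked with Python's stdlib:

```python
import configparser
import pathlib
import tempfile

# Build a proxy profile pointing at Tor's default SOCKS port (assumption: 9050).
profile = configparser.ConfigParser()
profile["proxy"] = {"host": "localhost", "port": "9050", "type": "SOCKS5"}

# Write it as default.ini inside a proxy-profiles directory
# (here a temp dir for illustration; in practice, the directory you mount).
profile_dir = pathlib.Path(tempfile.mkdtemp()) / "proxy-profiles"
profile_dir.mkdir(parents=True)
profile_path = profile_dir / "default.ini"
with open(profile_path, "w") as f:
    profile.write(f)

# Sanity-check that the file parses back as expected.
check = configparser.ConfigParser()
check.read(profile_path)
print(check["proxy"]["type"])  # SOCKS5
```

You would then mount the generated directory into the Splash container, e.g. `-v <that_dir>:/etc/splash/proxy-profiles/`.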
> You get this message because a screen is already launched. The Docker_Splash screen is launched as root, you need to have root privileges:
> `sudo screen -ls`
You were right about this! Thanks for the help.
Hope my method helps anybody with the same issue.
Hello team, I've been struggling for a couple of days with running the crawlers. Every time I run `sudo ./bin/torcrawler/launch_splash_crawler.sh -f configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1`, it shows: