Closed · annetteshajan closed this issue 4 years ago
I installed scrapinghub/splash outside the environment and ran it from Docker. With this, regular websites can be crawled, but I'm still facing an issue with onion URLs. Kindly requesting your help @Terrtia @mokaddem @adulau
When I crawl a particular onion site, this is the output from Crawler.py:
```
url: http://zzq7gpluliw6iq7l.onion/threadlist.php?
domain: zzq7gpluliw6iq7l.onion
domain_url: http://zzq7gpluliw6iq7l.onion
Launching Crawler: http://zzq7gpluliw6iq7l.onion
2020-06-09 16:59:57 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-09 16:59:57 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Apr 18 2020, 01:56:04) - [GCC 8.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-5.3.0-1022-azure-x86_64-with-Ubuntu-18.04-bionic
2020-06-09 16:59:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-06-09 16:59:57 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_PAGECOUNT': 50, 'DEPTH_LIMIT': 1, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.closespider.CloseSpider', 'scrapy.extensions.logstats.LogStats']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-09 16:59:57 [scrapy.middleware] INFO: Enabled item pipelines: []
2020-06-09 16:59:57 [scrapy.core.engine] INFO: Spider opened
2020-06-09 16:59:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-09 16:59:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-09 16:59:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://zzq7gpluliw6iq7l.onion via http://0.0.0.0:8050/execute> (referer: None)
2020-06-09 16:59:58 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-09 16:59:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1993,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 182,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.148073,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 6, 9, 16, 59, 58, 124088),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'memusage/max': 70549504,
 'memusage/startup': 70549504,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/200': 1,
 'start_time': datetime.datetime(2020, 6, 9, 16, 59, 57, 976015)}
2020-06-09 16:59:58 [scrapy.core.engine] INFO: Spider closed (finished)
```

[screenshot attachment: network3]
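The telltale symptom in these stats is that Splash answered HTTP 200 but the whole response was only 182 bytes, far too small to be the actual page, so the spider found nothing to follow and closed immediately. A small illustrative sketch of how one might flag this condition from a Scrapy stats dict (the threshold and helper function are my own assumptions, not part of AIL or Scrapy):

```python
# Sketch: sanity-check a Scrapy stats dict like the one dumped above.
# A ~182-byte body from Splash usually means an error page, not real content.
stats = {
    "downloader/response_bytes": 182,
    "downloader/response_count": 1,
    "splash/execute/response_count/200": 1,
}

MIN_PAGE_BYTES = 1024  # heuristic threshold (an assumption, tune as needed)

def looks_like_empty_render(stats):
    """True when Splash answered 200 but the payload is suspiciously small."""
    ok = stats.get("splash/execute/response_count/200", 0) > 0
    tiny = stats.get("downloader/response_bytes", 0) < MIN_PAGE_BYTES
    return ok and tiny

print(looks_like_empty_render(stats))  # True for the stats above
```

When this fires, the usual culprit is Splash not reaching Tor, i.e. a missing or misconfigured proxy profile, which is exactly what the rest of the thread works through.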
Hi @annetteshajan !
You get this message because a screen session is already running. The Docker_Splash screen is launched as root, so you need root privileges to see it: `sudo screen -ls`
> Every time I run `sudo ./bin/torcrawler/launch_splash_crawler.sh -f configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1`,

You need to specify the complete path of the proxy settings:

```
sudo ./bin/torcrawler/launch_splash_crawler.sh -f /home/<my_user>/<path_to_ail>/ail-framework/configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1
```
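For reference, what lives in that `proxy-profiles/` directory is a small INI file in Splash's proxy-profile format. A minimal `default.ini` routing requests through Tor's SOCKS listener might look like the following (the host and port are assumptions about a default local Tor setup, adjust to yours):

```ini
[proxy]
; Tor's SOCKS listener; host/port are assumptions about the local setup
host=localhost
port=9050
type=SOCKS5
```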
> With this, regular websites can be crawled. I'm still facing an issue with onion URLs

If you pass the complete path of the proxy configuration, you should be able to crawl any onion address.
I got it working by running the Splash server locally on the system rather than inside the AIL environment.
> If you pass the complete path of the proxy configuration, you should be able to crawl any onion address.

The scrapinghub/splash documentation says that default.ini needs to exist at /etc/splash/proxy-profiles/, so I just moved my file to that location because I couldn't figure out how to point Splash at a different proxy-profile location.
```
sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/ scrapinghub/splash
```
I used this command, and now onion sites are getting crawled. I will try it with the AIL environment again.
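The `-v` mount above only needs to expose a host directory containing a `default.ini` to the container at `/etc/splash/proxy-profiles/`. As a hedged sketch (the `[proxy]` keys follow Splash's proxy-profile format; the Tor port 9050 and the temp-directory location are assumptions for illustration), such a profile can be generated and sanity-checked with Python's stdlib:

```python
import configparser
import pathlib
import tempfile

# Build a proxy profile pointing at Tor's default SOCKS port (assumption: 9050).
profile = configparser.ConfigParser()
profile["proxy"] = {"host": "localhost", "port": "9050", "type": "SOCKS5"}

# Write it as default.ini inside a proxy-profiles directory
# (here a temp dir for illustration; in practice, the directory you mount).
profile_dir = pathlib.Path(tempfile.mkdtemp()) / "proxy-profiles"
profile_dir.mkdir(parents=True)
profile_path = profile_dir / "default.ini"
with open(profile_path, "w") as f:
    profile.write(f)

# Sanity-check that the file parses back as expected.
check = configparser.ConfigParser()
check.read(profile_path)
print(check["proxy"]["type"])  # SOCKS5
```

You would then mount the generated directory into the Splash container, e.g. `-v <that_dir>:/etc/splash/proxy-profiles/`.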
> You get this message because a screen is already launched. The Docker_Splash screen is launched as root, you need to have root privileges:
> `sudo screen -ls`
You were right about this! Thanks for the help.
Hope my method helps anybody with the same issue.
Hello team, I've been struggling for a couple of days with running the crawlers. Every time I run `sudo ./bin/torcrawler/launch_splash_crawler.sh -f configs/docker/splash_onion/etc/splash/proxy-profiles/ -p 8050 -n 1`, it shows: