Hey Alex,
I have a few ideas:
I think you have an outdated PhantomJS version. Let's start there. I was able to crawl the site in question just fine, minus a character encoding issue I'm working on...
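For a quick check outside of Scrapy, you could try loading the stalled page with PhantomJS directly through Selenium (which, if I remember right, is how the WebdriverSpiderMiddleware drives it). This is just a sketch; the URL is the one from your log, and the exact capabilities key can vary by Selenium release:

```python
# Minimal reproduction sketch: if this also hangs, the problem is
# PhantomJS itself rather than the spider.
from selenium import webdriver

driver = webdriver.PhantomJS()              # assumes phantomjs is on PATH
print(driver.capabilities.get('version'))   # should report 2.x, not 1.9.x
driver.set_page_load_timeout(60)            # fail fast instead of hanging
driver.get('https://www.des.gov.ge')
print(driver.title)
driver.quit()
```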
Keep me updated. Let me know if updating PhantomJS works. If not we'll keep debugging.
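In the meantime, as a stopgap so a stalled crawl doesn't run the full 24 hours, something like this in the project's settings.py might help (values are illustrative; your log shows the CloseSpider extension is already enabled):

```python
# Illustrative timeouts; tune to taste.
CLOSESPIDER_TIMEOUT = 3600   # close the spider after one hour of runtime
DOWNLOAD_TIMEOUT = 180       # give up on any single request after 3 minutes
```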
-James
On Tue, Oct 18, 2016 at 4:02 AM, Alex Shatberashvili <notifications@github.com> wrote:
```
2016-10-13 00:00:19+0400 [scrapy] INFO: Scrapy 0.24.4 started (bot: full_domain)
2016-10-13 00:00:19+0400 [scrapy] INFO: Optional features available: ssl, http11, django
2016-10-13 00:00:19+0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'malspider.spiders', 'SPIDER_MODULES': ['malspider.spiders'], 'LOG_FILE': 'logs/malspider/full_domain/76f0156890b611e6959d005056ae7ab0.log', 'USER_AGENT': 'Mozilla/5.0 (Android; Tablet; rv:30.0) Gecko/30.0 Firefox/30.0', 'BOT_NAME': 'full_domain'}
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgentMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, WebdriverSpiderMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled item pipelines: DuplicateFilterPipeline, WhitelistFilterPipeline, MySQLPipeline
2016-10-13 00:00:19+0400 [full_domain] INFO: Spider opened
2016-10-13 00:00:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:00:19+0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6026
2016-10-13 00:00:19+0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6083
2016-10-13 00:00:19+0400 [scrapy] DEBUG: Downloading https://www.des.gov.ge with webdriver
2016-10-13 00:01:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:02:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:03:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:04:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:05:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:06:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:07:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:08:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:09:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:10:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:11:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:12:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:13:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:14:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:15:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:16:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
```
This goes on for 24 hours, at which point the task is killed with SIGKILL. It seems to happen with all the domains I try to monitor, but not every time. Any idea why this keeps happening?
James,
I'm happy to report that PhantomJS was indeed behind this. For some reason, the install_dependencies script set up the 1.9 version, so I ended up reinstalling it via npm. Are there any tradeoffs between the standalone version and the npm version with regard to Malspider?
Hey Alex,
I'm glad Malspider is working for you now. I don't think there is a difference between (or advantage to) the standalone version and the npm version. I install directly from the PhantomJS author's bitbucket page (https://bitbucket.org/ariya/phantomjs/) and manually add phantomjs to /usr/bin/. I don't immediately remember why I did it this way, but I think it had something to do with package managers not adding phantomjs to the user's PATH correctly or not updating older phantomjs instances properly.
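If you want to double-check which binary you actually ended up with, a quick throwaway script (not part of Malspider, just a sketch) would show whether a stale 1.9 install is still shadowing the new one:

```python
# Hypothetical sanity check: which phantomjs does the PATH resolve to,
# and what version is it?
import subprocess
from distutils.spawn import find_executable

path = find_executable('phantomjs')
print('phantomjs resolved to:', path)
print('version:', subprocess.check_output([path, '--version']).strip())
```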