ciscocsirt / malspider

Malspider is a web spidering framework that detects characteristics of web compromises.
BSD 3-Clause "New" or "Revised" License
417 stars · 78 forks

Tasks running indefinitely #13

Closed · wrinkl3 closed this issue 8 years ago

wrinkl3 commented 8 years ago

```
2016-10-13 00:00:19+0400 [scrapy] INFO: Scrapy 0.24.4 started (bot: full_domain)
2016-10-13 00:00:19+0400 [scrapy] INFO: Optional features available: ssl, http11, django
2016-10-13 00:00:19+0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'malspider.spiders', 'SPIDER_MODULES': ['malspider.spiders'], 'LOG_FILE': 'logs/malspider/full_domain/76f0156890b611e6959d005056ae7ab0.log', 'USER_AGENT': 'Mozilla/5.0 (Android; Tablet; rv:30.0) Gecko/30.0 Firefox/30.0', 'BOT_NAME': 'full_domain'}
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgentMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, WebdriverSpiderMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-10-13 00:00:19+0400 [scrapy] INFO: Enabled item pipelines: DuplicateFilterPipeline, WhitelistFilterPipeline, MySQLPipeline
2016-10-13 00:00:19+0400 [full_domain] INFO: Spider opened
2016-10-13 00:00:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:00:19+0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6026
2016-10-13 00:00:19+0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6083
2016-10-13 00:00:19+0400 [scrapy] DEBUG: Downloading https://www.des.gov.ge with webdriver
2016-10-13 00:01:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:02:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:03:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:04:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:05:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:06:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:07:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:08:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:09:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:10:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:11:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:12:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:13:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:14:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:15:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-13 00:16:19+0400 [full_domain] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
```

This goes on for 24 hours, at which point the task is SIGKILLed. It seems to happen with every domain I try to monitor, though not every time. Any idea why this keeps happening?

jasheppa5 commented 8 years ago

Hey Alex,

I have a few ideas:

I think you have an outdated PhantomJS version. Let's start there. I was able to crawl the site in question just fine, minus a character encoding issue I'm working on...

Keep me updated. Let me know if updating PhantomJS works. If not we'll keep debugging.

-James
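To act on the suggestion above, one quick sanity check is to confirm the installed PhantomJS is a 2.x build. This is a minimal sketch: the helper name and the "major version >= 2" cutoff are illustrative assumptions, not part of Malspider.

```shell
# Hypothetical helper: succeed only if the given PhantomJS version string
# (e.g. "2.1.1") has a major version of 2 or higher.
check_phantomjs_version() {
  major="${1%%.*}"     # text before the first dot, e.g. "1" from "1.9.8"
  [ "$major" -ge 2 ]
}

# Typical use, assuming phantomjs is on the PATH:
#   check_phantomjs_version "$(phantomjs --version)" || echo "PhantomJS is too old"
```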


wrinkl3 commented 8 years ago

James,

I'm happy to report that PhantomJS was indeed behind this. For some reason, the install_dependencies script set up version 1.9, so I ended up reinstalling it via npm. Are there any tradeoffs between the standalone version and the npm version with regard to Malspider?
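For reference, the npm reinstall described above might look roughly like this. It is wrapped in a function so nothing runs until you call it; the package name `phantomjs-prebuilt` and the global `-g` install are assumptions about the setup (on older npm registries the package was published as `phantomjs`).

```shell
# Sketch of reinstalling PhantomJS via npm (requires Node.js/npm).
# Defined as a function so it can be reviewed before being run.
reinstall_phantomjs_npm() {
  npm uninstall -g phantomjs 2>/dev/null  # drop any old global install, if present
  npm install -g phantomjs-prebuilt       # npm package wrapping the 2.x binary
  phantomjs --version                     # confirm a 2.x version is now on the PATH
}
```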

jasheppa5 commented 8 years ago

Hey Alex,

I'm glad Malspider is working for you now. I don't think there is a difference (or advantage) to using the standalone version vs the npm version. I install directly from the PhantomJS author's bitbucket page ( https://bitbucket.org/ariya/phantomjs/) and manually add phantomjs to /usr/bin/. I don't immediately remember why I did it this way, but I think it had something to do with package managers not adding phantomjs to the user's PATH correctly or not updating older phantomjs instances properly.
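A minimal sketch of that manual route, assuming a 64-bit Linux host and the 2.1.1 release tarball (the version, architecture, and function name are illustrative; adjust them to whatever is current on the release page). It is defined as a function so nothing downloads until you call it.

```shell
# Sketch of the manual install described above: fetch the release tarball
# from the PhantomJS downloads page and copy the binary into /usr/bin.
install_phantomjs_manually() {
  ver="2.1.1"                          # assumed 2.x release
  pkg="phantomjs-${ver}-linux-x86_64"  # assumed platform/architecture
  curl -LO "https://bitbucket.org/ariya/phantomjs/downloads/${pkg}.tar.bz2"
  tar xjf "${pkg}.tar.bz2"
  sudo cp "${pkg}/bin/phantomjs" /usr/bin/
  phantomjs --version                  # confirm the new binary is picked up
}
```

Copying into /usr/bin manually sidesteps the PATH and stale-version problems with package managers mentioned above, at the cost of handling future updates yourself.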
