ciscocsirt / malspider

Malspider is a web spidering framework that detects characteristics of web compromises.
BSD 3-Clause "New" or "Revised" License

Limit Scrapy Depth #6

Closed r3comp1le closed 8 years ago

r3comp1le commented 8 years ago

Is there a way to limit the depth and recursion that happens per organization? I just want to scrape the initial homepages for quick checkups.

79617261 commented 8 years ago

+1 to crawl depth

jasheppa5 commented 8 years ago

You can comment out these lines in full_domain_spider.py for a quick fix:

```python
for link in LxmlLinkExtractor(unique=True).extract_links(response):
    if not link.url in self.already_crawled:
        self.already_crawled.add(link.url)
        print "crawling: ", link.url
        yield WebdriverRequest(link.url, callback=self.parse_item)
    else:
        print "avoiding request for: ", link.url
```

A better solution is to create another spider within the malspider project that uses a different ruleset with the follow flag set to False (see the sketch below). In the future I'd like to provide an option in malspider to crawl only a single page, or fork the project and build a version that only crawls one page.
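Here is a minimal sketch of such a spider. The class name, spider name, and start URL are hypothetical, and the imports follow standard Scrapy rather than the scrapy-webdriver stack that full_domain_spider.py uses:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HomepageSpider(CrawlSpider):
    # Hypothetical spider: not part of malspider, shown only to
    # illustrate the follow=False ruleset approach.
    name = "homepage_spider"
    start_urls = ["http://example.com"]  # placeholder target

    # follow=False stops the spider from extracting further links from
    # the pages it visits, so the crawl goes no deeper than the links
    # found on the start pages.
    rules = (
        Rule(LinkExtractor(unique=True), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        # Placeholder callback: just record the URL that was visited.
        yield {"url": response.url}
```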

r3comp1le commented 8 years ago

thanks, that definitely sped things up for me

jasheppa5 commented 8 years ago

Thanks for the feedback. I added functionality to limit the number of pages crawled per domain. The updated code adds a "PAGES_PER_DOMAIN" variable to the malspider/settings.py file: set it to 0 to crawl only the homepage, or to another number x to crawl x pages beyond the homepage (see the example below).
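For example, a homepage-only configuration would look like this in malspider/settings.py (the variable name and semantics are as described above; the rest of the file is omitted):

```python
# malspider/settings.py
# 0 -> crawl only the homepage of each domain
# x -> crawl x pages beyond the homepage
PAGES_PER_DOMAIN = 0
```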

jasheppa5 commented 8 years ago

As a side note, Malspider now crawls 20 pages per domain by default rather than the entire site.