r3comp1le closed this issue 8 years ago
+1 to crawl depth
You can comment out these lines in full_domain_spider.py for a quick fix:
```python
for link in LxmlLinkExtractor(unique=True).extract_links(response):
    if not link.url in self.already_crawled:
        self.already_crawled.add(link.url)
        print "crawling: ", link.url
        yield WebdriverRequest(link.url, callback=self.parse_item)
    else:
        print "avoiding request for: ", link.url
```
A better solution is to create another spider within the malspider project and use a different ruleset that sets the follow flag to False. In the future I'd like to provide an option in malspider to crawl only a single page or fork the project and build a version that only crawls one page.
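A rough sketch of what such a spider could look like (the class name, `start_urls`, and callback body are illustrative, not taken from the malspider source; the import path for `LxmlLinkExtractor` may differ between Scrapy versions):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class HomepageOnlySpider(CrawlSpider):
    # Hypothetical spider that reuses the CrawlSpider rule mechanism but
    # does not recurse: follow=False stops the spider from re-applying the
    # rules to pages reached through extracted links.
    name = "homepage_only_spider"
    start_urls = ["http://example.com"]  # placeholder, replace with monitored domains

    rules = (
        Rule(LxmlLinkExtractor(unique=True), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        # Page analysis / element extraction would go here.
        pass
```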
thanks, that definitely sped things up for me
Thanks for the feedback. I added functionality to limit the number of pages crawled per domain. The updated code includes a "PAGES_PER_DOMAIN" variable in the malspider/settings.py file that can be set to "0" to crawl only the homepage or to another number "x" to crawl "x" pages beyond the homepage.
As a side note, Malspider now crawls 20 pages per domain by default rather than the entire site.
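An illustrative excerpt of what the relevant line in malspider/settings.py would look like, based on the description above (the exact file contents aren't shown in this thread):

```python
# 0 -> crawl only the homepage of each domain
# x -> crawl x pages beyond the homepage
PAGES_PER_DOMAIN = 20  # default mentioned above: 20 pages per domain
```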
Is there a way to limit the depth and recursion that happens per organization? I just want to scrape the initial homepages for quick checkups.