ciscocsirt / malspider

Malspider is a web spidering framework that detects characteristics of web compromises.
BSD 3-Clause "New" or "Revised" License

A problem in full_domain_spider #4

Closed xaled closed 8 years ago

xaled commented 8 years ago

I think there is a problem with this code in full_domain_spider:

for link in LxmlLinkExtractor(unique=True).extract_links(response):
    if not response.url in self.already_crawled:
        self.already_crawled.add(link.url)
        yield WebdriverRequest(link.url, callback=self.parse_item)
    else:
        print "avoiding request for: ", response.url

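When yielding requests to Scrapy, the spider checks whether response.url has already been crawled, instead of checking link.url. Because response.url is the page currently being parsed, not the link about to be requested, this check cannot filter out links that were already queued.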

I think the code should be:

for link in LxmlLinkExtractor(unique=True).extract_links(response):
    if link.url not in self.already_crawled:
        self.already_crawled.add(link.url)
        yield WebdriverRequest(link.url, callback=self.parse_item)
    else:
        print "avoiding request for: ", link.url
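
To illustrate the difference, here is a minimal standalone sketch that does not use Scrapy at all. The function `new_requests`, its `buggy` flag, and the `example.test` URLs are hypothetical names made up for this illustration; only the membership test mirrors the two snippets above.

```python
def new_requests(page_url, link_urls, already_crawled, buggy=False):
    """Return the link URLs that would be requested while parsing one page.

    Hypothetical helper: it imitates only the membership test from the
    spider loop, with plain strings in place of Scrapy responses and links.
    """
    scheduled = []
    for link_url in link_urls:
        # Buggy variant tests the page's own URL; fixed variant tests the link.
        key = page_url if buggy else link_url
        if key not in already_crawled:
            already_crawled.add(link_url)
            scheduled.append(link_url)
    return scheduled


# Assumed toy site: the start page links to /a and /b; /a links to /b and /c.
START = "http://example.test/"
A = "http://example.test/a"
B = "http://example.test/b"
C = "http://example.test/c"

seen_fixed = set()
fixed = new_requests(START, [A, B], seen_fixed)
fixed += new_requests(A, [B, C], seen_fixed)

seen_buggy = set()
buggy = new_requests(START, [A, B], seen_buggy, buggy=True)
buggy += new_requests(A, [B, C], seen_buggy, buggy=True)

print(fixed)  # /a, /b, then only the new link /c
print(buggy)  # /a, /b; every link found on /a is skipped
```

In this toy run the fixed check skips the repeated /b link but still follows /c. The buggy check skips both, because /a itself was added to the set when it was first queued, so every link found on that page is rejected.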

jasheppa5 commented 8 years ago

Thank you for pointing this out! You are absolutely correct. I updated the code and pushed the changes to GitHub.