binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0
16.47k stars, 3.69k forks

Full scraping #recursively #129

vortex14 closed this issue 9 years ago

vortex14 commented 9 years ago

How can I crawl an entire site, following links until there are no new ones left? Could you show a simple example?

binux commented 9 years ago

I don't understand. Could you please explain more specifically?

vortex14 commented 9 years ago

For example, in the Scrapy framework: rules = [Rule(LinkExtractor(allow=['/*']), callback="parse_page", follow=True)]

Here follow=True means the crawler keeps following links for as long as pages contain them.

How can I do the same with pyspider?

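binux commented 9 years ago

    def parse_page(self, response):
        # Queue every http(s) link on the page and handle it with this same callback,
        # so the crawl keeps going for as long as pages contain links.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.parse_page)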
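
To illustrate why a callback that re-queues its own links behaves like follow=True, here is a minimal standard-library sketch of the same idea. It does not use pyspider's API: PAGES, LinkParser and crawl_all are hypothetical names, and the in-memory PAGES dict stands in for real HTTP responses. The explicit seen set plays the role that pyspider's task de-duplication plays (by default it should skip a URL that has already been scheduled), which is what lets the crawl stop once no new links turn up.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": "<p>no links here</p>",
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_all(start_url, fetch=PAGES.get):
    """Visit start_url and every page reachable from it, each at most once."""
    seen = {start_url}          # URLs already queued, so loops do not repeat
    queue = deque([start_url])  # URLs waiting to be fetched
    visited = []                # pages actually fetched, in visit order
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # unknown page: nothing to parse
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl_all("http://example.com/") visits the three pages once each even though /a links back to the home page. In pyspider itself the queue and the duplicate check are handled by the scheduler, so the three-line callback above is all the handler needs.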

vortex14 commented 9 years ago

Thanks