collective / transmogrify.webcrawler

transmogrifier source blueprints for crawling html

Skip robots.txt rules? #1

Closed: jmansilla closed this 13 years ago

jmansilla commented 13 years ago

I needed to crawl some pages that, for reasons I don't know, robots.txt said shouldn't be crawled.

I made some changes to your package to add a "laugh_robots" option, where you can list paths that will be removed from the robots.txt restrictions.
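For example, since the blueprint is configured through an INI-style pipeline section, the option might end up looking something like this (the URL and paths are just placeholders, and the exact matching semantics aren't settled yet):

```ini
[transmogrifier]
pipeline =
    webcrawler

[webcrawler]
blueprint = transmogrify.webcrawler
url = http://www.example.com/
# Hypothetical option from this patch: robots.txt rules covering
# these paths would be dropped before the crawler consults them.
laugh_robots =
    /private/
    /archive/*
```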

Is this something that you think could be interesting for transmogrify.webcrawler?

djay commented 13 years ago

Yes, I think that's a good idea. Does it support globs?
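By globs I mean shell-style patterns matched against the Disallow paths, e.g. with Python's fnmatch. A minimal sketch of what I have in mind (the function name and data flow are invented here, not taken from the patch):

```python
from fnmatch import fnmatch

def filter_robots_rules(disallowed_paths, laugh_robots):
    """Drop robots.txt Disallow paths that match any configured
    laugh_robots pattern; fnmatch gives shell-style glob matching."""
    return [
        path for path in disallowed_paths
        if not any(fnmatch(path, pattern) for pattern in laugh_robots)
    ]

# Example: ignore the rules covering /private/ pages.
print(filter_robots_rules(["/private/", "/tmp/", "/admin/"],
                          ["/private/*"]))
# -> ['/tmp/', '/admin/']
```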

I'm also about to move this package to the GitHub collective, so you'll be able to make these changes directly.