ScottMansfield / widow

Distributed, asynchronous web crawler
GNU Lesser General Public License v2.1

Add support for robots.txt for any website #2

Open ScottMansfield opened 9 years ago

ScottMansfield commented 9 years ago

The robots.txt rules should survive restarts and be per-domain.
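To illustrate the persistence requirement, here is a minimal sketch of a per-domain cache that writes each fetched robots.txt body to disk and reloads it on startup. All class and method names are hypothetical, not widow's actual code, and the domain is used directly as a file name for simplicity:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: keeps one robots.txt body per domain and persists
// each body to a file so the rules survive crawler restarts.
public class RobotsCache {
    private final Path cacheDir;
    private final ConcurrentMap<String, String> bodies = new ConcurrentHashMap<>();

    public RobotsCache(Path cacheDir) throws IOException {
        this.cacheDir = Files.createDirectories(cacheDir);
        // Reload previously fetched bodies on startup.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(cacheDir)) {
            for (Path f : files) {
                bodies.put(f.getFileName().toString(),
                           new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
            }
        }
    }

    public void put(String domain, String robotsTxtBody) throws IOException {
        bodies.put(domain, robotsTxtBody);
        Files.write(cacheDir.resolve(domain),
                    robotsTxtBody.getBytes(StandardCharsets.UTF_8));
    }

    public String get(String domain) {
        return bodies.get(domain); // null means not fetched yet
    }
}
```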

See http://www.robotstxt.org/robotstxt.html for some examples. I didn't find any standard Java parsers online in a quick search, so a custom parser may be needed. It might be a separate project.
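If a custom parser does turn out to be necessary, a minimal version only needs to group Disallow lines under the matching User-agent sections. This is an illustrative sketch only (the class name is made up, and it ignores Allow, Crawl-delay, and Sitemap directives):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal parser: collects Disallow path prefixes for the
// groups that apply to the given user-agent (including the "*" group).
public class MinimalRobotsParser {
    public static List<String> disallowedPrefixes(String robotsTxt, String agent) {
        List<String> prefixes = new ArrayList<>();
        boolean groupApplies = false;
        boolean inAgentRun = false; // consecutive User-agent lines share a group
        for (String raw : robotsTxt.split("\r?\n")) {
            String line = raw.replaceAll("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*")
                        || agent.toLowerCase().contains(value.toLowerCase());
                groupApplies = inAgentRun ? (groupApplies || match) : match;
                inAgentRun = true;
            } else {
                inAgentRun = false;
                if (groupApplies && field.equals("disallow") && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed unless it starts with any collected Disallow prefix.
    public static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```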

ScottMansfield commented 9 years ago

crawler-commons has a robots.txt parser:

https://github.com/crawler-commons/crawler-commons
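Usage would look roughly like the sketch below. The signatures here (SimpleRobotRulesParser.parseContent and BaseRobotRules.isAllowed) match older releases of crawler-commons and may have changed since, and the "widow" agent token is just an assumption:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) {
        // Fetching is elided; assume the robots.txt body is already in hand.
        String robotsUrl = "http://example.com/robots.txt";
        String body = "User-agent: *\nDisallow: /private/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                robotsUrl,
                body.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "widow"); // this crawler's user-agent token (assumed)

        System.out.println(rules.isAllowed("http://example.com/private/page")); // false
        System.out.println(rules.isAllowed("http://example.com/public/page"));  // true
    }
}
```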

ScottMansfield commented 9 years ago

This is being done in the terminator project.

ScottMansfield commented 9 years ago

Work done in 22b72fd9f0a43540a39896a8f9e3978a2358503c and 5af862347c29e5d215264b90f61c4bcedb97a62e.