ScottMansfield / widow

Distributed, asynchronous web crawler
GNU Lesser General Public License v2.1

Check for robots meta tag while parsing a page #15

Open ScottMansfield opened 9 years ago

ScottMansfield commented 9 years ago

If nofollow is present, don't enqueue the page's links for fetching.

If noindex is present, don't index the current page.

Default to index, follow, meaning index the current page and send its links to the parse stage for processing.

If the tag exists, split its content attribute on commas and parse the resulting list of directives. Whitespace will be ignored.
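A rough sketch of the parsing described above, written in Python with the standard library's html.parser. The names RobotsMetaParser and robots_policy are hypothetical and not part of widow; the sketch only illustrates the defaults and the comma/whitespace handling, and is not a proposed implementation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not part of widow.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so normalize to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # Split the content attribute on commas, ignoring whitespace and case.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (should_index, should_follow), defaulting to index, follow."""
    parser = RobotsMetaParser()
    parser.feed(html)
    should_index = "noindex" not in parser.directives
    should_follow = "nofollow" not in parser.directives
    return should_index, should_follow
```

A page with no robots meta tag yields (True, True), which matches the index, follow default.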

ScottMansfield commented 9 years ago

The nofollow directive also implies that no HEAD requests should be made to check the Content-Type of the page's links.
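To make that concrete, here is a small Python sketch. The function plan_requests and the (method, url) tuples are hypothetical, and it assumes a HEAD probe is the first request issued for each discovered link; widow's actual pipeline may be organized differently.

```python
def plan_requests(links, should_follow):
    """Return the (method, url) requests to issue for a page's outgoing links.

    Hypothetical helper for illustration; not part of widow.
    """
    if not should_follow:
        # nofollow: skip the links entirely, including the HEAD probes
        # that would otherwise check Content-Type.
        return []
    # Assumption: each link is first probed with HEAD to check Content-Type.
    return [("HEAD", url) for url in links]
```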