bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
MIT License

Does node-crawler honor robots.txt #63

Closed: vumaasha closed this issue 10 years ago

vumaasha commented 11 years ago

Hi, does node-crawler obey the robots.txt exclusion standard described at http://www.robotstxt.org/wc/exclusion.html#robotstxt, and the robots META tag described at http://www.robotstxt.org/wc/meta-user.html? If not, how can this be achieved with node-crawler?

aurium commented 10 years ago

Updated links: http://www.robotstxt.org/orig.html#examples and http://www.robotstxt.org/meta.html

aurium commented 10 years ago

@vumaasha, I believe that since you must queue each wanted page yourself, you must also test by yourself whether the page is allowed by robots.txt. A sketch of such a check is below.
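
For example, a minimal sketch of that check, assuming the request and robots-parser npm packages are installed (neither is part of node-crawler, and queueIfAllowed is just an illustrative name):

request = require 'request'             # assumption: request is installed
robotsParser = require 'robots-parser'  # assumption: robots-parser is installed
Crawler = require 'crawler'
url = require 'url'

c = new Crawler

# Fetch the domain's robots.txt, then queue the page only if it is allowed
queueIfAllowed = (pageUri, userAgent = 'node-crawler') ->
    parsed = url.parse pageUri
    robotsUri = "#{parsed.protocol}//#{parsed.host}/robots.txt"
    request robotsUri, (err, res, body) ->
        # No reachable robots.txt: treat everything as allowed
        return c.queue pageUri if err or res.statusCode isnt 200
        robots = robotsParser robotsUri, body
        if robots.isAllowed pageUri, userAgent
            c.queue pageUri
        else
            console.log "Skipping disallowed URI: #{pageUri}"

queueIfAllowed 'https://duckduckgo.com/?q=html'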

...But I also believe a built-in method, say robotsAllowed, that receives a URI, fetches the domain's robots.txt, and answers whether the path is allowed, would be useful! As for the robots meta tag, you may consider this CoffeeScript code:

Crawler = require 'crawler'
c = new Crawler

getMetaRobots = ($) ->
    metaRobots = index: true, follow: true
    content = $('meta[name="robots"]').attr('content')
    # Pages without a robots meta tag allow indexing and following by default
    return metaRobots unless content
    for val in content.toLowerCase().split(/,\s*/g)
        if block = (val[0..1] == 'no') then val = val[2..]
        metaRobots[val] = not block # this also accepts non-standard directives
    metaRobots

c.queue uri: 'https://duckduckgo.com/?q=html', callback: (err, result, $) ->
    metaRobots = getMetaRobots $
    console.log 'May I index its content? ' + metaRobots.index
    console.log 'May I follow its links? ' + metaRobots.follow

You must also consider rel="nofollow" on links; a sketch for filtering those out is below.
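
For instance, a minimal sketch of such a filter using the same server-side jQuery ($) from the callback (followableLinks is just an illustrative name):

# Collect hrefs from anchors whose rel attribute does not contain "nofollow"
followableLinks = ($) ->
    links = []
    $('a[href]').each ->
        rel = ($(@).attr('rel') or '').toLowerCase()
        links.push $(@).attr('href') unless 'nofollow' in rel.split /\s+/
    links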

digitalfrost commented 10 years ago

This can be closed. It has been answered by @aurium.