vumaasha closed this issue 10 years ago.
@vumaasha, I believe that since you have to queue each wanted page yourself, you also have to test whether the page is allowed by robots.txt yourself.
But I also believe a method called robotsAllowed, which receives a URI, fetches that domain's robots.txt, and answers whether the path is allowed, would be useful! (A sketch of that idea follows the meta-robots example below.) As for the robots meta tag, you may consider this CoffeeScript code:
```coffee
Crawler = require 'crawler'

c = new Crawler

# Parse the robots meta tag from the cheerio $ handed to the callback.
getMetaRobots = ($) ->
  metaRobots = index: true, follow: true
  content = $('meta[name="robots"]').attr('content') or ''
  for val in content.toLowerCase().split(/,\s*/g)
    continue unless val
    if block = (val[0..1] is 'no') then val = val[2..-1]
    metaRobots[val] = not block # this also allows new non-standard attributes
  metaRobots

c.queue uri: 'https://duckduckgo.com?q=html', callback: (err, result, $) ->
  metaRobots = getMetaRobots $
  console.log 'May I index its content? ' + metaRobots.index
  console.log 'May I follow its links? ' + metaRobots.follow
```
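As for the robots.txt check mentioned above, here is a minimal sketch of that robotsAllowed idea. It is not part of node-crawler's API, and the helper name and its simplifications are mine: it fetches /robots.txt over plain http/https, only honours the `User-agent: *` group, and treats Disallow rules as simple path prefixes, so anything serious should use a real robots.txt parser.

```coffee
http  = require 'http'
https = require 'https'
url   = require 'url'

# Sketch: robotsAllowed(uri, callback) calls back with true when the path
# looks allowed for "User-agent: *", and also when robots.txt is missing
# or unreachable (nothing forbidden means everything is allowed).
robotsAllowed = (uri, callback) ->
  parsed = url.parse uri
  client = if parsed.protocol is 'https:' then https else http
  req = client.get "#{parsed.protocol}//#{parsed.host}/robots.txt", (res) ->
    unless res.statusCode is 200
      res.resume() # discard the body, nothing to parse
      return callback true
    body = ''
    res.on 'data', (chunk) -> body += chunk
    res.on 'end', ->
      disallows = []
      inStarGroup = false
      for line in body.split('\n')
        line = line.replace(/#.*$/, '').trim()
        if (m = line.match(/^user-agent:\s*(.+)$/i))
          inStarGroup = m[1].trim() is '*'
        else if inStarGroup and (m = line.match(/^disallow:\s*(.+)$/i))
          disallows.push m[1].trim()
      path = parsed.path or '/'
      allowed = not disallows.some((rule) -> path.indexOf(rule) is 0)
      callback allowed
  req.on 'error', -> callback true
```

Before queueing you would then call something like `robotsAllowed uri, (allowed) -> c.queue uri if allowed`.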
You must also consider rel="nofollow" on links.
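Since the queue callback already hands you a cheerio-style `$`, one rough way to honour it when collecting links to queue is to skip nofollow anchors; `getFollowableLinks` below is just an illustrative helper, not a node-crawler feature:

```coffee
# Collect hrefs from the page, skipping anchors marked rel="nofollow".
getFollowableLinks = ($) ->
  links = []
  $('a[href]').each ->
    rel = ($(@).attr('rel') or '').toLowerCase()
    links.push $(@).attr('href') unless rel.indexOf('nofollow') >= 0
  links
```

Queueing only what getFollowableLinks($) returns, instead of every anchor on the page, keeps the crawl polite.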
Can close. This has been answered by @aurium.
Hi, does node-crawler obey the robots.txt exclusion standard, described at http://www.robotstxt.org/wc/exclusion.html#robotstxt, and the robots META tag, described at http://www.robotstxt.org/wc/meta-user.html? If not, how can this be achieved with node-crawler?