ekalinin / robots.js

Parser for robots.txt for node.js
MIT License
66 stars 21 forks

Everything returns true... #17

morganrallen closed 9 years ago

morganrallen commented 9 years ago

or I'm totally missing something.

Anyhow, here's the snippet I'm using.

var robots = require("robots");
var parser = new robots.RobotsParser();

parser.setUrl("http://slashdot.org", function(parser, success) {
  console.log("success: %s", success);

  parser.canFetch("DerpBot", "http://slashdot.org/zoo.pl", function(access) {
    console.log("access: %s", access);
  });
});

and a snippet of their robots.txt

User-agent: *
Crawl-delay: 1
Disallow: /authors.pl
Disallow: /index.pl
Disallow: /pollBooth.pl
Disallow: /pubkey.pl
Disallow: /topics.pl
Disallow: /zoo.pl
Disallow: /palm
Disallow: /slashdot-it.pl
Disallow: slashdot-it.pl
Disallow: authors.pl

and finally output

success: true
access: true

sebastianwessel commented 9 years ago

If you enable debug (/lib/utils.js) you will get more information, and you will see that no valid robots.txt is returned: the response is an HTML file. So the parser never gets a robots.txt, which means there are no valid rules and access defaults to true. That's why it always returns true...
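
To make the "no rules means allowed" point concrete, here is a simplified sketch. It is not the robots module's actual code: parseDisallows and isAllowed are hypothetical helpers, and they only handle the "User-agent: *" group with plain prefix matching. Fed an HTML page, the parser finds no rule lines, so every path comes back as allowed.

```javascript
// Simplified illustration only: this is NOT the robots module's code.
// parseDisallows and isAllowed are hypothetical helper names.

// Collect the Disallow paths that apply to every user agent ("*").
function parseDisallows(body) {
  const rules = [];
  let appliesToAll = false;
  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.trim();
    const match = /^([A-Za-z-]+)\s*:\s*(.*)$/.exec(line);
    if (!match) continue; // HTML markup and other non-rule lines are skipped
    const field = match[1].toLowerCase();
    const value = match[2].trim();
    if (field === "user-agent") {
      appliesToAll = value === "*";
    } else if (field === "disallow" && appliesToAll && value !== "") {
      rules.push(value);
    }
  }
  return rules;
}

// With an empty rule list nothing is forbidden, so the answer is always true.
function isAllowed(rules, path) {
  return !rules.some((prefix) => path.startsWith(prefix));
}

// What the parser effectively sees when pointed at the bare domain.
const htmlBody = "<!DOCTYPE html>\n<html><head><title>Slashdot</title></head></html>";
// A shortened version of the real rules from the report above.
const robotsBody = "User-agent: *\nCrawl-delay: 1\nDisallow: /zoo.pl\nDisallow: /palm";

console.log(isAllowed(parseDisallows(htmlBody), "/zoo.pl"));   // true
console.log(isAllowed(parseDisallows(robotsBody), "/zoo.pl")); // false
```

The first call mirrors the behaviour reported in this issue; the second shows what happens once the parser is given real robots.txt content.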

sebastianwessel commented 9 years ago

...maybe you should point to robots.txt and not just to the domain ;-)

var robots = require("robots");
var parser = new robots.RobotsParser();

parser.setUrl("http://slashdot.org/robots.txt", function(parser, success) {
  console.log("success: %s", success);

  parser.canFetch("DerpBot", "http://slashdot.org/zoo.pl", function(access) {
    console.log("access: %s", access);
  });
});
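
If you only have a site or page URL to start from, one way to avoid this mistake is to derive the robots.txt location before calling setUrl. A minimal sketch, assuming a Node.js version with the global URL class; robotsTxtUrl is a hypothetical helper, not part of the robots module:

```javascript
// robotsTxtUrl is a hypothetical helper, not part of the robots module.
function robotsTxtUrl(siteUrl) {
  // Resolving an absolute path against a URL keeps its scheme and host
  // and replaces the path, query string, and fragment.
  return new URL("/robots.txt", siteUrl).href;
}

console.log(robotsTxtUrl("http://slashdot.org"));
// http://slashdot.org/robots.txt
console.log(robotsTxtUrl("http://slashdot.org/zoo.pl?x=1"));
// http://slashdot.org/robots.txt
```

The result can then be passed to parser.setUrl in place of the bare domain.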

ekalinin commented 9 years ago

Please, read this:

Your snippet should be:

var robots = require("robots");
var parser = new robots.RobotsParser();

parser.setUrl("http://slashdot.org/robots.txt", function(parser, success) {
  console.log("success: %s", success);

  parser.canFetch("DerpBot", "/zoo.pl", function(access) {
    console.log("access: %s", access);
  });
});

Here are my results:

➥ npm -g install robots
robots@0.9.4 ~/tmp/nodeenv/env/lib/node_modules/robots
➥ node test.js 
success: true
access: false