ekalinin / robots.js

Parser for robots.txt for node.js
MIT License

Don't return first rule match for canFetch #18

Open sebastianwessel opened 9 years ago

sebastianwessel commented 9 years ago

Currently the first matching rule is returned, but I don't think that's a good idea.

For example, canFetch will always return true with this robots.txt, because the first rule checked ("Allow: /") matches every URL:

User-Agent: *
Allow: /
Disallow: /admin/
Disallow: /redirect/

Change /lib/entry.js:

Entry.prototype.allowance = function(url) {
  ut.d('* Entry.allowance, url: ' + url);
  // Default to allowed, then let every applicable rule overwrite the
  // result: the LAST matching rule wins instead of the first one.
  var ret = true;
  for (var i = 0, len = this.rules.length, rule; i < len; i++) {
    rule = this.rules[i];
    if (rule.appliesTo(url)) {
      ret = rule.allowance;
    }
  }
  return ret;
};

...this will return the last matching rule instead of the first.
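To make the difference concrete, here is a standalone sketch; the { path, allowance } objects and plain prefix matching are simplifications for illustration, not robots.js internals:

// Rules from the sample robots.txt above, in file order.
var rules = [
  { path: '/',          allowance: true  },  // Allow: /
  { path: '/admin/',    allowance: false },  // Disallow: /admin/
  { path: '/redirect/', allowance: false }   // Disallow: /redirect/
];

// Current behavior: stop at the first rule whose path matches.
function firstMatch(url) {
  for (var i = 0; i < rules.length; i++) {
    if (url.indexOf(rules[i].path) === 0) return rules[i].allowance;
  }
  return true;
}

// Proposed behavior: let every matching rule overwrite the result.
function lastMatch(url) {
  var ret = true;
  for (var i = 0; i < rules.length; i++) {
    if (url.indexOf(rules[i].path) === 0) ret = rules[i].allowance;
  }
  return ret;
}

console.log(firstMatch('/admin/users')); // true  -- "Allow: /" matches first
console.log(lastMatch('/admin/users'));  // false -- "Disallow: /admin/" wins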

srmor commented 9 years ago

This is a big issue, but I'm not totally sure returning the last matching rule really is the fix. Is there any way to determine whether a rule is more specific? Ultimately we want to get the most specific rule.

spinatmensch commented 9 years ago

...it depends on how you like to interpret rules. In most ACL cases you write something like:

- Disallow something
- Explicitly allow some specific thing
- Disallow some more specific thing which would normally be allowed by the previous rule

Or the reverse case (see the robots.txt sketch below):

- Allow all
- Disallow a specific thing
- Allow a more specific thing which would normally be disallowed by the previous rule
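As an illustration of the reverse case (the paths here are invented for the example), a robots.txt like the following allows everything, blocks one directory, then re-allows a single file inside it:

User-Agent: *
Allow: /
Disallow: /private/
Allow: /private/public-report.html

Under last-match semantics /private/public-report.html stays fetchable while the rest of /private/ is blocked; under first-match semantics "Allow: /" shadows both later lines, so everything is fetchable.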

For robots.txt there is officially no "allow" command; only "disallow" is part of the original standard. So a robots.txt should normally contain only "disallow" commands to ensure correct interpretation. The current "return on first matching rule" is therefore correct, and the fastest approach, if the robots.txt contains only "disallow" commands or if you only respect "disallow". But most big search engines also interpret an "allow" command so they can crawl more pages, and in that case the last matching rule wins, because it is always the most specific rule - see the samples above.

And remember: robots.txt is NOT a "you should not crawl" command, it's more "please, don't crawl" or "crawling of... is not necessary".

So in my eyes it's up to the creator of the ACL to ensure the correct order of rules and the use of "allow", and there is no way to determine a "more specific rule" - it's like the army: "the last order rules", if you respect the "allow" command.

srmor commented 9 years ago

Very true... but I guess it really depends on whether you want it to accurately interpret all robots.txt files or just the ones that strictly follow the spec (practically none of them).

ghost commented 7 years ago

At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?csw=1#order-of-precedence-for-group-member-records
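A minimal sketch of that longest-match precedence, reusing the simplified { path, allowance } rule shape from the example above rather than robots.js's internal rule objects, and ignoring wildcards (whose precedence Google leaves undefined anyway):

// Among all rules whose path is a prefix of the url, the longest
// path wins; if no rule matches at all, the url is allowed.
function allowance(rules, url) {
  var best = null;
  for (var i = 0; i < rules.length; i++) {
    var rule = rules[i];
    if (url.indexOf(rule.path) === 0 &&
        (best === null || rule.path.length > best.path.length)) {
      best = rule;
    }
  }
  return best === null ? true : best.allowance;
}

With the sample rules from this issue, allowance(rules, '/admin/users') picks "Disallow: /admin/" (path length 7) over "Allow: /" (path length 1), so the most specific rule wins regardless of the order in which the rules appear.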