amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Follow Google example of giving Allow directives higher match weight than Disallow directives #21

Closed. GoogleCodeExporter closed this issue 8 years ago.

GoogleCodeExporter commented 8 years ago
According to Wikipedia, which references this article 
(http://blog.semetrical.com/googles-secret-approach-to-robots-txt/), 
"...Google's implementation differs in that Allow patterns with equal or more 
characters in the directive path win over a matching Disallow pattern.[18] Bing 
uses the Allow or Disallow directive which is the most specific.[8]"

See also 
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt 
for details on how Google interprets robots.txt files.
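
As a minimal illustration of that rule (adapted from the sample situations in the Google doc above): given

    User-agent: *
    Allow: /folder
    Disallow: /folder

a URL such as http://example.com/folder/page matches both directives, and the two paths have equal length, so under Google's interpretation the Allow wins and the page may be crawled.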

Original issue reported on code.google.com by kkrugler...@transpac.com on 17 Mar 2013 at 6:41

GoogleCodeExporter commented 8 years ago
If you change SimpleRobotRules.java like this:

    protected class RobotRule implements Comparable<RobotRule> {
        String _prefix;
        Pattern _pattern;
        boolean _allow;

        public RobotRule(String prefix, boolean allow) {
            _prefix = prefix;
            _pattern = null;
            _allow = allow;
        }

        public RobotRule(Pattern pattern, boolean allow) {
            _prefix = null;
            _pattern = pattern;
            _allow = allow;
        }

        // Order disallow rules before allow rules: an allow rule compares
        // greater than a disallow rule, and rules of the same type
        // compare as equal.
        @Override
        public int compareTo(RobotRule o) {
            if (this._allow == o._allow)
                return 0;
            else if (this._allow && !o._allow)
                return 1;
            else
                return -1;
        }
    }

And change the rule collection to

    private TreeSet<RobotRule> _rules;

then this problem will be fixed. Disallow gets higher priority, so those rules go first.
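
For context, a rough sketch (mine, not part of the patch) of how the sorted rule set might then be consulted, assuming the first matching rule decides and ruleMatches is a hypothetical helper:

    // Walk the rules in sorted order; the first rule whose prefix or
    // pattern matches the path decides the outcome.
    public boolean isAllowed(String path) {
        for (RobotRule rule : _rules) {
            if (ruleMatches(path, rule)) {  // ruleMatches: hypothetical helper
                return rule._allow;
            }
        }
        return true; // no rule matched, so the path is allowed by default
    }

One caveat with TreeSet: rules whose compareTo returns 0 are treated as duplicates and silently dropped, so a sorted List may be safer in practice.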

Original comment by y.vladim...@semrush.com on 24 Oct 2013 at 12:01

GoogleCodeExporter commented 8 years ago
The above changes would put allow rules before disallow rules, but the Google 
implementation has an additional condition, where the "allow before disallow" 
heuristic is only triggered if the allow pattern has equal or more characters 
in the path when compared to a disallow path.

So if my allow rule was /dir and there was also a disallow rule with 
/dir/subdir, and both matched, then the disallow would win.
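
Rendered as a robots.txt snippet (my illustration):

    Allow: /dir            # path length 4
    Disallow: /dir/subdir  # path length 11

For a URL like http://example.com/dir/subdir/page, both rules match, but the Disallow path is longer, so it wins.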

Original comment by kkrugler...@transpac.com on 25 Oct 2013 at 12:20

GoogleCodeExporter commented 8 years ago
Hmm, so if we sorted by prefix length first (longer goes first), and then by 
allow before disallow, I think we'd mostly get the implementation right.

Original comment by kkrugler...@transpac.com on 25 Oct 2013 at 12:33

GoogleCodeExporter commented 8 years ago
OK:

    // Sort by prefix length first (this assumes _prefix is non-null for
    // both rules, i.e. prefix-based rather than pattern-based rules),
    // then order allow rules after disallow rules.
    @Override
    public int compareTo(RobotRule o) {
        if (_prefix.length() > o._prefix.length())
            return 1;
        else if (_prefix.length() < o._prefix.length())
            return -1;
        else if (this._allow == o._allow)
            return 0;
        else if (this._allow && !o._allow)
            return 1;
        else
            return -1;
    }

Original comment by y.vladim...@semrush.com on 25 Oct 2013 at 12:12

GoogleCodeExporter commented 8 years ago
What would be great is a test that tries out the examples at the end of 
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt 
(with both orders of the allow/disallow rules) to validate whether the above 
would actually work.
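
Something along these lines, as an untested sketch against the crawler-commons API (parseContent/isAllowed), using the equal-length sample case from the Google doc; the assertion encodes the behavior we want:

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class GoogleAllowDisallowTest {

        @Test
        public void equalLengthAllowWins() {
            // Equal-length Allow and Disallow paths: Allow should win.
            String robotsTxt = "User-agent: *\n"
                    + "Allow: /folder\n"
                    + "Disallow: /folder\n";

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt",
                    robotsTxt.getBytes(), "text/plain", "mybot");

            assertTrue(rules.isAllowed("http://example.com/folder/page"));
        }
    }

Repeating the same check with the Disallow line first would cover the "both orders" case.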

Original comment by kkrugler...@transpac.com on 26 Oct 2013 at 11:16

GoogleCodeExporter commented 8 years ago
Rolled in change as per y.vladimirov in r116.

Original comment by kkrugler...@transpac.com on 14 Mar 2014 at 12:03