bopoda / robots-txt-parser

PHP class to parse all directives from robots.txt files according to the specifications
http://robots.jeka.by
MIT License

IsUrlAllow Google test #14

Closed LeMoussel closed 7 years ago

LeMoussel commented 7 years ago

From the Google robots.txt specification, section "Order of precedence for group-member records":

// Google test (assumes the library's classes are autoloaded, e.g. via Composer)
$robotsTxtContent = <<<ROBOTS
User-agent: *
Allow: /p
Disallow: /
Allow: /folder/
Disallow: /folder
Allow: /page
Disallow: /*.htm
Allow: /$
ROBOTS;
$parser = new RobotsTxtParser($robotsTxtContent);
$robotsTxtValidator = new RobotsTxtValidator($parser->getRules());
// http://example.com/page  => allow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/page');
// http://example.com/folder/page => allow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/folder/page');
// http://example.com/ => allow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/');
// http://example.com/page.htm => disallow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/page.htm');
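
As a quick aggregate check, a loop like this works (my sketch, reusing the $robotsTxtValidator from above; the expected values are the verdicts in the comments, with the last case treated as disallow, which is what the issue reports):

$expected = [
    'http://example.com/page'        => true,  // allow
    'http://example.com/folder/page' => true,  // allow
    'http://example.com/'            => true,  // allow
    'http://example.com/page.htm'    => false, // disallow (see discussion below)
];
foreach ($expected as $url => $isAllowed) {
    assert($robotsTxtValidator->isUrlAllow($url) === $isAllowed);
}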
bopoda commented 7 years ago

OK, I will check it later, but only next week.

bopoda commented 7 years ago

PR with tests: https://github.com/bopoda/robots-txt-parser/pull/21

From the Google spec:

| URL | allow: | disallow: | Verdict | Comments |
|---|---|---|---|---|
| http://example.com/page | /p | / | allow | RobotsTxtValidator - ok |
| http://example.com/folder/page | /folder/ | /folder | allow | RobotsTxtValidator - ok |
| http://example.com/page.htm | /page | /*.htm | undefined | Detected as disallow at the moment. If it is undefined on Google's side, should we fix it? |
| http://example.com/ | /$ | / | allow | RobotsTxtValidator - ok |
| http://example.com/page.htm | /$ | / | disallow | RobotsTxtValidator - ok |
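
For reference, here is a minimal standalone sketch of the precedence rule behind these verdicts (my own illustration, not the library's algorithm; the helper names matches, longestMatch, and verdict are hypothetical): the longest matching path entry wins, and precedence involving wildcard rules is undefined per the spec.

<?php
// Sketch of Google's order-of-precedence rule (illustration only).

function matches(string $pattern, string $path): bool
{
    // '*' matches any character sequence; a trailing '$' anchors the end.
    $regex = str_replace(['\*', '\$'], ['.*', '$'], preg_quote($pattern, '#'));
    return (bool) preg_match('#^' . $regex . '#', $path);
}

function longestMatch(array $patterns, string $path): ?string
{
    $best = null;
    foreach ($patterns as $p) {
        if (matches($p, $path) && ($best === null || strlen($p) > strlen($best))) {
            $best = $p;
        }
    }
    return $best;
}

function verdict(string $path, array $allows, array $disallows): string
{
    $allow    = longestMatch($allows, $path);
    $disallow = longestMatch($disallows, $path);

    if ($disallow === null) return 'allow';    // no rule forbids the path
    if ($allow === null)    return 'disallow';
    if (strpos($allow, '*') !== false || strpos($disallow, '*') !== false) {
        return 'undefined';                    // wildcard precedence is undefined
    }
    return strlen($allow) >= strlen($disallow) ? 'allow' : 'disallow';
}

echo verdict('/page.htm', ['/page'], ['/*.htm']), "\n";        // undefined
echo verdict('/folder/page', ['/folder/'], ['/folder']), "\n"; // allow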
LeMoussel commented 7 years ago

Good job, thanks a lot 👍 Perhaps for Google, "undefined" may be either Allow or Disallow; it depends on the order of rules in the robots.txt (see issue #13).

bopoda commented 7 years ago

I will add more tests from the Google robots.txt specification, maybe all of them. There are a lot of cases; it will give information about the library's correctness.

LeMoussel commented 7 years ago

Perhaps Google's own robots.txt file can help build tests.

bopoda commented 7 years ago

https://github.com/bopoda/robots-txt-parser/pull/22 As far as I can see, all cases from the table at https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#example-path-matches look correct except the case-sensitivity ones. Travis build: https://travis-ci.org/bopoda/robots-txt-parser/jobs/191336766.

In the Google spec:

> The `<field>` element is case-insensitive. The `<value>` element may be case-sensitive, depending on the element.

@LeMoussel should we make values case-sensitive? Right now the parser lowercases all of them.

LeMoussel commented 7 years ago

Yes, make values case-sensitive (note: URLs are case-sensitive). Lowercasing all of them is a bug.
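
To illustrate the bug (a hypothetical example, not the parser's code): lowercasing a rule's value both misses the URL it should match and matches one it should not.

$rule       = '/Folder';
$lowercased = strtolower($rule); // what the parser did: '/Folder' becomes '/folder'

var_dump(strpos('/Folder/page', $lowercased) === 0); // false: the intended URL is missed
var_dump(strpos('/folder/page', $lowercased) === 0); // true: a different URL is matched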

bopoda commented 7 years ago

https://github.com/bopoda/robots-txt-parser/pull/23 makes RobotsTxtParser keep values (URLs) case-sensitive, and https://github.com/bopoda/robots-txt-parser/pull/22 adds the tests from the Google spec and fixes RobotsTxtValidator to check allowed/disallowed URLs case-sensitively. All tests from the Google specification now pass on master.
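
With the fix in place, a case-sensitive check along these lines should hold (a sketch using the same API as the example above; the rule and URLs are made up):

$parser    = new RobotsTxtParser("User-agent: *\nDisallow: /Folder");
$validator = new RobotsTxtValidator($parser->getRules());

var_dump($validator->isUrlAllow('http://example.com/folder')); // true: case differs from the rule
var_dump($validator->isUrlAllow('http://example.com/Folder')); // false: the rule matches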