bopoda / robots-txt-parser

PHP class to parse all directives from robots.txt files according to the specifications
http://robots.jeka.by
MIT License

IsUrlAllow Google test #14

Closed LeMoussel closed 7 years ago

LeMoussel commented 7 years ago

From the Google robots.txt specification, section "Order of precedence for group-member records":

// Google test (assumes the library's classes are autoloaded, e.g. via Composer)
$robotsTxtContent = <<<ROBOTS
User-agent: *
Allow: /p
Disallow: /
Allow: /folder/
Disallow: /folder
Allow: /page
Disallow: /*.htm
Allow: /$
ROBOTS;
$parser = new RobotsTxtParser($robotsTxtContent);
$robotsTxtValidator = new RobotsTxtValidator($parser->getRules());
// http://example.com/page  => allow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/page');
// http://example.com/folder/page => allow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/folder/page');
// http://example.com/ => allow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/');
// http://example.com/page.htm => disallow
$bVal = $robotsTxtValidator->isUrlAllow('http://example.com/page.htm');
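
As a quick aggregate check, a loop like this works (my sketch, reusing the $robotsTxtValidator from above; the expected values are the verdicts in the comments, with the last case treated as disallow, which is what the issue reports):

$expected = [
    'http://example.com/page'        => true,  // allow
    'http://example.com/folder/page' => true,  // allow
    'http://example.com/'            => true,  // allow
    'http://example.com/page.htm'    => false, // disallow (see discussion below)
];
foreach ($expected as $url => $isAllowed) {
    assert($robotsTxtValidator->isUrlAllow($url) === $isAllowed);
}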
bopoda commented 7 years ago

OK, I will check it later, but only next week.

bopoda commented 7 years ago

PR with tests: https://github.com/bopoda/robots-txt-parser/pull/21

From the Google spec:

| URL | allow: | disallow: | Verdict | Comments |
|---|---|---|---|---|
| http://example.com/page | /p | / | allow | RobotsTxtValidator - ok |
| http://example.com/folder/page | /folder/ | /folder | allow | RobotsTxtValidator - ok |
| http://example.com/page.htm | /page | /*.htm | undefined | Detected as disallow at the moment. If it is undefined on Google's side, should we fix it? |
| http://example.com/ | /$ | / | allow | RobotsTxtValidator - ok |
| http://example.com/page.htm | /$ | / | disallow | RobotsTxtValidator - ok |
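
For reference, here is a minimal standalone sketch of the precedence rule behind these verdicts (my own illustration, not the library's algorithm; the helper names matches, longestMatch, and verdict are hypothetical): the longest matching path entry wins, and precedence involving wildcard rules is undefined per the spec.

<?php
// Sketch of Google's order-of-precedence rule (illustration only).

function matches(string $pattern, string $path): bool
{
    // '*' matches any character sequence; a trailing '$' anchors the end.
    $regex = str_replace(['\*', '\$'], ['.*', '$'], preg_quote($pattern, '#'));
    return (bool) preg_match('#^' . $regex . '#', $path);
}

function longestMatch(array $patterns, string $path): ?string
{
    $best = null;
    foreach ($patterns as $p) {
        if (matches($p, $path) && ($best === null || strlen($p) > strlen($best))) {
            $best = $p;
        }
    }
    return $best;
}

function verdict(string $path, array $allows, array $disallows): string
{
    $allow    = longestMatch($allows, $path);
    $disallow = longestMatch($disallows, $path);

    if ($disallow === null) return 'allow';    // no rule forbids the path
    if ($allow === null)    return 'disallow';
    if (strpos($allow, '*') !== false || strpos($disallow, '*') !== false) {
        return 'undefined';                    // wildcard precedence is undefined
    }
    return strlen($allow) >= strlen($disallow) ? 'allow' : 'disallow';
}

echo verdict('/page.htm', ['/page'], ['/*.htm']), "\n";        // undefined
echo verdict('/folder/page', ['/folder/'], ['/folder']), "\n"; // allow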
LeMoussel commented 7 years ago

Good job, thanks a lot 👍 Perhaps for Google, "undefined" may be either Allow or Disallow; it depends on the order of rules in the robots.txt (see issue #13).

bopoda commented 7 years ago

I will add more tests from the Google robots.txt specification, maybe all of them. There are a lot of cases; it will give information about the library's correctness.

LeMoussel commented 7 years ago

Perhaps Google's own robots.txt file can help build tests.

bopoda commented 7 years ago

https://github.com/bopoda/robots-txt-parser/pull/22 As far as I can see, all cases from the table at https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#example-path-matches look correct except the case-sensitivity ones. Travis build: https://travis-ci.org/bopoda/robots-txt-parser/jobs/191336766.

In the Google spec:

> The `<field>` element is case-insensitive. The `<value>` element may be case-sensitive, depending on the element.

@LeMoussel should we make values case-sensitive? Right now the parser lowercases all of them.

LeMoussel commented 7 years ago

Yes, make values case-sensitive (note: URLs are case-sensitive). Lowercasing all of them is a bug.
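
To illustrate the bug (a hypothetical example, not the parser's code): lowercasing a rule's value both misses the URL it should match and matches one it should not.

$rule       = '/Folder';
$lowercased = strtolower($rule); // what the parser did: '/Folder' becomes '/folder'

var_dump(strpos('/Folder/page', $lowercased) === 0); // false: the intended URL is missed
var_dump(strpos('/folder/page', $lowercased) === 0); // true: a different URL is matched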

bopoda commented 7 years ago

https://github.com/bopoda/robots-txt-parser/pull/23 makes RobotsTxtParser keep values (URLs) case-sensitive, and https://github.com/bopoda/robots-txt-parser/pull/22 adds the tests from the Google spec and fixes RobotsTxtValidator to check allowed/disallowed URLs case-sensitively. All tests from the Google specification now pass on master.
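
With the fix in place, a case-sensitive check along these lines should hold (a sketch using the same API as the example above; the rule and URLs are made up):

$parser    = new RobotsTxtParser("User-agent: *\nDisallow: /Folder");
$validator = new RobotsTxtValidator($parser->getRules());

var_dump($validator->isUrlAllow('http://example.com/folder')); // true: case differs from the rule
var_dump($validator->isUrlAllow('http://example.com/Folder')); // false: the rule matches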