bopoda / robots-txt-parser

PHP class for parsing all directives from robots.txt files according to the specifications
http://robots.jeka.by
MIT License

isUrlAllow() not handled correctly if no path in Url. #30

Closed: LeMoussel closed this issue 7 years ago

LeMoussel commented 7 years ago

isUrlAllow() fails when there is no path in the URL.

$robotsTxtContentMultipleUA = "
User-agent: *
Allow: /

User-agent: *
Disallow: /?google_comment_id=*

User-agent: *
Disallow: /?replytocom=*

User-agent: *
Disallow: /*/?replytocom=*
";

$parserRobotsTxt = new RobotsTxtParser($robotsTxtContentMultipleUA);
$rulesRobotsTxt = $parserRobotsTxt->getRules();
$robotsTxtValidator = new RobotsTxtValidator($rulesRobotsTxt);

$url1 = 'http://site.com/page2'; // Google Allow
$url2 = 'http://site.com/?replytocom=32'; // Google Disallow
$url3 = 'http://site.com/test/?replytocom=32'; // Google Disallow

// Print the verdict for each test URL.
foreach ([$url1, $url2, $url3] as $url) {
    $verdict = $robotsTxtValidator->isUrlAllow($url) ? 'Allow' : 'Disallow';
    echo "$url => $verdict".PHP_EOL;
}

Result:

http://site.com/page2 => Allow
http://site.com/?replytocom=32 => Allow
http://site.com/test/?replytocom=32 => Allow

Should be:

http://site.com/page2 => Allow
http://site.com/?replytocom=32 => Disallow
http://site.com/test/?replytocom=32 => Disallow

Result with Google robots.txt Tester:

http://site.com/page2 => Allow
http://site.com/?replytocom=32 => Disallow
http://site.com/test/?replytocom=32 => Disallow

LeMoussel commented 7 years ago

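The error seems to be in RobotsTxtValidator::getRelativeUrl(), which does return parse_url($url, PHP_URL_PATH); That call returns only the path component and drops the query string, so a rule such as Disallow: /?replytocom=* is never compared against the ?replytocom=32 part of the URL.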
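To illustrate the suspected cause, here is a minimal sketch using only PHP's built-in parse_url(). The helper name splitUrl() is made up for this example and is not part of the library. It shows which parts of the three test URLs are returned for PHP_URL_PATH and PHP_URL_QUERY.

```php
<?php
// Illustration only: splitUrl() is a hypothetical helper, not library code.
// It returns the path and query components of a URL separately.
function splitUrl(string $url): array
{
    return [
        'path'  => parse_url($url, PHP_URL_PATH),   // path component only
        'query' => parse_url($url, PHP_URL_QUERY),  // null when there is no query string
    ];
}

$urls = [
    'http://site.com/page2',
    'http://site.com/?replytocom=32',
    'http://site.com/test/?replytocom=32',
];

foreach ($urls as $url) {
    $parts = splitUrl($url);
    echo $url, ' => path=', var_export($parts['path'], true),
        ' query=', var_export($parts['query'], true), PHP_EOL;
}
```

For the second URL the path is just "/", and "replytocom=32" is only available through PHP_URL_QUERY, which is consistent with the Disallow rules never matching.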

I am testing a patch. If it works, I will open a PR.
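For readers following along, one possible shape for such a fix is sketched below. This is not the actual patch and not the library's code: the function name relativeUrlWithQuery() and the fallback to "/" are assumptions made for illustration. The idea is simply to keep the query string next to the path so that rules containing "?" have something to match against.

```php
<?php
/**
 * Hypothetical replacement for the path-only lookup (assumption, not the
 * library's implementation): returns the path followed by "?query" if present.
 */
function relativeUrlWithQuery(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return '/'; // assumption: treat a seriously malformed URL as the site root
    }

    // A URL such as http://site.com has no path component; fall back to "/".
    $relative = $parts['path'] ?? '/';

    if (isset($parts['query'])) {
        $relative .= '?' . $parts['query'];
    }

    return $relative;
}

echo relativeUrlWithQuery('http://site.com/page2'), PHP_EOL;
echo relativeUrlWithQuery('http://site.com/?replytocom=32'), PHP_EOL;
echo relativeUrlWithQuery('http://site.com/test/?replytocom=32'), PHP_EOL;
```

With this sketch, the second and third URLs yield "/?replytocom=32" and "/test/?replytocom=32". Whether the validator's rule matching then handles the "*" wildcards as expected would still need to be checked against the library itself.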