bopoda / robots-txt-parser

PHP class to parse all directives from robots.txt files according to the specifications
http://robots.jeka.by
MIT License

Directive with '+' not handled correctly #48

Closed LeMoussel closed 7 years ago

LeMoussel commented 7 years ago

Rules are misdetected when a '+' appears in the directive.

$robotsTxtContentIssue = "
User-agent: *
Disallow: *telecommande++*
";

$parserRobotsTxt = RobotsDotText::withContent($robotsTxtContentIssue);
$rulesRobotsTxt = $parserRobotsTxt->getRules();
$robotsTxtValidator = new RobotsTxtValidator($rulesRobotsTxt);

if ($robotsTxtValidator->isUrlAllow('http://www.test.com/telecommandes-box-decodeur.html')) {
        echo "Allow.".PHP_EOL;
} else {
        echo "Disallow.".PHP_EOL;
}

Result :

Disallow

Should be :

Allow

Google Search Console - robots.txt Tester: (screenshot)

Per Google's robots.txt documentation, the + has no special meaning, but * does (it matches any sequence of characters). So crawling of /telecommandes-box-decodeur.html should still be allowed. What would be disallowed is, for example, crawling of /foo/telecommande++bar.html.

Maybe one solution would be to escape the special regular-expression characters other than * with a backslash (*telecommande++* => *telecommande\+\+*).
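As a minimal sketch of that idea (the helper name is hypothetical, not part of robots-txt-parser's API), one can escape everything with preg_quote() and then restore * as a wildcard:

```php
<?php
// Hypothetical helper: convert a robots.txt path pattern into a PCRE regex.
// preg_quote() escapes +, ?, ., etc.; only '*' is then turned back into '.*'.
function robotsPatternToRegex(string $pattern): string
{
    $escaped = preg_quote($pattern, '/');          // '*telecommande++*' => '\*telecommande\+\+\*'
    $escaped = str_replace('\*', '.*', $escaped);  // restore '*' as "any sequence of characters"
    return '/^' . $escaped . '/';                  // robots.txt rules match from the start of the path
}

$regex = robotsPatternToRegex('*telecommande++*');

// No literal '++' in this path, so it does not match the rule: allowed
var_dump(preg_match($regex, '/telecommandes-box-decodeur.html')); // int(0)

// This path contains the literal 'telecommande++': disallowed
var_dump(preg_match($regex, '/foo/telecommande++bar.html'));      // int(1)
```

With this conversion the '+' characters are matched literally, which reproduces the behaviour of Google's tester shown above.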

bopoda commented 7 years ago

Hi @LeMoussel, this should be fixed by this line. Please write if anything else comes up.