bopoda / robots-txt-parser

PHP class to parse all directives from robots.txt files according to the specification
http://robots.jeka.by
MIT License

Parsing rules is platform dependent #68

Open Emosewaj opened 3 months ago

Emosewaj commented 3 months ago

Attempting to parse rules that use \n as a line separator fails on a Windows machine, because the PHP_EOL constant used for splitting is \r\n on Windows; the result is a single user-agent directive containing the full sitemap content.

This is caused by RobotsTxtParser->prepareRules() on line 148:

    /**
     * Parse rules
     *
     * @return void
     */
    private function prepareRules()
    {
        $rows = explode(PHP_EOL, $this->content); // issue

        foreach ($rows as $row) {
            // ...
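
For reference, a minimal reproduction of the symptom (the file content and the expected counts below are illustrative, not taken from an actual robots.txt):

<?php
// Hypothetical robots.txt content using "\n" line endings, as most servers emit.
$content = "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap.xml";

// Mirrors the explode() call in prepareRules().
$rows = explode(PHP_EOL, $content);

// On Windows PHP_EOL is "\r\n", which never occurs in $content, so the whole
// file comes back as one row; on Linux/macOS it splits into three rows.
var_dump(count($rows)); // int(1) on Windows, int(3) on Linux/macOS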

Some form of line separator detection and normalisation should be used instead.

As a workaround, normalising the line endings with the following regex before handing the robots.txt content to RobotsTxtParser works:

$content = preg_replace("/\R/u", PHP_EOL, $content);
$parser = new RobotsTxtParser($content);

This regex replaces every Unicode newline with the system newline, which is what the parser currently splits on. The same normalisation could also be applied inside the parser to fix the issue, as sketched below.
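
A possible shape for that in-parser fix (a sketch only, not the project's actual code; splitting on \R is just one way to normalise the line separators):

    /**
     * Parse rules (sketch of a possible fix)
     *
     * @return void
     */
    private function prepareRules()
    {
        // Split on any Unicode newline sequence (\r\n, \n, \r, ...) instead of
        // the platform-dependent PHP_EOL constant.
        $rows = preg_split("/\R/u", $this->content);

        foreach ($rows as $row) {
            // ... existing per-row handling stays unchanged
        }
    }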