bopoda / robots-txt-parser

PHP class for parsing all directives from robots.txt files according to the specifications
http://robots.jeka.by
MIT License

Rules not handled correctly #27

Closed LeMoussel closed 7 years ago

LeMoussel commented 7 years ago

Parsing fails for grouped User-agent rules.

$robotsTxtContentTest = "
User-agent: SEOkicks-Robot
User-agent: Spiderlytics
User-agent: ShopWiki
User-agent: oBot/2.3.1
    Crawl-delay: 60
User-agent: 008
User-agent: Accoona
User-agent: aipbot
User-agent: aipbot*
    disallow: /
";

$parserRobotsTxt = RobotsDotText::withContent($robotsTxtContentTest);
$rulesRobotsTxt = $parserRobotsTxt->getRules();
$robotsTxtValidator = new RobotsTxtValidator($rulesRobotsTxt);

foreach($rulesRobotsTxt as $key => $value) {
    if ($robotsTxtValidator->isUrlAllow('/', $key)) {
        // Good UA
        echo "Good UA: $key".PHP_EOL;
    } else {
        // Bad UA => UA with "disallow: /" directive
        echo "Bad UA: $key".PHP_EOL;
    }
}
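
(Note that getRules() returns an array keyed by lowercased user-agent, which is why the loop can pass $key straight to isUrlAllow(); the lowercased names also show up in the output below.)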

Result:

Good UA: seokicks-robot
Good UA: spiderlytics
Good UA: shopwiki
Good UA: obot/2.3.1
Good UA: 008
Good UA: accoona
Good UA: aipbot
Bad UA: aipbot*

Should be:

Good UA: seokicks-robot
Good UA: spiderlytics
Good UA: shopwiki
Good UA: obot/2.3.1
Bad UA: 008
Bad UA: accoona
Bad UA: aipbot
Bad UA: aipbot*
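
(Per the robots.txt specification, consecutive User-agent lines form a single group that shares the rule block which follows, so all four agents declared immediately before the disallow: / line should be disallowed.)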

bopoda commented 7 years ago

Hi @LeMoussel, are you sure that the "Should be" output is the expected behavior? It looks like these are user-agents with empty rules:

User-agent: 008
User-agent: Accoona
User-agent: aipbot

Do you have a link to documentation on how it should work?

LeMoussel commented 7 years ago

Yes. I checked this in the Google Search Console. For this example, Googlebot, Accoona, aipbot & aipbot* are all disallowed for the URL '/'.

User-agent: Googlebot
User-agent: Accoona
User-agent: aipbot
User-agent: aipbot*
    disallow: /

(Screenshot of the Google Search Console robots.txt tester.)

bopoda commented 7 years ago

Indeed, thanks. Then it is a bug which should be fixed.

LeMoussel commented 7 years ago

This code fixes the bug:

$parserRobotsTxt = RobotsDotText::withContent($robotsTxtContentTest);
$rulesRobotsTxt = $parserRobotsTxt->getRules();

// New ///////////////////////////////////////////////////////////////////////////
// Walk the user-agents in reverse file order so that a user-agent declared
// with an empty rule set inherits the rule block of the group that follows it.
$UA = array_keys(array_reverse($rulesRobotsTxt));
for ($i = 1, $n = count($UA); $i < $n; $i++) {
    if (empty($rulesRobotsTxt[$UA[$i]])) {
        $rulesRobotsTxt[$UA[$i]] = &$rulesRobotsTxt[$UA[$i-1]];
    }
}
////////////////////////////////////////////////////////////////////////////////////

$robotsTxtValidator = new RobotsTxtValidator($rulesRobotsTxt);
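
With this normalization in place, the loop from the first comment should report the last four user-agents as disallowed. A quick sanity check, assuming the same API and lowercased keys as in the snippet above:

var_dump($robotsTxtValidator->isUrlAllow('/', 'obot/2.3.1')); // expected: bool(true), only Crawl-delay applies
var_dump($robotsTxtValidator->isUrlAllow('/', '008'));        // expected: bool(false), inherits disallow: /
var_dump($robotsTxtValidator->isUrlAllow('/', 'aipbot'));     // expected: bool(false), inherits disallow: /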

I didn't open a PR because I don't know how to do that.

LeMoussel commented 7 years ago

I think I have found a solution. The code needs to go in prepareRules(), like this:

    /**
     * Parse rules
     *
     * @return void
     */
    public function prepareRules()
    {
        $contentLength = mb_strlen($this->content);
        while ($this->char_index <= $contentLength) {
            $this->step();
        }

        foreach ($this->rules as $userAgent => $directive) {
            foreach ($directive as $directiveName => $directiveValue) {
                if (is_array($directiveValue)) {
                    $this->rules[$userAgent][$directiveName] = array_values(array_unique($directiveValue));
                }
            }
        }

        // New ///////////////////////////////////////////////////////////////////////////
        // Walk the user-agents in reverse file order so that a user-agent declared
        // with an empty rule set inherits the rule block of the group that follows it.
        $uaRules = array_keys(array_reverse($this->rules));
        for ($i = 1, $n = count($uaRules); $i < $n; $i++) {
            if (empty($this->rules[$uaRules[$i]])) {
                $this->rules[$uaRules[$i]] = &$this->rules[$uaRules[$i-1]];
            }
        }
    }
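
One thing to note about this approach: the =& assignment makes an empty group share the very same rule array as the group it inherits from, so a later in-place modification of one group's rules would affect the others. Assigning by value (a plain =) would copy the rules instead and avoid that coupling.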

If you are OK with this, I can open a PR.

bopoda commented 7 years ago

User-agent: SEOkicks-Robot
User-agent: Spiderlytics
User-agent: ShopWiki
User-agent: oBot/2.3.1
    Crawl-delay: 60

Hi @LeMoussel, as I understand it, RobotsTxtParser should return the same rules for these 4 user-agents, because at the moment RobotsTxtValidator knows nothing about the first 3 user-agents...

So it seems RobotsTxtParser needs to be improved.
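
A minimal sketch of what that improvement could look like (a hypothetical helper, not the library's actual code): consecutive User-agent lines are collected into the current group, and each directive that follows is recorded for every member of the group.

function parseRobotsTxtGroups($content)
{
    $rules = [];
    $group = [];              // user-agents collected for the current group
    $collectingAgents = true; // true while reading consecutive User-agent lines

    foreach (preg_split('/\R/', $content) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (strpos($line, ':') === false) {
            continue; // skip blank or invalid lines
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$collectingAgents) {
                $group = []; // a rule block has ended: start a new group
                $collectingAgents = true;
            }
            $group[] = strtolower($value);
        } elseif ($group !== []) {
            $collectingAgents = false;
            // The directive applies to every user-agent of the current group.
            foreach ($group as $agent) {
                $rules[$agent][$field][] = $value;
            }
        }
    }

    return $rules;
}

For the test content from the first comment, this yields the Crawl-delay rule for the first four user-agents and disallow: / for the last four, matching the Search Console behaviour shown above.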