bopoda / robots-txt-parser

PHP class for parsing all directives from robots.txt files according to the specifications
http://robots.jeka.by
MIT License

Order of precedence for google user agents #46

Closed: kennylajara closed this issue 3 years ago

kennylajara commented 7 years ago

Some crawlers (like Googlebot) answer to multiple user agents, and one of those names is a common name that should also match for the more specific names. This is easier to understand by reading Google's Robots.txt Specifications, section "Order of precedence for user agents".

It would be nice if one could pass an ordered array as the user-agent parameter of the validator in order to replicate that behavior.

I can work on this at some other time if nobody else takes the job. I don't have the time right now.

bopoda commented 7 years ago

Maybe you have an example that works incorrectly? It would be very good to check it. Currently, a URL should be validated against the rules for the given crawler name if rules for that crawler exist, otherwise against the rules for '*'.

kennylajara commented 7 years ago

Well, I have not tested it, but looking at the code I see the function expects a string, so * is the only fallback.

The idea is that (using Google's example) when Googlebot-News checks the robots.txt, it looks for a Googlebot-News group; if that does not exist, it looks for Googlebot; and if that does not exist either, it looks for *. This script, as far as I can see in the code (not tested, and I can't test right now), would jump directly from Googlebot-News to * without stopping at Googlebot.

What I suggest is to open the user-agent parameter to an ordered array in order to replicate Google's behavior.
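
A minimal sketch of that lookup order, not tested against the library. The function name `resolveRules` and the shape of `$rulesByAgent` (rules keyed by lowercase user-agent name) are assumptions made for illustration; they are not part of the robots-txt-parser API.

```php
<?php
/**
 * Hypothetical helper: return the rules of the first user agent in
 * $agentChain that has its own group, falling back to '*'.
 *
 * @param array    $rulesByAgent rules keyed by lowercase user-agent name (assumed shape)
 * @param string[] $agentChain   user agents ordered from most to least specific
 * @return string[]
 */
function resolveRules(array $rulesByAgent, array $agentChain): array
{
    // Always finish the chain with the wildcard group.
    $agentChain[] = '*';

    foreach ($agentChain as $agent) {
        $key = strtolower($agent);
        if (array_key_exists($key, $rulesByAgent)) {
            return $rulesByAgent[$key];
        }
    }

    // No matching group at all: nothing is restricted.
    return [];
}

$rulesByAgent = [
    'googlebot' => ['Disallow: /private/'],
    '*'         => ['Disallow: /'],
];

// The news crawler has no group of its own here, so the Googlebot group applies.
$rules = resolveRules($rulesByAgent, ['Googlebot-News', 'Googlebot']);
print_r($rules);
```

Passing a single-element array such as `['Googlebot-News']` reproduces the current string behavior, where * is the only fallback.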

bopoda commented 7 years ago

I understand what you mean. Yes, you are right.

It seems better to implement this only for Google user agents, not for all of them.

kennylajara commented 7 years ago

I can do it later if you are open to the possibility. I think it wouldn't add more than 5 or 6 lines to the code, and it looks like a really good Google-level feature to me.

What do you think? Would you merge it?

bopoda commented 7 years ago

Yes, that would be great. The only thing I suggest is to include in the PR the simplest PHPUnit test that checks how this functionality works.

kennylajara commented 7 years ago

Lol... OK... I don't really know PHPUnit, but I want to learn, so... Do you know of a simple tutorial or something?

bopoda commented 7 years ago

Feel free to create the PR without a test. You can see other tests in the tests/ directory. Run the tests using:

$ cd /repo
$ phpunit                      # run all tests
$ phpunit tests/HostTest.php   # run a single test