Closed kennylajara closed 3 years ago
Do you maybe have an example that works incorrectly? It would be very good to check it. Currently, rules should be validated against the rules for the given crawler name if rules for that crawler exist, and otherwise against the rules for '*'.
Well, I have not tested it, but looking at the code I see the function expects a string, so * is the only fallback.
The idea is that (using Google's example), when Google-News checks the robots.txt, it looks for Google-News; if that does not exist, it then looks for GoogleBot; and if that does not exist either, it looks for *. This script, as far as I can see in the code (not tested, and I can't right now), would jump directly from Google-News to * without stepping on GoogleBot.
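The lookup order described above can be sketched as follows. This is only a minimal illustration written in Python, not the library's code: `select_group`, the `groups` dictionary, and the sample rules are hypothetical names assumed for the example.

```python
def select_group(groups, candidates):
    """Return the name of the first user-agent group that exists.

    `groups` maps lowercase user-agent names to their rule lists
    (hypothetical structure). `candidates` is ordered from most to
    least specific. The wildcard "*" is always tried last.
    """
    for name in list(candidates) + ["*"]:
        key = name.lower()
        if key in groups:
            return key
    return None  # no matching group, not even "*"


# Hypothetical robots.txt content, already parsed into groups.
groups = {
    "googlebot": ["Disallow: /private/"],
    "*": ["Disallow: /"],
}

# A single name falls straight through to the wildcard.
print(select_group(groups, ["Google-News"]))               # *
# An ordered list tries the common name before the wildcard.
print(select_group(groups, ["Google-News", "GoogleBot"]))  # googlebot
```

With only one name the specific crawler is judged by the `*` rules, while the ordered list lets it pick up the GoogleBot rules first.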
What I suggest is to open the useragent param to replicate Google's behavior.
I understood what you mean. Yes, you are right.
It seems better to implement this only for Google user agents, not for all.
I can do it later if you are open to the possibility. I think it wouldn't add more than 5 or 6 lines to the code, and it looks like a really good Google-level feature to me.
What do you think? Would you merge it?
Yes, that would be great. The only thing I suggest is to include in the PR a simple phpunit test that checks how this functionality works.
Lol... Ok... I don't really know phpunit, but I want to learn, so... Do you know some simple tutorial or something?
Feel free to create the PR without a test. You can see other tests in the tests/ directory. Run the tests using:
$ cd /repo
$ phpunit  # to run all tests
or:
$ phpunit tests/HostTest.php  # to run a single test
Some crawlers (like GoogleBot) may have multiple user agents, where one of the names is a common name that should also match for the other, more specific names (this can be better understood by reading Google's Robots.txt Specifications - Order of precedence for user agents).
It would be nice if one could pass an ordered array in the useragent param of the validator in order to replicate that behavior.
I can work on this at a later time if nobody takes the job. I don't have the time right now.
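One way the useragent param could accept either a single name or an ordered array, while staying backward compatible with the current string input, is sketched below in Python. `normalize_user_agents` is a hypothetical helper assumed for illustration; it is not part of the validator.

```python
def normalize_user_agents(user_agent):
    """Turn a string or an ordered sequence of names into a list of
    unique, lowercase candidates, preserving the given order."""
    # A plain string is treated as a one-element list (backward compatible).
    names = [user_agent] if isinstance(user_agent, str) else list(user_agent)
    seen = set()
    ordered = []
    for name in names:
        key = name.strip().lower()
        if key and key not in seen:  # skip blanks and duplicates
            seen.add(key)
            ordered.append(key)
    return ordered


print(normalize_user_agents("GoogleBot"))                   # ['googlebot']
print(normalize_user_agents(["Google-News", "GoogleBot"]))  # ['google-news', 'googlebot']
```

Existing callers that pass a string would keep working, and callers that need Google's precedence can list the specific name first and the common name after it.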