[Open] yarikoptic opened this issue 6 years ago
@yarikoptic I believe it's supposed to be User-Agent: *
rather than Agent: *
As for .git directories, that should be easy with a wildcard. Untested, but something akin to:
Disallow: /*/.git/
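Put together (and equally untested), a minimal robots.txt combining the corrected directive with the wildcard rule might look like this, assuming the crawler supports Google-style wildcards:

```
User-agent: *
Disallow: /.git/
Disallow: /*/.git/
```

The explicit /.git/ line covers a repository at the site root, which the wildcard rule alone would not match.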
Whenever I looked before, I could not find clarity, e.g. from https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_%22*%22_match:

Universal "*" match
The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement. Some crawlers like Googlebot recognize strings containing "*", while MSNbot and Teoma interpret it in different ways.
but even there it is not clear how "*" is interpreted, e.g. we have .git
directories across a number of levels. I guess I could add
Disallow: /.git/
Disallow: /*/.git/
Disallow: /*/*/.git/
Disallow: /*/*/*/.git/
Disallow: /*/*/*/*/.git/
Disallow: /*/*/*/*/*/.git/
Disallow: /*/*/*/*/*/*/.git/
to cover at least some of the levels.
THANKS ;)
Yeah, there isn't clarity; it's more of a living standard. Bots don't /have/ to follow your rules, they just should. And if they don't, you can ban them.
Wildcard support looks to be common, and it globs across directories, so you shouldn't need a glob per level. Perhaps the per-level rules will help some less sophisticated bots, though.
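To sanity-check the claim that a wildcard globs across directory levels, here is a small sketch (my own illustration, not part of any robots.txt library) that implements Google-style "*" matching and tests the single rule /*/.git/ against paths at several depths:

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern":
    """Translate a Google-style robots.txt path rule into a regex.

    '*' matches any sequence of characters (including '/'),
    '$' anchors the end of the path; everything else is literal.
    """
    parts = []
    for ch in rule:
        if ch == "*":
            parts.append(".*")
        elif ch == "$":
            parts.append("$")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def is_disallowed(path: str, rule: str) -> bool:
    # robots.txt rules match from the start of the URL path
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    rule = "/*/.git/"
    # One wildcard rule covers .git/ at any nesting depth >= 1 ...
    for path in ("/repo/.git/config", "/a/b/c/d/e/.git/HEAD"):
        assert is_disallowed(path, rule)
    # ... but a top-level /.git/ still needs its own explicit rule.
    assert not is_disallowed("/.git/config", rule)
    assert is_disallowed("/.git/config", "/.git/")
```

Under this interpretation a single /*/.git/ rule makes the per-level list above redundant for wildcard-aware crawlers, with /.git/ kept for the root case.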
Last line in the apache log file:
and robots.txt is accessed by Google bots:
@aqw - have a clue what is going on?
The overall goal is to forbid bots from crawling .git/ directories, but I have found no way to achieve that.