datalad / datasets.datalad.org

Registry of public datasets provided by the DataLad project
http://datasets.datalad.org

Googlebot ignores robots.txt? #20

Open yarikoptic opened 6 years ago

yarikoptic commented 6 years ago

Last line in the Apache log file:

66.249.65.212 - - [20/Sep/2018:12:29:58 -0400] "GET /crcns/.git/objects/29/f8a0ae8c2ad4e7534b12f3cb68b9e8247b1933 HTTP/1.1" 200 1745 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

$> cat robots.txt 
Agent: *
Disallow: /abide
Disallow: /abide2
Disallow: /adhd200
Disallow: /allen-brain-observatory
Disallow: /balsa
Disallow: /corr
Disallow: /crcns
Disallow: /datapackage.json
Disallow: /dbic
Disallow: /devel
Disallow: /dicoms
Disallow: /.git
Disallow: /.gitattributes
Disallow: /.gitmodules
Disallow: /hbnssi
Disallow: /index.html
Disallow: /indi
Disallow: /kaggle
Disallow: /labs
Disallow: /neurovault
Disallow: /nidm
Disallow: /openfmri
Disallow: /singularity
Disallow: /workshops

$> whois 66.249.65.212

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# https://www.arin.net/resources/whois_reporting/index.html
#
# Copyright 1997-2018, American Registry for Internet Numbers, Ltd.
#

NetRange:       66.249.64.0 - 66.249.95.255
CIDR:           66.249.64.0/19
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:         NET66 (NET-66-0-0-0-0)
...

and robots.txt is accessed by google bots:

$> grep robots.txt datasets.datalad.org-access-comb.log | grep Google
66.249.79.206 - - [18/Sep/2018:05:31:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [19/Sep/2018:05:34:02 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.97 - - [19/Sep/2018:18:08:17 -0400] "GET /robots.txt HTTP/1.1" 200 4030 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [20/Sep/2018:05:36:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

@aqw - do you have a clue what is going on?

The overall goal is to forbid bots from crawling .git/ directories, but I have found no way to do that.

aqw commented 6 years ago

@yarikoptic I believe it's supposed to be User-Agent: * rather than Agent: *

aqw commented 6 years ago

As for .git directories, that should be easy with a wildcard. Untested, but something akin to:

Disallow: /*/.git/
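
Also untested, but combining that with the User-Agent fix above (and assuming Googlebot does handle the "*" wildcard), a minimal sketch of the corrected file could start with:

User-agent: *
Disallow: /.git
Disallow: /*/.git/
Disallow: /abide
Disallow: /abide2
(... remaining Disallow entries unchanged ...)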
yarikoptic commented 6 years ago

@yarikoptic I believe it's supposed to be User-Agent: * rather than Agent: *

(bill-murray-banging-head GIF)

yarikoptic commented 6 years ago

As for .git directories, that should be easy with a wildcard. Untested, but something akin to:

Disallow: /*/.git/

Whenever I looked before, I could not find clarity, e.g. from https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_%22*%22_match:

Universal "*" match
The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement. Some crawlers like Googlebot recognize strings containing "*", while MSNbot and Teoma interpret it in different ways.

but even there it is not clear how it would be recognized. E.g., we have .git at a number of levels. I guess I could add

Disallow: /.git/
Disallow: /*/.git/
Disallow: /*/*/.git/
Disallow: /*/*/*/.git/
Disallow: /*/*/*/*/.git/
Disallow: /*/*/*/*/*/.git/
Disallow: /*/*/*/*/*/*/.git/

to cover at least some levels.

THANKS ;)

aqw commented 6 years ago

Yeah, there isn't much clarity; it's more of a living standard. Bots don't /have/ to follow your rules. They just should. And if they don't, you can ban them.

Wildcard support looks to be common, and the wildcard globs across directory levels, so you shouldn't need a rule per level. Perhaps the extra rules will help some less sophisticated bots, though.
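
If it helps, an untested way to sanity-check after deploying the corrected robots.txt (assuming the same combined access log shown above) is to watch whether Googlebot keeps fetching anything under .git/:

$> grep Googlebot datasets.datalad.org-access-comb.log | grep '/\.git/' | tail

New matches should stop appearing once Googlebot re-fetches robots.txt; it only re-reads the file periodically, so that may take a while.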