datalad / datasets.datalad.org

Registry of public datasets provided by the DataLad project
http://datasets.datalad.org

Googlebot ignores robots.txt? #20

Open yarikoptic opened 6 years ago

yarikoptic commented 6 years ago

Last line in the Apache log file:

66.249.65.212 - - [20/Sep/2018:12:29:58 -0400] "GET /crcns/.git/objects/29/f8a0ae8c2ad4e7534b12f3cb68b9e8247b1933 HTTP/1.1" 200 1745 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

$> cat robots.txt 
Agent: *
Disallow: /abide
Disallow: /abide2
Disallow: /adhd200
Disallow: /allen-brain-observatory
Disallow: /balsa
Disallow: /corr
Disallow: /crcns
Disallow: /datapackage.json
Disallow: /dbic
Disallow: /devel
Disallow: /dicoms
Disallow: /.git
Disallow: /.gitattributes
Disallow: /.gitmodules
Disallow: /hbnssi
Disallow: /index.html
Disallow: /indi
Disallow: /kaggle
Disallow: /labs
Disallow: /neurovault
Disallow: /nidm
Disallow: /openfmri
Disallow: /singularity
Disallow: /workshops

$> whois 66.249.65.212

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# https://www.arin.net/resources/whois_reporting/index.html
#
# Copyright 1997-2018, American Registry for Internet Numbers, Ltd.
#

NetRange:       66.249.64.0 - 66.249.95.255
CIDR:           66.249.64.0/19
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:         NET66 (NET-66-0-0-0-0)
...

and robots.txt is accessed by google bots:

$> grep robots.txt datasets.datalad.org-access-comb.log | grep Google
66.249.79.206 - - [18/Sep/2018:05:31:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [19/Sep/2018:05:34:02 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.97 - - [19/Sep/2018:18:08:17 -0400] "GET /robots.txt HTTP/1.1" 200 4030 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [20/Sep/2018:05:36:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

@aqw - do you have a clue what is going on?

The overall goal is to forbid bots from crawling .git/ directories, but I have found no way to do that.

aqw commented 6 years ago

@yarikoptic I believe it's supposed to be User-Agent: * rather than Agent: *

aqw commented 6 years ago

As for .git directories, that should be easy with a wildcard. Untested, but something akin to:

Disallow: /*/.git/
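
Also untested, but combining that with the User-Agent fix above (and assuming Googlebot does handle the "*" wildcard), a minimal sketch of the corrected file could start with:

User-agent: *
Disallow: /.git
Disallow: /*/.git/
Disallow: /abide
Disallow: /abide2
(... remaining Disallow entries unchanged ...)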
yarikoptic commented 6 years ago

@yarikoptic I believe it's supposed to be User-Agent: * rather than Agent: *

(bill-murray-banging-head GIF)

yarikoptic commented 6 years ago

As for .git directories, that should be easy with a wildcard. Untested, but something akin to:

Disallow: /*/.git/

Whenever I looked before, I could not find clarity, e.g. from https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_%22*%22_match:

Universal "*" match
The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement. Some crawlers like Googlebot recognize strings containing "*", while MSNbot and Teoma interpret it in different ways.

but even there it is not clear how it would be recognized. E.g., we have .git at a number of levels. I guess I could add

Disallow: /.git/
Disallow: /*/.git/
Disallow: /*/*/.git/
Disallow: /*/*/*/.git/
Disallow: /*/*/*/*/.git/
Disallow: /*/*/*/*/*/.git/
Disallow: /*/*/*/*/*/*/.git/

to cover at least some levels.

THANKS ;)

aqw commented 6 years ago

Yeah, there isn't much clarity; it's more of a living standard. Bots don't /have/ to follow your rules. They just should. And if they don't, you can ban them.

Wildcard support looks to be common, and the wildcard globs across directory levels, so you shouldn't need a rule per level. Perhaps the extra rules will help some less sophisticated bots, though.
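
If it helps, an untested way to sanity-check after deploying the corrected robots.txt (assuming the same combined access log shown above) is to watch whether Googlebot keeps fetching anything under .git/:

$> grep Googlebot datasets.datalad.org-access-comb.log | grep '/\.git/' | tail

New matches should stop appearing once Googlebot re-fetches robots.txt; it only re-reads the file periodically, so that may take a while.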