fandrei / AppMetrics

Apache License 2.0

Prevent analytics getting indexed by Google #55

Closed: mrdavidlaing closed this issue 12 years ago

mrdavidlaing commented 12 years ago

Some of the analytics reports have crept into the Google index; see the following search:

site:analytics.metrics.labs.cityindex.com

which returned 2 results on 4 April 2012:

http://analytics.metrics.labs.cityindex.com/GetReport.ashx?Application=CiapiLatencyCollector&Period=1.0
http://analytics.metrics.labs.cityindex.com/GetReport.ashx?Application=CIAPI.CS.Excel

We need an appropriate robots.txt to prevent any indexing from happening.

ryanholder commented 12 years ago

Took the following from a site on robots.txt:

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
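As a rough sketch of how a well-behaved robot would read those two lines, the rules can be fed to Python's standard urllib.robotparser module and checked against one of the report URLs from this issue. The helper name is_allowed and the user-agent strings are only illustrative, and this assumes the file would be served from the root of analytics.metrics.labs.cityindex.com.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above: apply to every robot, disallow every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# One of the report URLs that showed up in the search results.
REPORT_URL = (
    "http://analytics.metrics.labs.cityindex.com/"
    "GetReport.ashx?Application=CIAPI.CS.Excel"
)


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: would a compliant robot fetch `url`?"""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative user-agent names; any name matches "User-agent: *".
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, REPORT_URL))
```

This only models robots that choose to honour the file; it says nothing about robots that ignore it.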

There are two important considerations when using /robots.txt:

Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.

The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More info is available here: http://en.wikipedia.org/wiki/Robots_exclusion_standard