TheProjecter / pacific-aikido

Automatically exported from code.google.com/p/pacific-aikido

stop search engines from indexing test sites #73


GoogleCodeExporter commented 9 years ago
taku noticed that schmolli-test.pacific-aikido.org was showing up in search 
results.  not sure how that happened, but we are certainly not protecting 
against it.  i put in a stopgap for now and checked for the other test sites.

two ideas to handle this:
1. add htaccess/htpasswd files to require (easy) passwords to see the site.  
this will work on everything (a sketch follows this list).
2. make custom robots.txt files for test sites that disallow robots from 
crawling them at all.  this will only work on well-behaved crawlers, but that 
should include all of the ones we care about (google, yahoo, bing).
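
for illustration, option 1 could be as simple as dropping a .htaccess like 
this into each test site root (the realm name and htpasswd path below are 
made up):

    AuthType Basic
    AuthName "pacific-aikido test site"
    AuthUserFile /path/to/.htpasswd
    Require valid-user

and creating the password file once with apache's htpasswd tool:

    htpasswd -c /path/to/.htpasswd someuser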

both of these options have issues, the main one being that we already have 
htaccess and robots.txt files for the main site, so we are talking about making 
split configurations now, which we have always wanted to avoid.  the only way i 
can see to avoid it is to

3. push all of the non-indexable content (just /cgi-bin/) off into a new site 
(register.pacific-aikido.org) protected by a robots.txt containing a global 
disallow.  this would allow us to remove the robots.txt file from the main 
site, add "robots.txt" to the svn:ignore property, and then put whatever we 
want into the test site robots.txt files.

option 3 is the cleanest from the robots perspective, but breaks as soon as we 
add anything else to the main site that needs to not be indexed.  i am drawing 
a blank on use cases that will cause a problem, though, so this looks like our 
best option.

Original issue reported on code.google.com by schmo...@frozencrow.org on 20 Jun 2011 at 5:34

GoogleCodeExporter commented 9 years ago
i found a problem with option 3, which is that it makes it way harder to do any 
development on the registration part of the site.  i am going to fall back to 
option 2.  the plan is to
1. remove the existing robots.txt from the repository and set the svn:ignore 
property to make sure we never re-add it.
2. make update-all.sh write out (the current) robots.txt into the prod 
directory.
3. make site-ops write out a fully restrictive robots.txt into all test site 
directories (a rough sketch follows this list).
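
to make steps 2 and 3 concrete, here is a rough sketch of what the scripts 
might write.  PROD_DIR and TEST_ROOT are made-up placeholders, the *-test 
directory pattern is guessed from schmolli-test, and the prod rules assume 
/cgi-bin/ is still the only thing we need to block:

    #!/bin/sh
    # step 2 (update-all.sh): write the current production robots.txt into prod
    printf 'User-agent: *\nDisallow: /cgi-bin/\n' > "$PROD_DIR/robots.txt"

    # step 3 (site-ops): write a fully restrictive robots.txt per test site
    for dir in "$TEST_ROOT"/*-test.pacific-aikido.org; do
        printf 'User-agent: *\nDisallow: /\n' > "$dir/robots.txt"
    done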

ugly, but no worse than anything else we could do.

Original comment by schmo...@frozencrow.org on 23 Jun 2011 at 5:49

GoogleCodeExporter commented 9 years ago
Ed,
I have a feeling I misunderstand something about robots.txt and svn:ignore. 
Have you considered the following idea?
In every non-production account, put a robots.txt that disallows everything:
    User-agent: *
    Disallow: /
And set svn:ignore for robots.txt so it doesn't get checked into SVN and make 
it into the production account.
Will that work? Did I miss something?

Original comment by victor.l...@gmail.com on 23 Jun 2011 at 3:41

GoogleCodeExporter commented 9 years ago
yes, that is mostly what i am going to do.  we also have to remove 
robots.txt from the repo and make update-all.sh keep it up to date, because 
svn:ignore only affects unversioned files (add and status); it does nothing 
to stop changes to a file that is already checked in.
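
the removal itself is just a couple of svn commands, run from whatever 
directory holds robots.txt:

    # stop tracking robots.txt but keep the generated copy on disk
    svn rm --keep-local robots.txt
    # ignore any future unversioned robots.txt in this directory
    svn propset svn:ignore 'robots.txt' .
    svn commit -m 'stop tracking robots.txt; prod and test copies are generated'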

Original comment by schmo...@frozencrow.org on 23 Jun 2011 at 4:56