OpenTechStrategies / lisc-ttm

LISC TTM code. See https://ttm.lisc-chicago.org/.

GNU Affero General Public License v3.0

Make a robots.txt file, to block cooperative crawlers. #28

Closed by kfogel 9 years ago

kfogel commented 10 years ago

We don't have a robots.txt file yet. I suspect that's why we see entries like these in /var/log/apache2/ttm_error.log:

[Fri Jun 27 20:26:14 2014] [error] [client 75.168.181.198] File does not exist: /var/www/ttm/EWS
[Fri Jun 27 20:26:14 2014] [error] [client 75.168.181.198] File does not exist: /var/www/ttm/EWS
[Fri Jun 27 20:26:14 2014] [error] [client 75.168.181.198] File does not exist: /var/www/ttm/EWS
[Fri Jun 27 20:26:15 2014] [error] [client 75.168.181.198] File does not exist: /var/www/ttm/EWS
[Fri Jun 27 20:26:25 2014] [error] [client 75.168.181.198] File does not exist: /var/www/ttm/EWS
[Fri Jun 27 20:26:55 2014] [error] [client 24.52.219.245] File does not exist: /var/www/ttm/autodiscover
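To see which missing paths are being requested, and by which clients, an excerpt like the one above could be summarized with a short script. This is only a sketch: the function name `summarize_missing` is hypothetical, and the regular expression assumes the error-log line format shown in the excerpt.

```python
import re
from collections import Counter

# Assumed format, taken from the excerpt above:
# [timestamp] [error] [client IP] File does not exist: /path
LINE_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[error\] \[client (?P<client>[^\]]+)\] "
    r"File does not exist: (?P<path>\S+)$"
)


def summarize_missing(lines):
    """Count 'File does not exist' errors per (client, path) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("client"), match.group("path"))] += 1
    return counts


if __name__ == "__main__":
    # Path copied from the comment above; adjust for your own setup.
    with open("/var/log/apache2/ttm_error.log", encoding="utf-8", errors="replace") as log:
        for (client, path), n in summarize_missing(log).most_common():
            print(f"{n:4d}  {client}  {path}")
```

Grouping by client and path makes it easier to tell a single client retrying one URL from broad crawling across the site.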

We should have a standard robots.txt, obviously.

cecilia-donnelly commented 10 years ago

See this article about the Robots exclusion standard: http://en.wikipedia.org/wiki/Robots_exclusion_standard.

Essentially, the robots.txt file tells web robots/crawlers which pages or directories they should not fetch. It is purely advisory: cooperative crawlers honor it, but nothing enforces it.
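As an illustration of how a cooperative crawler interprets these rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rule sets are generic examples, not the file that was committed for this issue (which is not shown in this thread); the `/private/` directory, the `ExampleBot` user agent, and the helper `is_allowed` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example policy 1: ask every cooperative crawler to stay out entirely.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Example policy 2: ask crawlers to skip one (hypothetical) directory only.
BLOCK_ONE_DIR = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if a crawler that honors robots_txt may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for name, policy in (("block all", BLOCK_ALL), ("block one dir", BLOCK_ONE_DIR)):
        for path in ("/index.html", "/private/report.html"):
            print(name, path, is_allowed(policy, "ExampleBot", path))
```

A crawler that ignores the file is unaffected by either policy, which is the sense in which the mechanism is advisory.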

kfogel commented 10 years ago

I looked at the commit. We need the text for the robots.txt file itself, though :-). That is, the fix to this issue is actually creating a robots.txt in the top level of the TTM tree, and making sure it gets served correctly when "/robots.txt" is requested by a client.
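One way to confirm that "/robots.txt" is being served is a small check script along these lines. It is a sketch under stated assumptions: the criteria for "served correctly" (HTTP 200, a `text/plain` content type, at least one `User-agent` line) are my guess at a reasonable definition, and the names `check_robots_response` and `fetch_and_check` are hypothetical.

```python
import urllib.error
import urllib.request


def check_robots_response(status, content_type, body):
    """Return a list of problems found; an empty list means the response looks fine."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type!r}")
    if "user-agent" not in body.lower():
        problems.append("body contains no User-agent line")
    return problems


def fetch_and_check(base_url):
    """Request <base_url>/robots.txt and run the checks above on the response."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
            content_type = resp.headers.get("Content-Type", "")
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report them as a status.
        status = err.code
        content_type = err.headers.get("Content-Type", "") if err.headers else ""
        body = ""
    return check_robots_response(status, content_type, body)
```

Calling `fetch_and_check` with the site's base URL after a deploy would return an empty list if all three checks pass.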

MegFord commented 10 years ago

Sorry! The first commit on that branch adds the file. I split it into two commits because I wanted to edit the markup for the install file on GitHub, so the second commit is the edits for the INSTALL.md file :)

kfogel commented 10 years ago

Oh! Thanks.

The commit messages should reference the relevant issue number(s) -- that's what threw me :-). (That's a pretty important general principle everywhere: the bidirectional link between commit and issue is key for making things reviewable, and for forensic analysis when necessary.)
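As a sketch of how that convention could be checked mechanically, the function below pulls `#N`-style issue references out of a commit message. The name `referenced_issues` is hypothetical, and the `#N` pattern is an assumption based on how GitHub links issues; a project could call something like this from a commit-msg hook or a review script.

```python
import re

# "#" followed by digits, not preceded by a word character
# (so "abc#12" is ignored but "issue #28" and "(#28)" match).
ISSUE_REF_RE = re.compile(r"(?<!\w)#(\d+)")


def referenced_issues(message):
    """Return the issue numbers mentioned in a commit message, in order."""
    return [int(num) for num in ISSUE_REF_RE.findall(message)]


def has_issue_reference(message):
    """Return True if the commit message mentions at least one issue."""
    return bool(referenced_issues(message))
```

For example, `referenced_issues("Add robots.txt. Fixes #28.")` yields the issue number for this thread, while a message with no reference yields an empty list and could be flagged for the author.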

cecilia-donnelly commented 9 years ago

Meg, would you like to merge this to master or have @kfogel do it? Either way is fine.

cecilia-donnelly commented 9 years ago

Pulled this to production today and it seems to be working fine. I did not make the changes to the Apache config file recommended in INSTALL.md.