bellroy / predictionbook-legacy

PredictionBook is a site to record and track your predictions.
http://predictionbook.com
MIT License

Narrow site results in Google #22

Open · gwern opened this issue 12 years ago

gwern commented 12 years ago

I was quickly pasting a link summarizing existing predictions for a fine point in Methods of Rationality (https://encrypted.google.com/search?num=100&q=hat%20and%20cloak%20site%3Apredictionbook.com) and I noticed, not for the first time, that the actual prediction was being drowned out by the user pages.

Inasmuch as you can get to any relevant user page from the actual prediction, I think they're noise.

Fortunately, there's a very easy solution: we can just exclude /users/ in robots.txt. However, in keeping with my usual long-term interest, I still want the Internet Archive to have access; I think we can combine the two like so (I wrote it up based on the Wikipedia info and http://www.archive.org/about/exclude.php, and then found a similar question at http://www.archive.org/post/234741/robotstxt ):

User-agent: ia_archiver
Allow: /users/

User-agent: *
Disallow: /users/
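
This works because a crawler obeys only the most specific User-agent group that matches it, so ia_archiver follows its own Allow rule while every other bot falls through to the wildcard Disallow. A quick sanity check, as a sketch using Python's standard-library robots.txt parser with the rules pasted inline (the /users/example path is just an illustration):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: ia_archiver
Allow: /users/

User-agent: *
Disallow: /users/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Ordinary crawlers match the wildcard group, so user pages are blocked.
print(rp.can_fetch("Googlebot", "http://predictionbook.com/users/example"))   # False
# ia_archiver matches its own group and is allowed through.
print(rp.can_fetch("ia_archiver", "http://predictionbook.com/users/example")) # True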

The downside of this approach is that someone searching for a particular user wouldn't see their user page pop up, but rather every prediction they've been involved in. This isn't a disadvantage for me, but it may be for others.

A less invasive but also less useful approach would be to augment the built-in Google site search with -site:predictionbook.com/users/. (Not as useful for me, since I rarely launch my PB site searches from that dialogue, but rather from outside Firefox entirely.)

Anyway, I'm going to patch robots.txt as suggested above.

ivankozik commented 12 years ago

I think I would prefer the less-invasive approach, given that robots.txt is kind of a nuclear option that affects all crawlers (except for the one you exempt, of course).

This method of exclusion works:

hat and cloak site:predictionbook.com -inurl:predictionbook.com/users/

And you can make this useful for yourself too by changing your bookmark keyword search / Chrome search, no?
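
For example, a hypothetical keyword-bookmark or custom-search-engine URL along these lines, where the browser substitutes %s with whatever you type:

https://encrypted.google.com/search?num=100&q=%s+site:predictionbook.com+-inurl:predictionbook.com/users/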

gwern commented 12 years ago

Well, yes, I know how to exclude the Google hits (I gave a better way above); my point was about defaults. Is there a good reason to expose user pages to search engines when we know for a fact that every prediction will generate N spurious hits where N is the number of users commenting or predicting on it?

ivankozik commented 12 years ago

Neat, I didn't know -site: with a path works too.

User pages are sometimes interesting, and given how Google is smart enough to rank them below prediction pages, I don't think they're a big problem. robots.txt also affects things like wget and HTTrack, and making them ignore robots.txt is annoying. (And if robots.txt later lists resources that should really be ignored, those robots.txt-ignoring users will be grabbing the really-useless resources.)
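
(For reference, wget obeys robots.txt by default when recursing, so mirroring the site would then require an explicit override along the lines of

wget --mirror -e robots=off http://predictionbook.com/

which is exactly the kind of annoyance I mean.)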

gwern commented 12 years ago

Alright, then what about filtering just for the big 3 or 4 search engines? That would help 99% of users and avoid hitting tools like wget.
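
Something like this sketch, say, using the standard crawler tokens (Slurp is Yahoo's):

User-agent: Googlebot
Disallow: /users/

User-agent: bingbot
Disallow: /users/

User-agent: Slurp
Disallow: /users/

User-agent: *
Disallow:

Everyone else, wget included, falls through to the empty wildcard Disallow and keeps full access.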