benchatt / wub

Automatically exported from code.google.com/p/wub

Bots and server Cache #5

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Wub contains a Cache intended to hold generated content for as long as it
remains fresh; the generator can evict stale content from the Cache.
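
For illustration, here is a minimal Tcl sketch of that scheme. The `cache` namespace and its proc names are invented for this note and are not Wub's actual Cache API.

```tcl
package require Tcl 8.5

namespace eval cache {
    variable store [dict create]

    # Store generated content under its URL.
    proc put {url content} {
        variable store
        dict set store $url $content
    }

    # Return cached content, or an empty string on a miss.
    proc get {url} {
        variable store
        if {[dict exists $store $url]} {
            return [dict get $store $url]
        }
        return ""
    }

    # Called by the generator when a page goes stale.
    proc evict {url} {
        variable store
        dict unset store $url
    }
}
```

The point is simply that eviction is explicit and driven by the generator, not by the clients fetching pages.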

This works really well, as long as there's some locality of reference
between page fetches.

Unfortunately, that is precisely what spider bots *don't* do.  They cut a
swathe through the URL space, seeking in effect to sample the freshness of
all available pages.

So bots evict Cache contents willy-nilly.

It is necessary to somehow distinguish bot access from human access, and to
prevent bots from evicting perfectly usable content.  But how to do this?
Enhance the Spider package to recognise *all* spiders, and make the
distinction between bad and useful spiders?  That's a lot of data, all
driven from the User-Agent field.  How else?  Access-pattern recognition?  Hard.
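
For concreteness, a rough Tcl sketch of what User-Agent driven recognition might look like. The pattern list is purely illustrative, and the `is_bot` proc is not part of Wub's Spider package.

```tcl
# Illustrative glob patterns only; a real list would be far longer.
set bot_patterns {
    *Googlebot* *bingbot* *Slurp* *Baiduspider* *YandexBot*
    *crawler* *spider*
}

# Return 1 if the User-Agent string matches any known spider pattern.
proc is_bot {useragent} {
    global bot_patterns
    foreach pattern $bot_patterns {
        if {[string match -nocase $pattern $useragent]} {
            return 1
        }
    }
    return 0
}

puts [is_bot "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"]  ;# prints 1
puts [is_bot "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/60.0"]                    ;# prints 0
```

The maintenance burden is exactly the objection raised above: a pattern list like this has to track every spider in the wild.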

This problem is most evident in a Wiki, where Recent Changes tends to drive
traffic to recently generated, and therefore cached, pages.

Original issue reported on code.google.com by mcc...@gmail.com on 5 Mar 2009 at 11:56

GoogleCodeExporter commented 9 years ago
This has been sort-of addressed by excluding bots.

Original comment by mcc...@gmail.com on 8 Jun 2010 at 8:09
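
One possible shape of that fix, sketched here rather than taken from the actual change: recognised spiders bypass the Cache entirely, so they can neither populate nor evict it. The `respond` and `generate` procs and the request dict layout are placeholders; `is_bot` and `cache::*` refer to the sketches above.

```tcl
# Hypothetical request handler: generate stands in for whatever produces
# page content; is_bot and cache::* are the sketches given earlier.
proc respond {request} {
    set url [dict get $request url]

    # Recognised spiders are served directly and never touch the Cache.
    if {[is_bot [dict get $request user-agent]]} {
        return [generate $url]
    }

    # Everyone else gets the cached copy, populating it on a miss.
    set content [cache::get $url]
    if {$content eq ""} {
        set content [generate $url]
        cache::put $url $content
    }
    return $content
}
```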