esmero / archipelago-deployment

Archipelago Commons Docker Deployment Repository

Explore rate-limiting connections so we do not suffer from bots and learn from old mistakes #77

Open DiegoPino opened 3 years ago

DiegoPino commented 3 years ago

What?

Bots are the worst. They crawl in masses, many ignore robots.txt, they overuse our resources, and they generally force us to restart services just to close open connections (Cantaloupe, Solr, etc.).

For Archipelago (at least the default deployment) we can use NGINX's built-in rate limiting: https://www.nginx.com/blog/rate-limiting-nginx/

This way we can limit certain endpoints we know may suffer from rate-limit issues leading to denial of service.

@giancarlobi @dmer what we want is to have a list of possible pages (patterns) that can suffer from this

e.g.

/user/logic
/search
/cantaloupe/iiif/2
/webform_strawberryfield/
/webform_strawberry/auth_autocomplete/*
/webform_strawberry/nominatim/*
/do/{node}/iiif/{uuid}/full/full/0/{format}

The real challenge is to think in terms of a user here. We want normal operations to work correctly; we do not want to block legitimate access.

Any ideas?

dmer commented 3 years ago

I've had recent experience with bad bots that not only ignore robots.txt but also do a terrible job crawling the site: hitting every single link at high frequency, revisiting pages multiple times, etc. In this case a list of pages to block wouldn't help; perhaps very broad patterns would work.

What I had imagined was the ability to rate-limit all traffic coming into the site. Ideally there would be a configuration page that showed some stats and let the administrator adjust things like the rate beyond which blocking occurs or how long a block stays in place, and create exceptions for known harvesters, etc.
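A site-wide limit with exceptions for known harvesters can be sketched in NGINX using `geo` and `map`, the pattern the linked article suggests for whitelisting. The IP range and rates here are placeholders, and there is no stats/config UI in this sketch, just the server side:

```nginx
# Mark exempt IPs; everyone else is subject to the global limit.
geo $limited {
    default        1;
    203.0.113.0/24 0;  # example range for a known, well-behaved harvester
}

# An empty key means the request is never counted against a zone.
map $limited $limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=global_limit:10m rate=10r/s;

server {
    location / {
        limit_req zone=global_limit burst=20;
        # proxy_pass to the Drupal/Archipelago upstream would go here
    }
}
```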

It might also be useful to allow patterns (mostly thinking about matching the user agent or other header info), so that if a pernicious bot won't go away and keeps coming from different IPs, you can block it by its name.
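Header-based blocking like this is straightforward with an NGINX `map` on `$http_user_agent`; the agent patterns below are made-up examples, not a vetted blocklist:

```nginx
# Flag known-bad user agents (case-insensitive regex matches).
map $http_user_agent $bad_bot {
    default           0;
    ~*badbot-example  1;  # hypothetical bot name
    ~*evil-crawler    1;  # hypothetical bot name
}

server {
    location / {
        # Refuse flagged agents outright, regardless of source IP.
        if ($bad_bot) {
            return 403;
        }
        # normal proxying continues here for everyone else
    }
}
```

Since bots can trivially change their user-agent string, this works best combined with the IP-based rate limits rather than as the only defense.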