drolbr / Overpass-API

A database engine to query the OpenStreetMap data.
http://overpass-api.de
GNU Affero General Public License v3.0

Quota mechanism against app style overuse #657

Open drolbr opened 2 years ago

drolbr commented 2 years ago

To allow fair shared use on the public instances, the software has a couple of quota mechanisms built in. This had been a success insofar as the servers have been effectively available most of the time.

However, a new pattern of overuse has appeared recently: the same query is sent from a hundred or more different IP addresses. To the server this looks like many individual power users, and this pattern consumes so much server time that other users are effectively locked out.

This is a reminder to develop a quota mechanism based on the sent query.
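As a rough sketch of what such a query-based quota might look like (all names and thresholds here are hypothetical, and it is written in Python for brevity rather than the C++ of Overpass itself): normalize the query text, hash it to a fingerprint, and count executions of that fingerprint in a sliding window, regardless of which IP address sent it.

```python
import hashlib
import re
import time
from collections import defaultdict


def fingerprint(query):
    """Strip all whitespace so trivially reformatted copies of the
    same query map to the same fingerprint, then hash the result."""
    normalized = re.sub(r"\s+", "", query.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()


class QueryQuota:
    """Count executions per query fingerprint in a sliding time
    window, independent of the client IP address."""

    def __init__(self, limit=100, window=3600.0):
        self.limit = limit        # max executions per fingerprint per window
        self.window = window      # window length in seconds
        self.seen = defaultdict(list)  # fingerprint -> list of timestamps

    def allow(self, query, now=None):
        """Return True if this query may run now, False if its
        fingerprint has exhausted its quota in the current window."""
        now = time.time() if now is None else now
        key = fingerprint(query)
        # Drop timestamps that have aged out of the window.
        hits = [t for t in self.seen[key] if now - t < self.window]
        if len(hits) >= self.limit:
            self.seen[key] = hits
            return False
        hits.append(now)
        self.seen[key] = hits
        return True
```

With this, 400 clients sending the same faulty query would all feed the same fingerprint counter, while a query unique to one user is unaffected.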

Lee-Carre commented 2 years ago

Relevant?: mapcontrib/mapcontrib#385

drolbr commented 2 years ago

Not really, but thank you for making the connection.

The incident in question was most likely a hotel booking portal that searched all over the planet for hotels not yet known to it, and with a faulty query on top of that. It monopolized the public instance because it spread the requests over 400 different IP addresses of its clients, so each individual client looked pretty harmless to the Overpass server.

Practically all such events have involved queries where it would have been obvious to a human being that the query never worked. In other words, real applications with real users are not the problem; they are only hit by the quota mechanisms because the server cannot perfectly tell benign traffic from malign traffic.

Lee-Carre commented 2 years ago

> Not really [relevant], but thank you for making the connection.

Good to know. Welcome.

Thank you for all your work on Overpass; a wonderful tool that I'm still discovering the depths of. 👍

> a hotel booking portal that searched for not-yet-to-them known hotels all over the planet, and in addition with a faulty query

Yikes!

How careless & inconsiderate. RTFM for the win.

My first thought is to validate a query before executing it. But sometimes we want to find weird entries (there are plenty in my area); otherwise, the only true way to know the utility of a query is to execute it.

Although I'm aware that memory usage is already high, one option might be a little server-side caching (especially of negative/null results to expensive queries, inspired by what some DNS implementations do), so that repeat queries can be answered swiftly without incurring the cost of executing each of them afresh.
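The negative-caching idea could look roughly like this (a sketch only; the TTL, keys, and class names are made up, and DNS negative caching per RFC 2308 is the inspiration, not a spec Overpass follows):

```python
import time


class NegativeCache:
    """Cache the (possibly empty) result of an expensive query for a
    short TTL, so repeated identical queries are answered without
    re-executing them."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl          # seconds a cached entry stays valid
        self.entries = {}       # key -> (result, expiry_timestamp)

    def get(self, key, now=None):
        """Return (True, result) on a fresh hit, (False, None) on a
        miss or an expired entry."""
        now = time.time() if now is None else now
        hit = self.entries.get(key)
        if hit is None or hit[1] < now:
            return (False, None)
        return (True, hit[0])

    def put(self, key, result, now=None):
        now = time.time() if now is None else now
        self.entries[key] = (result, now + self.ttl)
```

Note the `(found, result)` pair: an empty result list is exactly the case worth caching, so a bare `None` return could not distinguish "cached empty answer" from "not cached".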

I also wonder whether there was any pattern to the source addresses of the queries, such as whether they all came from the same AS (if I recall correctly, there's a published mapping between numerical addresses and AS numbers, probably at IANA). I recall reading a paper some years ago (called CIDR House Rules, I think) concerned with tackling junk mail. The punch line was that the vast majority of spam came from only a few ASes.
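Grouping source addresses by AS is mechanically simple once you have a prefix-to-ASN table; the table below is a toy stand-in (documentation prefixes and private-use ASNs), where a real one would be derived from published BGP/routing data:

```python
import ipaddress

# Toy prefix-to-ASN table for illustration only: these are TEST-NET
# prefixes and private-use ASNs. A real mapping would come from
# published routing data (RIR delegation files, BGP table dumps).
PREFIX_TO_ASN = {
    ipaddress.ip_network("203.0.113.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}


def asn_of(addr):
    """Return the ASN whose prefix contains addr, or None if unknown.
    Linear scan is fine for a toy table; real lookups use a radix trie."""
    ip = ipaddress.ip_address(addr)
    for net, asn in PREFIX_TO_ASN.items():
        if ip in net:
            return asn
    return None


def group_by_asn(addrs):
    """Bucket a list of source addresses by originating AS."""
    groups = {}
    for a in addrs:
        groups.setdefault(asn_of(a), []).append(a)
    return groups
```

If the 400 addresses from the hotel-portal incident had collapsed into one or two AS buckets, that alone would have been a strong overuse signal.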

Though, yes, ultimately comparing queries across addresses (regardless of AS) is the better solution.