Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

Logs service can be overwhelmed easily #4953

Closed edongashi closed 3 years ago

edongashi commented 3 years ago

Running a load test for a few minutes results in hundreds of thousands of piled-up messages in the logs service. From my observations it can only process a few hundred per second. For a long-running system this is unacceptable, because the backlog can grow into the millions.

(screenshot: backlog of queued log messages)

What should we do?

edongashi commented 3 years ago

Maybe take only up to X debug messages per second? Under a high load most will be useless anyway...

edongashi commented 3 years ago

Also, the UI shows messages from a few minutes ago until it catches up. If logs arrive at a constant rate, it will never catch up and the gap will only grow...

sebastian commented 3 years ago

Maybe it should just collect the messages and then batch save them once every 500ms or something? Batch inserts are way more efficient.
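A minimal sketch of the buffering idea, as I understand it (module and function names here are hypothetical, not the actual Air codebase; the `Repo.insert_all` call is left as a comment since the real schema isn't shown):

```elixir
defmodule LogBuffer do
  use GenServer

  @flush_interval 500  # ms between batch writes, per the suggestion above

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Fire-and-forget: callers never block on the database.
  def log(entry), do: GenServer.cast(__MODULE__, {:log, entry})

  @impl true
  def init(:ok) do
    schedule_flush()
    {:ok, []}
  end

  @impl true
  def handle_cast({:log, entry}, buffer), do: {:noreply, [entry | buffer]}

  @impl true
  def handle_info(:flush, buffer) do
    unless buffer == [] do
      _batch = Enum.reverse(buffer)
      # One multi-row insert instead of one insert per message, e.g.:
      # Repo.insert_all(LogEntry, _batch)
    end

    schedule_flush()
    {:noreply, []}
  end

  defp schedule_flush, do: Process.send_after(self(), :flush, @flush_interval)
end
```

The key property is that database round-trips become bounded by the flush interval (at most two per second here) regardless of the message rate.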

edongashi commented 3 years ago

> Batch inserts are way more efficient.

Could be. I'll try to tweak the code to see if we can batch them up and then I'll run the measurements again.

cristianberneanu commented 3 years ago

I don't think batching is a complete solution to this issue. It will alleviate the problem a bit, but the system will still put a high load on the database for a non-critical feature. I think the problem is that debug logging is very noisy, so we should ignore debug logs (or assume they will only be activated sporadically) or drop them on high load (which we need to detect somehow).

edongashi commented 3 years ago

I agree with @cristianberneanu that batching will not be the fix. It will improve performance but only linearly.

> or drop them on high load (which we need to detect somehow).

I suggest sampling the first X debug messages in each time slot.

We can store a tuple {min_time, count} in GenServer state. Once count reaches a threshold, we increment min_time by a second (or similar) and reset count to 0. We reject any message with a timestamp earlier than min_time.
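The {min_time, count} scheme above could look something like this (a sketch with assumed values for the threshold and slot length; `DebugSampler` is a made-up name):

```elixir
defmodule DebugSampler do
  @threshold 100   # max debug messages accepted per slot (assumed value)
  @slot_seconds 1  # how far min_time advances when a slot fills (assumed value)

  # State is {min_time, count}; messages timestamped before min_time are rejected.
  def new(now), do: {now, 0}

  # Message falls before the current window: drop it.
  def accept({min_time, _count} = state, timestamp) when timestamp < min_time,
    do: {:reject, state}

  # Slot is full: accept this message, then push the window forward and reset.
  def accept({min_time, count}, _timestamp) when count + 1 >= @threshold,
    do: {:accept, {min_time + @slot_seconds, 0}}

  # Normal case: accept and count.
  def accept({min_time, count}, _timestamp),
    do: {:accept, {min_time, count + 1}}
end
```

Because min_time moves into the future when a slot fills, everything arriving during the next second is rejected, which caps the accepted debug rate at roughly @threshold per slot.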

cristianberneanu commented 3 years ago

> We can store a tuple {min_time, count} in GenServer state. Once count reaches a threshold we increment min_time by a second or similar and reset count to 0. We reject any messages < min_time.

This will let some debug messages through and drop others, making the log unreliable. In my opinion it would be better to drop all debug messages for some period during high load and emit a warning about it instead.

sebastian commented 3 years ago

> I agree with @cristianberneanu that batching will not be the fix. It will improve performance but only linearly.

It is not a panacea, but it might be more than enough? It might be the difference between saturating the DB at 100 log messages inserted per second and inserting 3000 per second without stressing it at all...

Building overload protection etc. seems rather complex.

edongashi commented 3 years ago

Ok, I see 2 orthogonal tasks here:

  1. Batch updates
  2. Detect overloads

I'm trying to implement the first one, unless Cristian knows any shortcuts.

cristianberneanu commented 3 years ago

> I'm trying to implement the first one, unless Cristian knows any shortcuts.

You mean a shortcut to do batch updates? Nothing else comes to mind besides gathering logs in a queue and periodically flushing that into the database.

cristianberneanu commented 3 years ago

You could do both at the same time if you only flush a limited amount of entries per interval: if the queue grows too big, go into the "high load mode" for x intervals and ignore debug messages.

edongashi commented 3 years ago

> You could do both at the same time if you only flush a limited amount of entries per interval: if the queue grows too big, go into the "high load mode" for x intervals and ignore debug messages.

Good idea! The Air.Service.Logs GenServer can handle both in its state.
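Combining the two ideas in one GenServer state might look roughly like this (a sketch only; the module name, thresholds, and entry shape are all assumptions, and the batch insert is left as a comment since the real Air.Service.Logs internals aren't shown here):

```elixir
defmodule CombinedLogs do
  use GenServer

  @flush_interval 500      # ms between flushes
  @max_flush 1_000         # max entries written per flush (assumed value)
  @high_load_limit 10_000  # leftover queue length that triggers high-load mode (assumed)
  @high_load_intervals 10  # how many intervals to stay in high-load mode (assumed)

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def log(level, message), do: GenServer.cast(__MODULE__, {:log, level, message})

  @impl true
  def init(:ok) do
    Process.send_after(self(), :flush, @flush_interval)
    {:ok, %{queue: [], high_load: 0}}
  end

  @impl true
  def handle_cast({:log, :debug, _msg}, %{high_load: n} = state) when n > 0 do
    # High-load mode: drop debug messages outright.
    {:noreply, state}
  end

  def handle_cast({:log, level, msg}, state) do
    {:noreply, %{state | queue: [{level, msg} | state.queue]}}
  end

  @impl true
  def handle_info(:flush, %{queue: queue, high_load: n} = state) do
    # Flush only a bounded batch per interval; the remainder stays queued.
    {batch, rest} = queue |> Enum.reverse() |> Enum.split(@max_flush)
    # Repo.insert_all(LogEntry, batch) would go here.
    _ = batch

    high_load =
      cond do
        length(rest) > @high_load_limit -> @high_load_intervals
        n > 0 -> n - 1
        true -> 0
      end

    Process.send_after(self(), :flush, @flush_interval)
    {:noreply, %{state | queue: Enum.reverse(rest), high_load: high_load}}
  end
end
```

The bounded flush caps database load per interval, and the leftover queue length doubles as the overload signal, so no separate detection mechanism is needed.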

edongashi commented 3 years ago

if the queue grows too big, go into the "high load mode"

Should we discard debug messages from this already-filled queue?

cristianberneanu commented 3 years ago

I think that makes most sense.