enso-org / enso

Hybrid visual and textual functional programming.
https://enso.org
Apache License 2.0

Audit Logs should be able to handle pressure #9870

Open radeusgd opened 5 months ago

radeusgd commented 5 months ago

For efficiency, the logs are sent on a background thread.

There can be a problem if the logs are recorded faster than the background thread is able to process them, especially as currently they are sent one-by-one.

On my machine, the round-trip for sending a log message to the cloud usually takes around 100 ms, but it can peak at as much as 2 s per message. This means that in good conditions the maximum throughput is around 10 log events per second. This seems okay-ish for our current use cases, but it can easily be overwhelmed if a big operation runs a lot of queries quickly. If the operation runs only for a short time, the pending logs will be queued and sent with some delay. The problem starts if the operation keeps running for a longer time at too high a throughput: more and more log messages will be queued and the system may be unable to keep up.
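To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. It assumes messages are sent strictly one-by-one, so throughput is simply the inverse of the round-trip time. The function names and the 50 events/s load are hypothetical and only serve as an illustration.

```python
def max_throughput(round_trip_s: float) -> float:
    """Upper bound on events per second when messages are sent one at a time."""
    return 1.0 / round_trip_s


def backlog_after(duration_s: float, produce_rate: float, round_trip_s: float) -> float:
    """Pending events after `duration_s` seconds of sustained load (never negative)."""
    drain_rate = max_throughput(round_trip_s)
    return max(0.0, (produce_rate - drain_rate) * duration_s)


typical = max_throughput(0.1)  # ~100 ms round-trip: about 10 events per second
worst = max_throughput(2.0)    # 2 s round-trip: about 0.5 events per second

# Hypothetical load: 50 events/s sustained for one minute at the typical round-trip.
backlog = backlog_after(60, produce_rate=50, round_trip_s=0.1)
print(typical, worst, backlog)
```

With these assumed numbers the queue grows by about 40 events every second, so the backlog only shrinks again once the producer slows down below the drain rate.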

We have two values that we need to weigh:

  1. efficiency
  2. reliability of log events

The messages are sent in the background for efficiency, but if the system is overwhelmed, that may lead to problems with reliability.

Solutions to consider:

radeusgd commented 5 months ago

Setting as low priority: while this is a problem, current workloads are unlikely to encounter it often.

radeusgd commented 3 months ago
  • This would require changes in the Cloud to allow the /logs endpoint to accept multiple messages in a single request.

As reported by @PabloBuchu, the Cloud endpoint for logs can accept a list {logs: [...]}, so batching is now supported.
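A minimal sketch of what batching against that shape could look like, assuming pending messages sit in an in-memory queue. Only the outer `{logs: [...]}` wrapper comes from the comment above; the fields of each message, the `max_batch` limit, and the helper names are hypothetical.

```python
import json
import queue
from typing import Dict, List


def drain_batch(pending: queue.Queue, max_batch: int = 100) -> List[Dict]:
    """Take up to `max_batch` messages that are already queued, without waiting."""
    batch: List[Dict] = []
    while len(batch) < max_batch:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break  # nothing more is pending right now
    return batch


def build_payload(batch: List[Dict]) -> str:
    """Wrap the batch in the {logs: [...]} shape so it fits in a single request body."""
    return json.dumps({"logs": batch})
```

Sending one request per batch instead of one per message means the round-trip cost is paid once for the whole batch, which is what raises the achievable throughput.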

enso-bot[bot] commented 1 month ago

Radosław Waśko reports a new STANDUP for yesterday (2024-08-27):

Progress: Work on audit log batching - rewriting the background thread to keep a queue. It should be finished by 2024-08-29.

Next Day: Next day I will be working on the same task. Continue

enso-bot[bot] commented 1 month ago

Radosław Waśko reports a new STANDUP for yesterday (2024-08-28):

Progress: Implemented batching. Working on shutting down the thread if not used. It should be finished by 2024-08-29.

Next Day: Next day I will be working on the same task. Ensure liveness of logging. Debug and fix issue with logs on real cloud - getting 400 HTTP errors.

enso-bot[bot] commented 1 month ago

Radosław Waśko reports a new STANDUP for yesterday (2024-08-29):

Progress: CR, some improvements to the audit log batch PR. Start work on Cloud tests on CI. It should be finished by 2024-08-29.

Next Day: Next day I will be working on the #9523 task. Try implementing a prototype

radeusgd commented 1 month ago

I've implemented the batching part of this task with #10918.

Moving back to New so that we can schedule the second part: handling pressure by blocking further async log requests when the queue exceeds some size. This has lower priority, so we can schedule it whenever we have some time OR only if we see practical performance problems. Right now logging is used with such low intensity that we are very unlikely to run into problems from not having this pressure mechanism implemented. But it will be good to have it at some point in the future.
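For illustration, the pressure mechanism described above could be sketched with a bounded queue, where enqueueing blocks the caller once the limit is reached. This is not the code from #10918; the class name, `send_batch` callback, and the `max_pending` and `max_batch` values are all hypothetical.

```python
import queue
import threading


class BoundedLogQueue:
    """Background sender whose log_async blocks while too many messages are pending."""

    def __init__(self, send_batch, max_pending=1000, max_batch=100):
        self._pending = queue.Queue(maxsize=max_pending)  # the bound is the pressure limit
        self._send_batch = send_batch  # hypothetical callback that sends one batch
        self._max_batch = max_batch
        threading.Thread(target=self._run, daemon=True).start()

    def log_async(self, message):
        # Returns immediately while there is room; blocks the caller when the queue is full.
        self._pending.put(message)

    def flush(self):
        # Wait until every message queued so far has been handed to send_batch.
        self._pending.join()

    def _run(self):
        while True:
            batch = [self._pending.get()]  # wait for at least one message
            while len(batch) < self._max_batch:
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            try:
                self._send_batch(batch)
            except Exception:
                pass  # sketch only: a real implementation would retry or report the failure
            finally:
                for _ in batch:
                    self._pending.task_done()
```

Blocking the producer trades some efficiency for reliability: under sustained overload the operation slows down to the rate at which logs can be sent, rather than letting the queue grow without bound.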