radeusgd opened this issue 5 months ago (status: Open)
Setting as low priority because, while this is a problem, current workloads are unlikely to encounter it often.
- This would require changes in the Cloud to allow the `/logs` endpoint to accept multiple messages in a single request.
As reported by @PabloBuchu, the Cloud `/logs` endpoint can accept a list `{logs: [...]}`, so batching is now supported.
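To make the batched request concrete, here is a minimal sketch of building the `{logs: [...]}` envelope that the comment above describes. Only the `/logs` path and the `logs` list field come from this thread; the individual message fields and the helper name are illustrative assumptions, not the real Cloud schema.

```python
import json

def build_batch_payload(messages):
    # Wrap pending log messages in the {"logs": [...]} envelope that the
    # Cloud /logs endpoint accepts. The per-message fields below are
    # hypothetical placeholders, not the actual audit-log schema.
    return json.dumps({"logs": messages})

# Example: three queued messages sent as one request instead of three.
batch = build_batch_payload([
    {"message": "query started"},
    {"message": "query finished"},
    {"message": "table uploaded"},
])
```

The point of the envelope is that one HTTP round trip now carries the whole queue backlog, rather than paying the per-request latency once per message.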
Radosław Waśko reports a new STANDUP for yesterday (2024-08-27):
Progress: Work on audit log batching - rewriting the background thread to keep a queue. It should be finished by 2024-08-29.
Next Day: Next day I will be working on the same task. Continue.
Radosław Waśko reports a new STANDUP for yesterday (2024-08-28):
Progress: Implemented batching. Working on shutting down the thread if not used. It should be finished by 2024-08-29.
Next Day: Next day I will be working on the same task. Ensure liveness of logging. Debug and fix issue with logs on real cloud - getting 400 HTTP errors.
Radosław Waśko reports a new STANDUP for yesterday (2024-08-29):
Progress: CR, some improvements to the audit log batch PR. Start work on Cloud tests on CI. It should be finished by 2024-08-29.
Next Day: Next day I will be working on the #9523 task. Try implementing a prototype
I've implemented the batching part of this task with #10918.
Moving back to New so that we can schedule the second part: handling pressure by blocking further async log requests when the queue exceeds some size. This has lower priority, so we can schedule it whenever we have some time, or only if we see practical performance problems. Right now logging is used with such low intensity that we are very unlikely to run into problems without this pressure mechanism. But it will be good to have it at some point in the future.
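The remaining backpressure part could look roughly like the sketch below: a bounded queue whose `put` blocks producers once a size limit is reached, with a background thread draining it in batches. This is a hypothetical illustration of the mechanism described above, not the actual implementation; all names and the limit of 1000 are assumptions.

```python
import queue
import threading

# Hypothetical backpressure sketch: once MAX_PENDING messages are waiting,
# enqueue_log blocks the producer instead of letting the backlog grow
# without bound.
MAX_PENDING = 1000
_pending = queue.Queue(maxsize=MAX_PENDING)

def enqueue_log(message):
    # Blocks if the background sender has fallen MAX_PENDING messages behind.
    _pending.put(message)

def sender_loop(send_batch, stop):
    # Background thread: drain whatever is queued and send it as one batch.
    while not stop.is_set() or not _pending.empty():
        batch = []
        try:
            batch.append(_pending.get(timeout=0.1))
        except queue.Empty:
            continue
        while not _pending.empty() and len(batch) < 100:
            batch.append(_pending.get_nowait())
        send_batch(batch)
```

With this shape, a runaway producer is slowed down to the sender's pace rather than exhausting memory, which is exactly the reliability/efficiency trade-off the issue describes.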
For efficiency, the logs are sent on a background thread.
There can be a problem if the logs are recorded faster than the background thread is able to process them, especially as currently they are sent one-by-one.
On my machine, the round trip for sending a log message to the cloud usually takes around 100 ms, but can peak at as much as 2 s per message. This means that in good conditions the maximum throughput is around 10 log events per second. This seems okay-ish for our current use cases, but it could very easily be overwhelmed by a big operation that runs many queries quickly. If the operation runs for a short time, the pending logs will be queued and sent with some delay. The problem starts if the operation keeps running for a longer time at too high a throughput: more and more log messages will be queued and the system may be unable to keep up.
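The back-of-envelope numbers above can be made explicit. The 100 ms round trip and the resulting ~10 messages/second cap come from the paragraph; the 50 events/second workload is an illustrative assumption to show how fast the backlog grows.

```python
# ~100 ms per one-by-one round trip caps the sender at ~10 messages/second.
round_trip_s = 0.100
max_throughput = 1 / round_trip_s  # = 10 messages per second

def backlog_after(produced_per_s, seconds, drain_per_s=max_throughput):
    # How many messages pile up when production outpaces the sender.
    # Illustrative helper, not part of the actual implementation.
    return max(0.0, (produced_per_s - drain_per_s) * seconds)

# A job emitting 50 log events/s for one minute leaves
# (50 - 10) * 60 = 2400 messages queued.
```

Even a modest sustained workload therefore accumulates thousands of pending messages per minute, which is what motivates both batching and backpressure.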
We have two values that we need to weigh: the messages are sent in the background for efficiency, but if the system is overwhelmed, that may lead to problems with reliability.
Solutions to consider:
- Batching: this would require changes in the Cloud to allow the `/logs` endpoint to accept multiple messages in a single request.