Message retry performance implications and architectural issues

cloudfoundry / loggregator-agent-release

Apache License 2.0


Message retry performance implications and architectural issues #613

Open nicklas-dohrn opened 1 month ago

nicklas-dohrn commented 1 month ago

This issue is for discussing the current state of the retry logic for syslog messages, as it has some problematic implications. A short overview:

A syslog drain that fails under high load causes messages for other drains to be dropped. It also pushes the CPU consumption of the syslog agent above one CPU; I am not sure why.
The syslog-batching implementation cannot use the retry mechanism, because the retry writer holds no state about the batching being done.

I will add details and my testing results here later in a better formatted way.

ctlong commented 1 month ago

I dived a little deeper into the syslog writer code recently and I think that we were incorrect in some of our previous assertions about the synchronized nature of the agent. If you check out the syslog connector, which the manager uses to create new drains, each drain is provided with an egress diode. Since writing to the diode should be non-blocking, I think that the envelope writing loop is in fact asynchronous to some degree.

At least, a problematic syslog drain shouldn't directly prevent other drains from continuing to receive messages.
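
To illustrate the point about per-drain buffers, here is a minimal sketch of a non-blocking write. It is not the agent's code and not the diode implementation: the `drain` type, `newDrain`, and `offer` are hypothetical names, and a buffered channel stands in for the diode. One difference to keep in mind: as far as I understand, a diode overwrites the oldest entry when full, whereas this sketch rejects the newest message.

```go
package main

import "fmt"

// drain is a hypothetical stand-in for one syslog drain with its own buffer.
type drain struct {
	name    string
	buf     chan string // plays the role of the per-drain egress diode
	dropped int
}

func newDrain(name string, size int) *drain {
	return &drain{name: name, buf: make(chan string, size)}
}

// offer never blocks: if the buffer is full, the message is dropped and counted.
func (d *drain) offer(msg string) bool {
	select {
	case d.buf <- msg:
		return true
	default:
		d.dropped++
		return false
	}
}

func main() {
	healthy := newDrain("healthy", 4)
	stuck := newDrain("stuck", 4) // nothing ever reads from this buffer

	for i := 0; i < 10; i++ {
		msg := fmt.Sprintf("log %d", i)
		// The write loop visits the stuck drain first and still reaches the healthy one.
		for _, d := range []*drain{stuck, healthy} {
			d.offer(msg)
		}
		<-healthy.buf // the healthy consumer keeps up with the input
	}

	fmt.Printf("healthy dropped=%d, stuck dropped=%d\n", healthy.dropped, stuck.dropped)
	// prints "healthy dropped=0, stuck dropped=6"
}
```

In this simplified model the stuck drain only loses its own messages. It does not model CPU contention or retries, so it cannot explain drops on a healthy drain that stem from those.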

ctlong commented 1 month ago

High CPU usage of the agent is a known problem. Unfortunately, none of the logging and metrics agents currently have any kind of memory or CPU limitation placed upon them. They will expand as necessary to meet demand.

We took a pprof dump a while ago and saw that marshalling/unmarshalling envelopes was the primary performance issue of most of our agents. Part of what I hope to accomplish by merging every agent into the OTel collector is to reduce the number of marshal/unmarshal steps required to egress an individual envelope from a VM.

nicklas-dohrn commented 1 month ago

I did some testing as well, and it matches your assumption: as I understand it, every drain getting its own diode is why some degree of concurrency happens. IMHO this is also unwanted behaviour, because it does not allow setting a maximum resource consumption, so the syslog-agent is able to overload other components.

nicklas-dohrn commented 1 month ago

"At least, a problematic syslog drain shouldn't directly prevent other drains from continuing to receive messages."

Yes, this is what I see in testing. It only allows a "DoS"-style overload, where the dropped messages on the other, non-malicious receiver seem to be random.

[Screenshot 2024-09-26 at 08 38 16: inflowing data on the receiving side; the expected rate is 50 logs/s]