cloudfoundry / loggregator-agent-release

Apache License 2.0

Message retry performance implications and architectural issues #613

Open nicklas-dohrn opened 4 days ago

nicklas-dohrn commented 4 days ago

This issue is for discussing the current state of the retry logic for syslog messages, as it has some problematic implications. They are listed here briefly as an overview:

I will add details and my test results here later, in a better-formatted way.

ctlong commented 4 days ago

I dug a little deeper into the syslog writer code recently, and I think we were incorrect in some of our previous assertions about the synchronous nature of the agent. If you check out the syslog connector, which the manager uses to create new drains, each drain is provided with an egress diode. Since writing to the diode should be non-blocking, I think that the envelope writing loop is in fact asynchronous to some degree.

At least, a problematic syslog drain shouldn't directly prevent other drains from continuing to receive messages.
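
To illustrate the point about non-blocking writes, here is a minimal Go sketch of the idea. It is not the agent's actual code and does not use the real diode library: `drainBuffer`, `Set`, `TryNext`, and `fanOut` are hypothetical names, and the buffer is approximated with a buffered channel that discards its oldest entry when full. It assumes a single writer goroutine per buffer.

```go
package main

// drainBuffer is a hypothetical stand-in for a per-drain egress diode.
// It only illustrates the non-blocking, drop-oldest behaviour; it is
// not the real diode implementation.
type drainBuffer struct {
	ch      chan string // buffered; size must be at least 1
	dropped int         // messages discarded; only touched by the writer
}

func newDrainBuffer(size int) *drainBuffer {
	return &drainBuffer{ch: make(chan string, size)}
}

// Set never blocks. If the buffer is full, the oldest message is
// discarded to make room. Assumes a single writer goroutine.
func (b *drainBuffer) Set(msg string) {
	for {
		select {
		case b.ch <- msg:
			return
		default:
			// Buffer is full: fall through and evict the oldest message.
		}
		select {
		case <-b.ch:
			b.dropped++
		default:
			// A reader emptied a slot in the meantime; retry the send.
		}
	}
}

// TryNext returns the oldest buffered message without blocking.
func (b *drainBuffer) TryNext() (string, bool) {
	select {
	case m := <-b.ch:
		return m, true
	default:
		return "", false
	}
}

// fanOut writes one message to every drain. Because Set never blocks,
// a drain whose reader has stalled cannot hold up the other drains.
func fanOut(msg string, drains []*drainBuffer) {
	for _, d := range drains {
		d.Set(msg)
	}
}
```

In this sketch, a drain whose reader is stuck in retries only loses its own oldest messages; the loop in `fanOut` keeps moving. Whether the real agent behaves the same way depends on how each drain's reader and retry logic are wired up, which this example does not model.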

ctlong commented 4 days ago

High CPU usage of the agent is a known problem. Unfortunately, none of the logging and metrics agents currently have any kind of memory or CPU limitation placed upon them. They will expand as necessary to meet demand.

We took a pprof dump a while ago and saw that marshalling/unmarshalling envelopes was the primary performance issue of most of our agents. Part of what I hope to accomplish by merging every agent into the OTel collector is to reduce the number of marshal/unmarshal steps required to egress an individual envelope from a VM.