cloudfoundry-community / splunk-firehose-nozzle

Send CF component metrics, CF app logs, and CF app metrics to Splunk
Apache License 2.0

[BUG] - Splunk nozzle transfers backpressure upstream, resulting in client disconnects and doppler drops #285

Closed: Benjamintf1 closed this issue 2 years ago

Benjamintf1 commented 3 years ago

What happened: When the nozzle is unable to process all the envelopes that come from loggregator, it fills up a buffer and then stops reading from loggregator. This means that I often see nozzles being disconnected from traffic controller (with either the slow nozzle closure or the abnormal closure error).

What you expected to happen: To be able to reliably debug Splunk errors, it should be clear that the nozzle (and downstream) are unable to process the volume of envelopes coming in, ideally by placing a dropping buffer before any possibly blocking tasks (reaching out to capi, or downstream).

How to reproduce it (as minimally and precisely as possible): It can be done at the eventsink level or at the ingestor level by causing the emit call to block until prompted. It could also hypothetically be reproduced by pointing the Splunk nozzle at an endpoint that simply sleeps, on a somewhat noisy foundation. This should cause doppler drops, and then eventually a traffic controller disconnect.
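A minimal, self-contained Go sketch of the "block until prompted" idea above, using a hypothetical sink type rather than the nozzle's actual eventsink/ingestor interfaces:

```go
package main

import (
	"fmt"
	"time"
)

// blockingSink is a hypothetical stand-in for the nozzle's event sink: its
// Write call blocks until release is closed, simulating a downstream (e.g.
// an endpoint that just sleeps) that cannot keep up.
type blockingSink struct {
	release chan struct{}
}

func (s *blockingSink) Write(event string) error {
	<-s.release // block until prompted
	return nil
}

func main() {
	sink := &blockingSink{release: make(chan struct{})}
	done := make(chan struct{})

	go func() {
		// The read loop stalls on the first Write, so it stops pulling
		// envelopes and backpressure propagates toward doppler.
		_ = sink.Write("envelope-1")
		close(done)
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("read loop is blocked; releasing the sink")
	close(sink.release)
	<-done
}
```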

Environment:

Benjamintf1 commented 2 years ago

One of the first principles of loggregator is to not apply backpressure to applications. To work within these demands, loggregator makes heavy use of diodes (which are just ring buffers). This is mostly identical to a channel with a select and dropping on default, except that old values are dropped before newer ones instead of vice versa. There is some value to that, though for the buffer sizes and throughput we tend to be talking about there actually isn't a huge difference between the two patterns. The buffers are usually kept small to allow for multiple buffers on the system and to limit memory usage.
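A loose Go sketch of the two patterns mentioned above (this is not the actual go-diodes implementation, just an illustration of drop-newest vs. drop-oldest semantics on a bounded buffer):

```go
package main

import "fmt"

// trySend writes to a bounded channel and drops the newest value when the
// buffer is full -- the "select with default" pattern.
func trySend(ch chan int, v int) (dropped bool) {
	select {
	case ch <- v:
		return false
	default:
		return true // buffer full: drop the incoming value
	}
}

// diodeSend keeps the buffer full of the most recent values instead: when
// the buffer is full it evicts the oldest entry and then stores the new one.
func diodeSend(ch chan int, v int) (dropped bool) {
	for {
		select {
		case ch <- v:
			return dropped
		default:
			select {
			case <-ch: // evict the oldest value to make room
				dropped = true
			default:
			}
		}
	}
}

func main() {
	a := make(chan int, 3) // select/default: keeps 1 2 3, drops 4 5
	b := make(chan int, 3) // diode-style:    keeps 3 4 5, drops 1 2
	for i := 1; i <= 5; i++ {
		trySend(a, i)
		diodeSend(b, i)
	}
	for len(a) > 0 {
		fmt.Print(<-a, " ")
	}
	fmt.Println()
	for len(b) > 0 {
		fmt.Print(<-b, " ")
	}
	fmt.Println()
}
```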

When a nozzle is receiving logs and metrics slower than traffic controller is trying to send them, it applies that backpressure to doppler (traffic controller is more of an API proxy that saves consumers from having to manage 20-40 HTTP connections than a proper component in some ways), which then drops metrics and logs.

Dropping not only allows continuous delivery without catastrophic failure (like OOM errors), but it can also be used (if implemented correctly) to help determine which components need to be scaled.

If you see (egress) drops on the loggregator agents, you might need to scale your dopplers; if you see (egress) drops on your dopplers, you might need to scale your nozzles; and if you see (correctly implemented) (egress) drops on your nozzles, you need to scale your downstream in some way.

This is not implemented correctly in the Splunk nozzle. To fit this definition, it should be practically impossible for the nozzle to apply backpressure upstream unless the nozzle is using 100% of its CPU or input network throughput. Even a single goroutine connected to a dropping buffer or channel can process a great many envelopes per second, and as long as there is spare CPU room to do so, it can quickly ensure that it doesn't apply backpressure to the dopplers, and thus you shouldn't experience doppler drops from nozzle slowness. (Dopplers should spread load evenly between subscribers, so even if nozzles are not optimized to fully saturate the rest of the CPU, as long as the nozzles are adequately scaled (and the total number of connections the nozzles open doesn't overwhelm doppler and traffic controller) you shouldn't have problems with nozzles being overly greedy and reading locally faster than they can send despite available downstream bandwidth.)
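A loose Go sketch of the shape described above (assumed names, not the nozzle's actual code): the firehose read loop never blocks, a small buffer sits in front of the slow downstream, and overload shows up as a drop counter instead of backpressure:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	envelopes := make(chan string, 100) // small dropping buffer in front of downstream
	dropped := 0

	// Single drain goroutine: the only place that waits on the slow sink.
	go func() {
		for e := range envelopes {
			time.Sleep(10 * time.Millisecond) // stand-in for a slow downstream write
			_ = e
		}
	}()

	// Firehose read loop (simulated): it never blocks on the buffer, so it
	// never applies backpressure upstream; it drops and keeps reading.
	for i := 0; i < 1000; i++ {
		select {
		case envelopes <- fmt.Sprintf("envelope-%d", i):
		default:
			dropped++ // downstream can't keep up; count the drop
		}
	}
	fmt.Println("dropped:", dropped)
}
```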

Scaling the entire logging system can take a bit of time, and we're looking at at least 5 processes (on the Diego VM, doppler, traffic controller, the nozzle, ???, Splunk heavy forwarders, ???) that could all impact performance or be a performance bottleneck. When everything is in alignment (or at least it's understood where gaps in alignment exist), it makes things a lot easier. Knowing where the problem is occurring is also very important. We've been on support calls where we've seen a healthy doppler with plenty of CPU and network headroom, as well as healthy traffic controllers and a perfectly healthy set of nozzles, with the Splunk nozzle largely blocked on downstream networking. Implementing this flow should make it very clear that the nozzle is dropping because downstream can't keep up, and should prevent a lot of wasted time and effort looking at a perfectly well-scaled upstream.

Benjamintf1 commented 2 years ago

Any thoughts?

Benjamintf1 commented 2 years ago

@kashyap-splunk @luckyj5 any thoughts? Do you still want me to make a new pr or is this something you'd like to pursue on your own?

luckyj5 commented 2 years ago

@Benjamintf1 Thanks for reaching out and for all the updates on this issue. Please take a look at PR 289, which was created to address it. Let us know your thoughts. Thanks!

Benjamintf1 commented 2 years ago

lgtm for the most part. It sounds like there's still some fear that it might be possible to drop because of capi slowness? Not to say it has happened or will happen, but having drops before that step too might raise confidence and make it easier to show that it's not capi/caching slowness. Otherwise it seems great.

kashyap-splunk commented 2 years ago

Yes, @Benjamintf1. In the next release after this, we are planning to address the enhancement of moving the parsing logic (which involves capi) after the event dropping (i.e. after the buffer), because we would like to get this build out as soon as possible with some other small but important fixes/enhancements.

So the above PR only includes the addition of event dropping for now, which is inspired by your last PR, and we will continue exploring further enhancements.

kashyap-splunk commented 2 years ago

Hi @Benjamintf1, I have made the remaining changes discussed above in the PR below: https://github.com/cloudfoundry-community/splunk-firehose-nozzle/pull/313

I am closing this issue, as the above change should now completely resolve it. Please have a look at the change, and open another issue if you feel any specific changes are still needed.