cloudfoundry-community / splunk-firehose-nozzle

Send CF component metrics, CF app logs, and CF app metrics to Splunk
Apache License 2.0

splunk nozzle now doesn't block, won't disconnect for being slow #284

Closed Benjamintf1 closed 2 years ago

Benjamintf1 commented 3 years ago

(I may be able to clean up some of the code with respect to the eventrouter in testing.)

Benjamintf1 commented 3 years ago

(I think you'll generally be safer if the dropping buffer is as close to the read call as possible, with as little code in between, but IIRC it looked fine; I made the requested changes anyway.)

As for "having it be an option": if the nozzle blocks, those envelopes are getting dropped one way or another; there is no "not dropping" option. You're only forcing those envelopes to be dropped at the doppler instead, because the nozzle can't deliver them fast enough.

Benjamintf1 commented 3 years ago

https://github.com/cloudfoundry-community/splunk-firehose-nozzle/issues/285

Benjamintf1 commented 3 years ago

Spent some time looking at this a bit more; this PR is actually not sufficient. Blocking could occur here: https://github.com/cloudfoundry-community/splunk-firehose-nozzle/blob/ecf9398cfbe3ce83e0252b05fa469937bf80b852/eventrouter/default.go#L67-L68.

(In this case, I'm going to define blocking as "a step that can take on the order of milliseconds rather than nanoseconds".)

I suspect most of the blocking we see is actually downstream blocking rather than CAPI blocking. I think the code there should be mostly performant, but that's going to be hard to determine with the code instrumented as it is.

kashyap-splunk commented 3 years ago

Hi @Benjamintf1, I get your point. I'm trying to think of a way to free the main thread to receive the next event immediately after receiving the current event from the doppler. For example, starting a separate goroutine to route the event at this line, and then either limiting the goroutine count (bounded parallelism) or fixing a wait time before dropping events.

https://github.com/cloudfoundry-community/splunk-firehose-nozzle/blob/develop/nozzle/nozzle.go#L69

Let me know your thoughts.

Benjamintf1 commented 3 years ago

No, both of those will still allow backpressure to propagate even when the CPU isn't at 100%. The only way to prevent this is to either balloon memory (very bad) or drop (less bad).

Benjamintf1 commented 3 years ago

I can leave more context in the issue (#285); I think we may end up with alternative or changed PRs.

Benjamintf1 commented 2 years ago

(As a short note on shortening timeouts: 1,000 logs per second is 1 log per millisecond. If processing is too slow, eventually the buffer will fill up, and then even an extremely short 10ms timeout means 10 lost logs; 5 seconds means 5,000, which is half a doppler buffer. Even if we reduce that by a factor of 5, a slow nozzle will have dropped 1,000 logs at the doppler, for being unable to read, before it reports the one metric it tried to send downstream.)

Benjamintf1 commented 2 years ago

Actually, based on concerns I'm hearing about CAPI, I'm going to close this in favor of a solution that looks more like the first PR, or a merging of the two. I can make a new PR if wanted.