Proposal: Eager start of inputs

colinsurprenant commented 4 years ago

Relates to #11175 #11170

Context

Logstash launches the workers' pipeline initialization and execution in threads and then immediately starts the input threads. This strategy has produced different behaviour between the Ruby and Java execution:

With the Ruby execution the pipeline initialization and execution were almost immediate, so there was no noticeable delay between the inputs starting and the workers processing data.
With the Java execution the pipeline initialization is slower because of the compilation involved. The pipeline initialization time has been improved in #11482 but nonetheless it will always take longer than the legacy Ruby execution.

In #11492 we will be making sure that the pipeline initialization is completed before starting the inputs. This behaviour is easier to understand and will become the default.

Proposal

In some use cases it might be desirable to be able to start inputs eagerly, especially with the Persistent Queue enabled: having the inputs start ASAP and write data to the PQ while the pipeline initialization is in progress would minimize data loss.
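The difference between the two start orders can be sketched with a toy model. This is only an illustration, not Logstash code: `simulate`, its parameters, and the logical "tick" clock are all hypothetical, and it assumes a source that cannot buffer or retry, so anything emitted before the input starts is lost.

```python
def simulate(eager_start, init_ticks, source_events):
    """Toy model of input start order (hypothetical, not Logstash code).

    source_events is a list of (tick, payload) pairs emitted by a source
    that cannot buffer or retry. Pipeline initialization takes init_ticks.
    Returns (queued, lost): payloads written to the queue and payloads
    dropped because the input was not yet listening.
    """
    # Eager start: the input listens from tick 0.
    # Default (post #11492): the input starts once initialization is done.
    input_start_tick = 0 if eager_start else init_ticks

    queued, lost = [], []
    for tick, payload in source_events:
        if tick >= input_start_tick:
            queued.append(payload)  # buffered until the workers are ready
        else:
            lost.append(payload)    # nobody was listening yet
    return queued, lost


events = [(0, "a"), (1, "b"), (4, "c")]
print(simulate(True, 3, events))   # (['a', 'b', 'c'], [])
print(simulate(False, 3, events))  # (['c'], ['a', 'b'])
```

In this model the eager start keeps the two events that arrive during initialization, at the cost of a backlog that the workers have to drain once they are ready.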

peacand commented 4 years ago

To my mind, it would be nice to have this option because the "best" behavior actually depends on the sources and the target we want to achieve. For realtime sources not able to manage back pressure such as Syslog UDP, it could be nice to start the listeners ASAP with PQ to prevent any data loss.

But in my case, for example, the input volume is so high that if I start the inputs early and the compilation of the filters takes ~6min, with PQ the pipeline will never be able to catch up and the outputs will be ~5min late forever. Which I absolutely don't want. But it is specific to my usage.

So I think having an option to start the input listeners as early as possible or only after the filters are ready makes sense.
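The catch-up concern above can be made concrete with a rough estimate. This is a back-of-the-envelope sketch, not a measurement: the function name and the rates are hypothetical, and it assumes constant input and processing rates.

```python
import math


def catch_up_seconds(input_rate, processing_rate, init_seconds):
    """Estimate how long the workers need to drain the startup backlog.

    Rates are in events per second (hypothetical, assumed constant).
    Returns math.inf when the workers have no spare capacity.
    """
    backlog = input_rate * init_seconds   # events queued during initialization
    spare = processing_rate - input_rate  # capacity left after live traffic
    if spare <= 0:
        return math.inf                   # the backlog never shrinks
    return backlog / spare


# Workers exactly keep up with live traffic: the backlog stays forever.
print(catch_up_seconds(10_000, 10_000, 360))  # inf
# 10% headroom: a 6 minute backlog takes an hour to drain.
print(catch_up_seconds(10_000, 11_000, 360))  # 3600.0
```

With little or no headroom, an eager start trades data loss for a lag that lasts a long time or never goes away, which is why the right choice depends on the workload.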

colinsurprenant commented 4 years ago

Thanks for your feedback @peacand. Two observations:

I assume that your ~6min compilation is post #11482? (7.5.2+)
From reading your comment, it feels like the "eagerly start inputs" option could be applicable per input. Ideally, I would imagine someone who wants to control that differently for multiple inputs would use multiple pipelines, but I could also see the case for controlling it individually per input.
We've been having some discussion about PQ read prioritization: an option to prioritize reading live inbound data before the queued data, for situations where it is more important to have the latest live data processed than whatever has accumulated in the PQ. The PQ backlog would then be processed whenever the throughput of live data slows down enough to allow it. This PQ read prioritization idea could be an interesting complement to the "eagerly start inputs" option: the workers would process the live data as soon as they are done initializing, without losing that ~6min of startup data, which would be processed whenever the workers are less busy.
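The read prioritization idea can be sketched as a reader that always prefers live data and falls back to the backlog only when no live data is waiting. This is a minimal illustration under assumed names (`PrioritizedReader`, `live`, `backlog`), not the PQ implementation or any existing Logstash API.

```python
from collections import deque


class PrioritizedReader:
    """Hypothetical reader: live events first, queued backlog when idle."""

    def __init__(self):
        self.live = deque()     # events arriving now
        self.backlog = deque()  # events queued while the workers initialized

    def next_event(self):
        if self.live:
            return self.live.popleft()     # latest data takes priority
        if self.backlog:
            return self.backlog.popleft()  # drain the backlog when idle
        return None                        # nothing to process


reader = PrioritizedReader()
reader.backlog.extend(["old1", "old2"])
reader.live.append("new1")
print(reader.next_event())  # new1
print(reader.next_event())  # old1
```

One consequence of this design is that events are no longer processed in arrival order, so it only suits outputs that can tolerate late, out-of-order data.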

elastic / logstash

Proposal: Eager start of inputs #11493