consulthys opened this issue 7 years ago
@colinsurprenant do you have any thoughts on this?
If we define multiple pipelines, but each pipeline with only one worker, does that model solve the issue? For example: with 2 CPUs, define 2 pipelines, each running `aggregate` with one worker.
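I.e. something along these lines in `pipelines.yml` (the ids and paths are just for illustration):

```yaml
# pipelines.yml
- pipeline.id: aggregate_a
  path.config: "/etc/logstash/conf.d/aggregate_a.conf"
  pipeline.workers: 1
- pipeline.id: aggregate_b
  path.config: "/etc/logstash/conf.d/aggregate_b.conf"
  pipeline.workers: 1
```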
@consulthys
Any progress on this enhancement here, or is it being tracked elsewhere?
I too would be interested in optimising log aggregation performance for my pipeline.
This seems like a good approach: routing events to a particular worker within a single pipeline by some key term. In my case I'd split by submitting host.
In case Logstash runs with multiple pipeline workers, events are handed out to each worker in turn as they become available. Certain plugins (such as `aggregate`, but also a few others doing some in-memory caching) require the user to set a single worker (`-w 1`) in order for the events not to be processed out of sequence. It's sad to have 2+ CPUs and only be able to use a single one just because of this limitation.

I'm wondering if we could potentially avoid this by introducing a way to route events to a specific worker, very much like how the `routing` parameter works in ES with regard to routing documents to specific shards, or the partitioning key in Kafka. The idea would be to introduce a base-level `routing` parameter that we could add to any input and whose (hashed) value would be taken into account in order for Logstash to know which worker the event should be routed to (`worker_id = hash(routing) % pipeline.workers`). This value could be static or could come from some field value of the event. In the latter case, this also implies that the field needs to be available before the filter phase, which might not be the case for inputs that produce `plain` events by default, but that could be circumvented by using the `json` codec instead.

Let's consider a file containing JSON logs such as
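the following (field names other than `deviceId` are purely illustrative):

```json
{"deviceId": 123, "temperature": 21.5, "timestamp": "2018-06-01T10:00:00Z"}
{"deviceId": 456, "temperature": 19.2, "timestamp": "2018-06-01T10:00:01Z"}
{"deviceId": 123, "temperature": 21.7, "timestamp": "2018-06-01T10:00:02Z"}
{"deviceId": 456, "temperature": 19.4, "timestamp": "2018-06-01T10:00:03Z"}
```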
A `file` input consuming those logs could be made "worker-aware" by routing the events based on the value of the `deviceId` field, like this:
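(This is just a sketch of the proposed syntax; the `routing` option does not exist today, and its exact name and form are up for discussion.)

```
input {
  file {
    path => "/var/log/devices/*.log"
    codec => "json"
    # hypothetical new option: the hash of this value decides which
    # pipeline worker handles the event
    routing => "%{deviceId}"
  }
}
```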
From this point on, we can ensure that all events of device `123` will always be handled by the same worker thread, and the same goes for device `456`.
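To make the determinism concrete, the `worker_id = hash(routing) % pipeline.workers` idea boils down to something like this (a rough Ruby sketch, not actual Logstash internals):

```ruby
# Pick a worker index for an event based on its routing value.
# The same routing value always maps to the same worker index.
def worker_for(routing_value, num_workers)
  routing_value.to_s.hash % num_workers
end

worker_for("123", 4)  # always the same index for deviceId 123
worker_for("456", 4)  # always the same index for deviceId 456
```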
This change could allow users to take advantage of all their CPUs while still using stateful filters such as `aggregate` (or the likes of `logstash-filter-cache`).

I saw that it will soon be possible to define several pipelines inside a single Logstash instance, but that won't solve the issue I'm pointing at here, since worker threads are internal to each defined pipeline.
Does anyone have any thoughts on this? Is there a similar effort going on internally?
Note: This was originally posted in the discuss forum