elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Let's Talk About Composability and Isolation (e.g. Inter-Pipeline Comms) #8067

Open andrewvc opened 6 years ago

andrewvc commented 6 years ago

We now have a multiple pipelines feature.

Users already rig up logstash->logstash communications for a variety of purposes, and linking individual pipelines is something users will want to do as well.
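For context, a minimal sketch of one way users wire Logstash instances together today, using the lumberjack output on the sender and the beats input on the receiver. The hostname, port, and certificate paths are placeholders; verify the option names against your plugin versions:

# Sender pipeline: forward events to a downstream Logstash instance
output {
  lumberjack {
    hosts => ["downstream.example.com"]                 # placeholder host
    port  => 5044
    ssl_certificate => "/etc/logstash/lumberjack.crt"   # placeholder path
  }
}

# Receiver pipeline (separate Logstash instance)
input {
  beats {
    port => 5044
    ssl  => true
    ssl_certificate => "/etc/logstash/lumberjack.crt"
    ssl_key         => "/etc/logstash/lumberjack.key"
  }
}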

In a broader sense, the problems these users are solving by linking pipelines together can best be, in my estimation, encapsulated in two terms: composability and isolation.

The key question of this issue is: Are these problems best solved by inter-pipeline communications?

Some alternate approaches:

  1. Allow users to create subpipelines within pipelines.
  2. Don't try to solve composition, just solve isolation by automatically scheduling / distributing work
  3. Don't try to solve isolation, but add a notion of functions into logstash.
  4. <Your Idea Here />

An initial approach at solving this with our current infrastructure is present here: https://github.com/logstash-plugins/logstash-integration-internal/pull/1

Hypothetical configurations for it are presented here as thought exercises for its use: https://gist.github.com/andrewvc/b8c31706f8b6c8d5c5a3750643247832

I'd like to focus, however, on the big questions I've presented here before we move forward with that.

I think it's likely that we'll come to a conclusion of "we don't know". In that case I think it could be a useful exercise to release the internal input/output as non-bundled plugins and let users play with them, learning from real-world use cases what we should build.

jakelandis commented 6 years ago

huge +1 to solving isolation, especially between outputs. [1]

W/r/t composition (as the hypothetical configuration implies), it seems like it could complicate things, both for developers (core and plugin) and for users. Are inputs/outputs/filters the right level of abstraction to compose? For example, is an output that pushes to another pipeline really an output, or just a means to achieve something else? If we solved isolation, I would be curious to know which use-cases composition is solving. (If it's just saving a bit of copy-pasta, there are less complex ways.)

[1] I would hope that isolation concerns are handled internally without the need for user input (unless there are different isolation strategies they want to configure)

guyboertje commented 6 years ago

I had a stab at solving this problem 7 years ago in my POC library called Lapaz. The following describes an analogue to a LS pipeline, called a route:

route(:route_name=>"purchases") do
  from Processor::Purchases, {:seq_id=>0, :name=>'start'}
  to Consumer::Forwarder, {:seq_id=>1, :forward_to=>'prices/start', :reply_to=>'purchases/render/3.3'}
  to Processor::Contacts, {:seq_id=>1, :mux_id=>'3.1'}
  to Processor::StockItems, {:seq_id=>1, :mux_id=>'3.2'}
  to Processor::TemplateRenderer, {:seq_id=>2, :name=>'render'}
  to Processor::LayoutRenderer, {:seq_id=>3}
  to Consumer::MongrelConsumer, {:seq_id=>4}
end

A Component reads events from a queue topic <route>/<sequence_id> and writes events to the next queue topic <route>/<sequence_id + 1>, unless the Component is a Consumer; Consumers are supposed to send the event to the outside world.

from and to are DSL words that accept a Ruby class and a config hash. The config hash is explained as:

:seq_id - is the sequencing order of the components.
:name - is an address given to the component that allows other components in this
    or other routes to send events to it.
:mux_id - is a way of allowing demultiplexing
:forward_to - is a way to forward the event to another component in another route
:reply_to - is a way to tell the forwarded route where the reply destination is.

Things I learnt:

I guess we need to know whether it's possible to learn enough from a plugin and its config to automate the parallelising of two plugins in series on behalf of the user.

Maybe this further supports the idea of a PQ, or a memory-based queue, with channels/topics. An isolated sub-pipeline without outputs would read from one fixed channel/topic and write to another, contextually derived one. ASCII art below; assume that the same JDBC lookup is needed for events from PQ@m1 and PQ@m2.

PQ@m1 --> Grok >-- Q(jdbc-lookup)
PQ@m2 --> Grok >-- Date >-- Q(jdbc-lookup)

Q(ack-this) --> Acker, acks to PQ@m1|m2 (contextually derived)
Q(esout) --> ES output >-- Q(ack-this)
Q(geoip) --> Geoip >-- Useragent >-- Q(esout)

Q(jdbc-lookup) --> JDBC lookup >-- Q(contextually derived, may be esout or geoip)
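To make the diagram concrete, here is a purely hypothetical sketch of how the shared jdbc-lookup sub-pipeline might be declared. The channel input/output plugins do not exist today, and the jdbc_streaming settings are abbreviated; this only illustrates the topology above:

# HYPOTHETICAL: "channel" input/output plugins are invented for illustration only.
input {
  channel { topic => "jdbc-lookup" }    # read from the fixed topic
}
filter {
  jdbc_streaming {                      # real plugin, settings abbreviated
    jdbc_connection_string => "jdbc:postgresql://db.example.com/lookups"
    statement  => "SELECT name FROM hosts WHERE ip = :ip"
    parameters => { "ip" => "src_ip" }
    target     => "host_info"
  }
}
output {
  # contextually derived next hop: esout or geoip, as in the diagram
  channel { topic => "%{[@metadata][next_topic]}" }
}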

guyboertje commented 6 years ago

Something like this, perhaps...

[image: channeled-pipelines diagram]

robcowart commented 6 years ago

I can definitely see this feature adding value as far as manageability of pipelines is concerned. The filter section of the pipeline for my ElastiFlow solution has three main functions: normalize v5 flows, normalize v9 flows, and process the normalized events. With this as the basis I built a kind of prototype by breaking the config into four separate pipelines, with Redis handling messaging between pipelines (a minimal sketch of that wiring follows below). The pipelines are:

I can demo this to anyone who wants to see it; I even created it using the Logstash 6.0 central config features.
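For illustration, a minimal sketch of how two such pipelines could be chained through Redis using the stock redis output and input plugins. The host and key name are placeholders, not necessarily what ElastiFlow uses:

# Upstream pipeline (collection/normalization): hand normalized events to Redis
output {
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "normalized_flows"    # placeholder key name
  }
}

# Downstream pipeline (processing): pick the events back up
input {
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "normalized_flows"
  }
}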

More thoughts to come.

robcowart commented 6 years ago

Let's look at another use-case. Sometimes you just want to be able to reuse some code in multiple pipelines. I use a number of metadata fields to control the processing of events. These values are initialized in each "collection" pipeline like this...

mutate {
  id => "init_control_metadata"
  add_field => {
    "[@metadata][output_elasticsearch]" => "${LS_OUTPUT_ELASTICSEARCH:true}"
    "[@metadata][output_stdout]" => "${LS_OUTPUT_STDOUT:false}"
    "[@metadata][output_rawcap]" => "${LS_RAWCAP:false}"
  }
}
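Presumably these metadata fields then gate behaviour later in the pipeline; here is a sketch of that kind of usage (my assumption, not shown in the original comment), built from standard Logstash conditionals:

# Assumed downstream usage: the control metadata decides which outputs fire.
output {
  if [@metadata][output_elasticsearch] == "true" {
    elasticsearch { hosts => ["127.0.0.1:9200"] }    # placeholder host
  }
  if [@metadata][output_stdout] == "true" {
    stdout { codec => rubydebug { metadata => true } }
  }
}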

If I need to change the init_control_metadata block, for example to add another metadata value, I have to do so in every collection pipeline. However, if I had a way to define this block elsewhere and simply include it where needed, I would have only one place to edit.

This is only one of many examples where the ability to reuse code blocks would be useful. It doesn't have to be a function; even a rudimentary form of concatenation, where the external text is inserted inline (maybe using an include statement), would be useful. Functions could be even better.
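A purely hypothetical sketch of what such an include might look like; no such directive exists in the Logstash config language today, and the file path is made up:

# HYPOTHETICAL syntax: Logstash has no include directive today.
# shared/control_metadata.conf would hold the mutate block shown above.
filter {
  include "shared/control_metadata.conf"
  # ... pipeline-specific filters continue here ...
}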

NOTE: When I was at Micromuse our Probe Rules Files were similar in concept to a Logstash pipeline. We had an include statement which did a kind of inline text insertion as the rules file was loaded. Leveraging this capability as a kind of poor man's function, I was able to develop some complex parsing code that was easily reused in many places. The best example was the rules file code which dissected the interface names from vendors such as Cisco and Juniper, extracting things like slot, sub-slot, physical port, sub-interfaces/VLANs, etc. From this an intra-device containment model could be built, upon which various causal analytics were based. Using include statements this logic was easily inserted into the processing logic for over 1000 different syslog messages, but could be maintained in one place.

I don't believe inter-pipeline comms solves this problem. Rather, I believe that such code reuse features would allow multiple pipelines with inter-pipeline comms to be developed more flexibly and easily.

andrewvc commented 6 years ago

Lots of interesting stuff in this thread! My thoughts:

@jakelandis : I think @robcowart brought up lots of great points WRT alternate composition approaches. I do think that composition for a graph (which I would argue is what Logstash configs really are, since they have unidirectional flow) looks a lot like inter-pipeline comms.

@guyboertje thanks for sharing those thoughts on parallelism! What concrete proposals would you make in this vein? Also, do you think we need to execute filters in parallel for a single event? What benefits would that provide over executing an individual event's entire pipeline in a separate thread, given the rather short processing time for single events? What do you think the optimal granularity for concurrency is here?

@robcowart I love love love all this concrete feedback from the field. I think it says something that inter-pipeline comms are something you're already using. Would you prefer to have multiple logical Logstash pipelines that talk to each other, or to have a pipeline be able to define sub-pipelines scoped within itself? In other words, pipelines would be more like a namespace grouping other pipelines that can be linked together.

C-style include statements also seem like they may have their place. There's some trickiness there in that they would need to be templated; specifically, explicitly setting id would be tricky, but I think there may be some workarounds.
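A purely hypothetical sketch of that templating concern: if an included snippet hard-codes a plugin id, each expansion would need some way to make the id unique. None of this syntax exists today:

# HYPOTHETICAL: templated include, shown only to illustrate the id problem.
filter {
  include "shared/control_metadata.conf" {
    id_prefix => "netflow_v5"    # would expand to id => "netflow_v5_init_control_metadata"
  }
}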

MarcusCaepio commented 6 years ago

Hi all, coming from my thread on Elastic Discuss (https://discuss.elastic.co/t/logstash-6-multiple-pipelines-for-one-input/107929). When I read in the Elastic blog about the Elastic 6 release and multiple pipelines, I immediately started testing it. Unfortunately it ended with the result that I cannot use multiple pipelines the way I want to. Please bear with me, because I am quite new to the whole ELK world, so I consider everything quite soberly. My situation in short:

Hope my thoughts are understandable :)

Cheers, Marcus

hrak commented 5 years ago

There seems to be a solution for the single input -> multiple pipelines use-case in Logstash 6.3:

https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html
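For reference, a minimal sketch of that pipeline-to-pipeline wiring as described in the linked docs; the pipeline ids, port, and virtual address are placeholders:

# pipelines.yml
- pipeline.id: intake
  config.string: |
    input { beats { port => 5044 } }
    output { pipeline { send_to => ["processing"] } }
- pipeline.id: processing
  config.string: |
    input { pipeline { address => "processing" } }
    output { elasticsearch { hosts => ["127.0.0.1:9200"] } }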