elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Let's Talk About Composability and Isolation (e.g. Inter-Pipeline Comms) #8067

Open andrewvc opened 6 years ago

andrewvc commented 6 years ago

We now have a multiple pipelines feature.

Users already rig up logstash->logstash communications for a variety of purposes, and linking individual pipelines is something users will want to do as well.
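For context, a minimal sketch of one way users wire Logstash instances together today, using the lumberjack output on the sender and the beats input on the receiver. The hostname, port, and certificate paths are placeholders; verify the option names against your plugin versions:

# Sender pipeline: forward events to a downstream Logstash instance
output {
  lumberjack {
    hosts => ["downstream.example.com"]                 # placeholder host
    port  => 5044
    ssl_certificate => "/etc/logstash/lumberjack.crt"   # placeholder path
  }
}

# Receiver pipeline (separate Logstash instance)
input {
  beats {
    port => 5044
    ssl  => true
    ssl_certificate => "/etc/logstash/lumberjack.crt"
    ssl_key         => "/etc/logstash/lumberjack.key"
  }
}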

In a broader sense, the problems these users are solving by linking pipelines together can best be, in my estimation, encapsulated in two terms: composability and isolation.

The key question of this issue is: Are these problems best solved by inter-pipeline communications?

Some alternate approaches:

  1. Allow users to create subpipelines within pipelines.
  2. Don't try to solve composition, just solve isolation by automatically scheduling / distributing work
  3. Don't try to solve isolation, but add a notion of functions into logstash.
  4. <Your Idea Here />

An initial approach at solving this with our current infrastructure is present here: https://github.com/logstash-plugins/logstash-integration-internal/pull/1

Hypothetical configurations for it are presented here as thought exercises for its use: https://gist.github.com/andrewvc/b8c31706f8b6c8d5c5a3750643247832

I'd like to focus, however, on the big questions I've presented here before we move forward with that.

I think it's likely that we'll come to a conclusion of "we don't know". In that case I think it could be a useful exercise to release the internal input/output as non-bundled plugins and let users play with them, learning from real-world use cases what we should build.

jakelandis commented 6 years ago

huge +1 to solving isolation, especially between outputs. [1]

W/r/t composition (as the hypothetical configuration implies), it seems like it could complicate things, both for developers (core and plugin) and for users. Are inputs/outputs/filters the right level of abstraction to compose? For example, is an output that pushes to another pipeline really an output, or just a means to achieve something else? If we solved isolation, I would be curious to know which use-cases composition is solving. (If it's just saving a bit of copy-pasta, there are less complex ways.)

[1] I would hope that isolation concerns are handled internally without the need for user input (unless there are different isolation strategies they want to configure)

guyboertje commented 6 years ago

I had a stab at solving this problem 7 years ago in my POC library called Lapaz. The following describes an analogue to a LS pipeline, called a route:

route(:route_name=>"purchases") do
  from Processor::Purchases, {:seq_id=>0, :name=>'start'}
  to Consumer::Forwarder, {:seq_id=>1, :forward_to=>'prices/start', :reply_to=>'purchases/render/3.3'}
  to Processor::Contacts, {:seq_id=>1, :mux_id=>'3.1'}
  to Processor::StockItems, {:seq_id=>1, :mux_id=>'3.2'}
  to Processor::TemplateRenderer, {:seq_id=>2, :name=>'render'}
  to Processor::LayoutRenderer, {:seq_id=>3}
  to Consumer::MongrelConsumer, {:seq_id=>4}
end

A Component reads events from a queue topic <route>/<sequence_id> and writes events to the next queue topic <route>/<sequence_id + 1>, unless the Component is a Consumer; Consumers are supposed to send the event to the outside world.

from and to are DSL words that accept a Ruby class and a config hash. The config hash is explained as:

:seq_id - is the sequencing order of the components.
:name - is an address given to the component that allows other components in this
    or other routes to send events to it.
:mux_id - is a way of allowing demultiplexing
:forward_to - is a way to forward the event to another component in another route
:reply_to - is a way to tell the forwarded route where the reply destination is.

Things I learnt:

I guess we need to know whether it's possible to learn enough from a plugin and its config to automate the parallelising of two plugins in series on behalf of the user.

Maybe this further supports the idea of a PQ, or a memory-based queue, with channels/topics. An isolated sub-pipeline without outputs would read from one fixed channel/topic and write to another, contextually derived one. ASCII art below; assume that the same JDBC lookup is needed for events from PQ@m1 and PQ@m2.

PQ@m1 --> Grok >-- Q(jdbc-lookup)
PQ@m2 --> Grok >-- Date >-- Q(jdbc-lookup)

Q(ack-this) --> Acker, acks to PQ@m1|m2 (contextually derived)
Q(esout) --> ES output >-- Q(ack-this)
Q(geoip) --> Geoip >-- Useragent >-- Q(esout)

Q(jdbc-lookup) --> JDBC lookup >-- Q(contextually derived, may be esout or geoip)
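To make the diagram concrete, here is a purely hypothetical sketch of how the shared jdbc-lookup sub-pipeline might be declared. The channel input/output plugins do not exist today, and the jdbc_streaming settings are abbreviated; this only illustrates the topology above:

# HYPOTHETICAL: "channel" input/output plugins are invented for illustration only.
input {
  channel { topic => "jdbc-lookup" }    # read from the fixed topic
}
filter {
  jdbc_streaming {                      # real plugin, settings abbreviated
    jdbc_connection_string => "jdbc:postgresql://db.example.com/lookups"
    statement  => "SELECT name FROM hosts WHERE ip = :ip"
    parameters => { "ip" => "src_ip" }
    target     => "host_info"
  }
}
output {
  # contextually derived next hop: esout or geoip, as in the diagram
  channel { topic => "%{[@metadata][next_topic]}" }
}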

guyboertje commented 6 years ago

Something like this, perhaps...

[image: channeled-pipelines diagram]

robcowart commented 6 years ago

I can definitely see this feature adding value as far as manageability of pipelines is concerned. The filter section of the pipeline for my ElastiFlow solution has three main functions: normalize v5 flows, normalize v9 flows, and process the normalized events. With this as the basis I built a kind of prototype by breaking the config into four separate pipelines, with Redis handling messaging between pipelines (a minimal sketch of that wiring follows below). The pipelines are:

I can demo this to anyone who wants to see it; I even created it using the Logstash 6.0 central config features.
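For illustration, a minimal sketch of how two such pipelines could be chained through Redis using the stock redis output and input plugins. The host and key name are placeholders, not necessarily what ElastiFlow uses:

# Upstream pipeline (collection/normalization): hand normalized events to Redis
output {
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "normalized_flows"    # placeholder key name
  }
}

# Downstream pipeline (processing): pick the events back up
input {
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "normalized_flows"
  }
}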

More thoughts to come.

robcowart commented 6 years ago

Let's look at another use-case. Sometimes you just want to be able to reuse some code in multiple pipelines. I use a number of metadata fields to control the processing of events. These values are initialized in each "collection" pipeline like this...

mutate {
  id => "init_control_metadata"
  add_field => {
    "[@metadata][output_elasticsearch]" => "${LS_OUTPUT_ELASTICSEARCH:true}"
    "[@metadata][output_stdout]" => "${LS_OUTPUT_STDOUT:false}"
    "[@metadata][output_rawcap]" => "${LS_RAWCAP:false}"
  }
}
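Presumably these metadata fields then gate behaviour later in the pipeline; here is a sketch of that kind of usage (my assumption, not shown in the original comment), built from standard Logstash conditionals:

# Assumed downstream usage: the control metadata decides which outputs fire.
output {
  if [@metadata][output_elasticsearch] == "true" {
    elasticsearch { hosts => ["127.0.0.1:9200"] }    # placeholder host
  }
  if [@metadata][output_stdout] == "true" {
    stdout { codec => rubydebug { metadata => true } }
  }
}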

If I need to change the init_control_metadata block, for example to add another metadata value, I have to do so in every collection pipeline. However, if I had a way to define this block elsewhere and simply include it where needed, I would have only one place to edit.

This is only one of many examples where the ability to reuse code blocks would be useful. It doesn't have to be a function; even a rudimentary form of concatenation, where the external text is inserted inline (maybe using an include statement), would be useful. Functions could be even better.
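A purely hypothetical sketch of what such an include might look like; no such directive exists in the Logstash config language today, and the file path is made up:

# HYPOTHETICAL syntax: Logstash has no include directive today.
# shared/control_metadata.conf would hold the mutate block shown above.
filter {
  include "shared/control_metadata.conf"
  # ... pipeline-specific filters continue here ...
}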

NOTE: When I was at Micromuse our Probe Rules Files were similar in concept to a Logstash pipeline. We had an include statement which did a kind of inline text insertion as the rules file was loaded. Leveraging this capability as a kind of poor man's function, I was able to develop some complex parsing code that was easily reused in many places. The best example was the rules file code which dissected the interface names from vendors such as Cisco and Juniper, extracting things like slot, sub-slot, physical port, sub-interfaces/VLANs, etc. From this an intra-device containment model could be built, upon which various causal analytics were based. Using include statements this logic was easily inserted into the processing logic for over 1000 different syslog messages, but could be maintained in one place.

I don't believe inter-pipeline comms solves this problem. Rather, I believe that such code reuse features would allow multiple pipelines with inter-pipeline comms to be developed more flexibly and easily.

andrewvc commented 6 years ago

Lots of interesting stuff in this thread! My thoughts:

@jakelandis : I think @robcowart brought up lots of great points WRT alternate composition approaches. I do think that composition for a graph (which I would argue is what Logstash configs really are, since they have unidirectional flow) looks a lot like inter-pipeline comms.

@guyboertje thanks for sharing those thoughts on parallelism! What concrete proposals would you make in this vein? Also, do you think we need to execute filters in parallel for a single event? What benefits would that provide over executing an individual event's entire pipeline in a separate thread, given the rather short processing time for single events? What do you think the optimal granularity for concurrency is here?

@robcowart I love love love all this concrete feedback from the field. I think it says something that inter-pipeline comms are something you're already using. Would you prefer to have multiple logical Logstash pipelines that talk to each other, or to have a pipeline be able to define sub-pipelines scoped within itself? In other words, pipelines would be more like a namespace grouping other pipelines that can be linked together.

C-style include statements also seem like they may have their place. There's some trickiness there in that they would need to be templated; specifically, explicitly setting id would be tricky, but I think there may be some workarounds.
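A purely hypothetical sketch of that templating concern: if an included snippet hard-codes a plugin id, each expansion would need some way to make the id unique. None of this syntax exists today:

# HYPOTHETICAL: templated include, shown only to illustrate the id problem.
filter {
  include "shared/control_metadata.conf" {
    id_prefix => "netflow_v5"    # would expand to id => "netflow_v5_init_control_metadata"
  }
}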

MarcusCaepio commented 6 years ago

Hi all, coming from my thread on Elastic Discuss (https://discuss.elastic.co/t/logstash-6-multiple-pipelines-for-one-input/107929). When I read in the Elastic blog about the Elastic 6 release and multiple pipelines, I immediately started testing it. Unfortunately it ended with the result that I cannot use multiple pipelines the way I want to. Please bear with me, because I am quite new to the whole ELK world, so I consider everything quite soberly. My situation in short:

Hope my thoughts are understandable :)

Cheers, Marcus

hrak commented 5 years ago

There seems to be a solution for the single input -> multiple pipelines use-case in Logstash 6.3:

https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html
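For reference, a minimal sketch of that pipeline-to-pipeline wiring as described in the linked docs; the pipeline ids, port, and virtual address are placeholders:

# pipelines.yml
- pipeline.id: intake
  config.string: |
    input { beats { port => 5044 } }
    output { pipeline { send_to => ["processing"] } }
- pipeline.id: processing
  config.string: |
    input { pipeline { address => "processing" } }
    output { elasticsearch { hosts => ["127.0.0.1:9200"] } }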