andrewvc opened this issue 8 years ago (status: Open)
Updated benchmarks for a simple (generator -> kv -> mutate -> stdout(dots)) pipeline: 2.13x faster. This shows that the gains for simpler pipelines are larger (as expected).
```
~/p/logstash (native-graph) $ time bin/logstash -w 2 -f simple-kv.yml | pv -b | wc -c
      137.29 real      300.15 user       47.45 sys
9.54MiB
10000107

~/p/logstash-alt (fix_oops_backtrace_logging) $ time bin/logstash -w 2 -f simplekv.conf | pv -b | wc -c
      292.70 real      647.86 user       53.70 sys
9.54MiB
10000107
```
Config files available here
After giving it some thought, I think I like the idea behind this change; it also complements the component idea stated in #4432 perfectly. But I really do think we should not rush into adding this change; I will try to explain my thoughts later on.
For now I would focus us on building the most flexible IR DAG model possible; performance should then follow naturally. What do you think?
@andrewvc and I have discussed the connectedness of nodes (components) within the DAG.
Background: As I have a strong electronics background, I have a mental model of components "wired up" to each other, so that the pipeline does not orchestrate the collection of data from an upstream component and feed it to a downstream component - rather, the upstream component knows its downstream connections and feeds data downstream autonomously. However, the pipeline does have to build the connected DAG. Here the DAG is more than a data structure - it's an execution structure.
Andrew, on the other hand, is more comfortable with the idea of a supervisory pipeline that does orchestrate as above. In this design, components are still specific wrappers that present a common communication interface and hide the API of the underlying "unit of work" that they wrap.
Conclusion: I have conceded that while the "hardware" notion of a connected DAG has some appeal - it is hard to construct, inspect and reason about at runtime. Therefore I am fully throwing my weight behind the idea of a DAG as a data structure with the pipeline orchestrating the flow of data through the graph.
We did agree that the nodes of the DAG would contain filter and output components, but could also contain other components such as metrics and trace-logging components. We also agreed that the stitching of components into a running instance of LS could be done dynamically.
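To make the wrapper idea above concrete, here is a rough, purely hypothetical sketch (none of these names come from the PR): every DAG node exposes the same small interface to the pipeline regardless of what it wraps.

```ruby
# Hypothetical sketch of a DAG node component: a uniform wrapper around a
# filter, output, metrics collector, or trace logger, so the pipeline can
# orchestrate data flow without knowing the wrapped plugin's API.
class Component
  def initialize(unit_of_work)
    @unit = unit_of_work
  end

  # Common communication interface: a batch of events in, a batch out.
  # Metric/trace components would typically pass the batch through
  # unchanged after acting on it.
  def accept(batch)
    @unit.process(batch)
  end
end
```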
Thanks for the comments @purbon and @guyboertje! I agree completely with all your concerns, @purbon: all the code/config language I've put in so far is experimental and meant to be changed and discussed!
I should have mentioned, however, that this graph language, in textual form, might also be a preferable config language for administrators!
I found some time while in Lake Tahoe to experiment (https://github.com/elastic/logstash/pull/4727) with what I'm terming a 'graph' pipeline execution model. The critical ideas behind this design are as follows:
Implementation
The current model compiles the Logstash config to Ruby code, then repeatedly executes it. An abbreviated version of the execution is shown below.
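The generated code itself isn't reproduced here; the following is a rough, simplified sketch of the per-worker loop. Only filter_func and output_func are the compiled entry points named in this discussion; everything else is illustrative.

```ruby
# Rough illustrative sketch of the compiled execution model. filter_func
# and output_func are generated from the config; the surrounding loop is
# simplified for clarity.
def worker_loop(input_queue)
  while (event = input_queue.pop)
    # filter_func runs the whole compiled filter chain for this one event
    # and may emit additional events (e.g. from clone/split filters).
    filter_func(event).each do |filtered|
      next if filtered.cancelled?
      # output_func does not write the event itself; it returns the list
      # of outputs this event should be routed to, conditionals applied.
      output_func(filtered).each { |output| output.receive(filtered) }
    end
  end
end
```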
Note that the entire filter chain for a single event is compiled into a single function, filter_func. This is also true of output_func, though in that case it returns which outputs the event should be sent to rather than directly executing them.

The graph model is simpler: we model the entire pipeline as a graph (from inputs, to queues, to filters, to outputs). The IR serialized to YAML might look like the following:
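The original example isn't preserved here; as a hypothetical sketch for the generator -> kv -> mutate -> stdout(dots) pipeline from the benchmark above (the field names are illustrative, not the actual experimental schema):

```yaml
# Hypothetical IR sketch; vertex/edge field names are illustrative only.
vertices:
  - { id: generator_1, type: input,  plugin: generator }
  - { id: main_queue,  type: queue }
  - { id: kv_1,        type: filter, plugin: kv }
  - { id: mutate_1,    type: filter, plugin: mutate }
  - { id: stdout_1,    type: output, plugin: stdout, codec: dots }
edges:
  - [generator_1, main_queue]
  - [main_queue, kv_1]
  - [kv_1, mutate_1]
  - [mutate_1, stdout_1]
```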
With the graph expressed this way, we can execute it according to any strategy of our choice.
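As one illustration of such a strategy (not the PR's actual scheduler), a batch-at-a-time execution could walk the vertices in topological order; the graph API used here (roots, topological_sort, successors, process) is hypothetical.

```ruby
# Illustrative batch-at-a-time strategy over the IR graph. Filters take a
# batch in and return a batch; outputs consume the batch at the leaves.
def execute_batch(graph, batch)
  pending = Hash.new { |h, k| h[k] = [] }
  graph.roots.each { |root| pending[root] = batch }

  graph.topological_sort.each do |vertex|
    events = pending.delete(vertex) || []
    next if events.empty?
    out = vertex.process(events)          # batch in, batch out
    graph.successors(vertex).each do |succ|
      pending[succ].concat(out)           # fan out to downstream vertices
    end
  end
end
```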
Why is it Faster?
I'm not sure. I know that implementing a similar graph pattern in Ruby yielded similar results. My money is on either some inefficiency in the generated Ruby or the greater CPU cache locality afforded by filtering batches. Either way, I believe the other benefits stand regardless of performance.
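To make the cache-locality theory concrete, the two models essentially differ in loop order; this is a sketch only, with filter(event) standing in for whatever the plugin does to an event.

```ruby
# Sketch of the loop-order difference behind the cache-locality theory.

# Compiled model: per event, run every filter; each event pulls the whole
# filter chain's code and data through the CPU caches.
events.each do |event|
  filters.each { |f| event = f.filter(event) }
end

# Graph/batch model: per filter, run it across the whole batch; one
# filter's code and working set stay hot while it processes the batch.
filters.each do |f|
  events.map! { |event| f.filter(event) }
end
```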