Can you add the lumberjack
plugin to the list of the common inputs?
@purbon I like this approach! With logstash 2.0 coming up, are there any plans to build some basic benchmarking counters / timers into logstash itself to aid in this effort? I can imagine that some basic throughput numbers could be useful over a future Logstash API
a few thoughts:
@colinsurprenant all good points on the JIT.
It may also be good to lower -XX:CompileThreshold
to force JITing to happen a little sooner given that Logstash is usually a long-lived process.
@andrewvc I think that changed to jruby.jit.threshold, see https://github.com/jruby/jruby/wiki/PerformanceTuning#jit-runtime-properties. In any case, since we normally operate at high TPS, the default of 50 is hit right at the start and does not make much of a difference.
Either way, having a proper benchmarking system in place will give us a good environment to play with all these options, including the JVM GC settings, and see what performs better.
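For reference, here is a rough sketch of how these knobs could be passed on startup. Treat the LS_JAVA_OPTS variable, the config file name and the exact values as assumptions; the right way to forward JVM options depends on the startup script in use.

```sh
# Assumption: the Logstash startup script forwards LS_JAVA_OPTS to the JVM.
# -XX:CompileThreshold lowers the HotSpot JIT trigger for this long-lived process,
# -Djruby.jit.threshold lowers the JRuby JIT trigger (default: 50 calls).
export LS_JAVA_OPTS="-XX:CompileThreshold=1000 -Djruby.jit.threshold=25"
bin/logstash -f benchmark.conf
```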
@purbon nice write up!
Number of events can vary a lot too; here we'll benchmark against 1_000 events, 10_000 events, 100_000 events and 1_000_000 events.
Like @colinsurprenant mentioned, we should not try to control the flow of events to these ranges. However, I do think we should have a fixed, large data set -- say 10 million apache log lines -- and have it ingested into LS using common configurations to catch regressions. The danger in performance testing is having too many variables and trying to tune them all at once; that quickly becomes too ambitious. We need to restrict the number of variables at play here. Most of our performance issues are regressions because we don't have historic data captured and we don't know the baseline numbers. I would suggest we focus on this problem first. We are not trying to publish official benchmarks for any of the plugins or config combinations (yet).
I think the primary goal for this exercise should be to build an infrastructure that allows us to easily record performance (throughput) nightly, for static sets of configurations using static sets of input data. We can then start playing with the knobs. Of course, if we have 10M log lines it will take a while for LS to process them, so this will account for JVM warm-up, GC cycles etc.
As a next step, if we can extend this perf suite to run automatically per commit, that would be great.
@ph, lumberjack added. Makes sense! Do you have any others in mind?
@andrewvc This is actually one of the long-run ideas: once we provide an API for Logstash, exposing numbers like these will be really beneficial for everyone. If I remember correctly, there are some issues open for this under the roadmap tag.
@colinsurprenant First of all, thanks for your thoughts; let me go through them and try to explain myself a bit.
I should say this initial issue was not intended to go into a lot of detail; the goal is for it to be a meta issue planting raw ideas. Obviously I consider warm-up a very important phase of this benchmark, as it should be for any kind of benchmarking. Let me explain my approach to handling warm-up here; for a more detailed discussion I would say we move this to the specific issue I'm going to open for the benchmarking framework.
Our first benchmarking efforts, initiated by you, dealt with warm-up by folding it into the execution phase: in a nutshell, the user sets a long enough run time and the final numbers (top TPS, avg TPS, etc.) become more accurate the longer the execution goes. My intention is to make warm-up a first-class citizen of this benchmark by providing an interface that explicitly lets the user decide how to execute it. I like this approach because it makes this very important phase explicit. Needless to say, the former approach is also valid, but my idea right now is to do it like this.
Speaking of the number of events, here we also have two different valid approaches: we can make time the base variable, or we can see how Logstash performs while ingesting a given number of events. As you said, by using time as the base variable we aim to see the maximum throughput, still bound to the timeframe used, but the longer it runs the closer to the maximum we can expect to get. However, with the intention of providing numbers to end users, using the number of events as the variable lets users see how LS performs for the volume of events they expect. Eventually you will reach the maximum by increasing the number of events. Again, both approaches are fine and provide meaningful points of view for different actors.
I don't close the door on including the time-based approach in the benchmarking framework; it is a valid one that, together with this one, will provide a lot of information to its users.
I hope I explained myself properly; let me know if you have more concerns about the meta idea, and let's move detailed discussions to their own issues. Your contributions are always interesting!!
/cheers
@suyograo see https://github.com/elastic/logstash/issues/3499#issuecomment-115163476 for more thoughts on why I chose these numbers to run the benchmarks; let me know if it needs further explanation.
I would say both approaches are completely fine and provide meaningful information for different stakeholders; I would not discard providing numbers based on event counts so easily without a bit more discussion of the benefit this can bring.
See http://ldbcouncil.org/benchmarks/snb as a good example of how to run benchmarks; I really like the way they approached it. No need to explain that they are benchmarking a database and we a pipeline: what queries are for them, the configurations/plugins used are for us, and those play a really important role in LS performance.
@suyog, the idea is to have the synthetic load generator parametrized, so it can generate the number of events you want. However, to publish data I would stick to these event counts for now. As said before, this way people can relate: "oh, I have LS and I aim to ingest 1_000_000 events, let's see how long it will take." Makes sense?
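As a hedged sketch of what "parametrized" could mean in practice, assuming the stock generator input is used as the synthetic load source (the message content below is just a placeholder, not part of any agreed data set):

```
input {
  generator {
    count   => 1000000                      # 1_000, 10_000, 100_000, 1_000_000, ...
    message => "synthetic log line payload" # placeholder; real runs would vary size and shape
  }
}
output {
  stdout { codec => dots }                  # one dot per event keeps the output cost low
}
```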
Hi, let me go through your comments:
On Thu, Jun 25, 2015 at 3:45 PM Colin Surprenant notifications@github.com wrote:
- I still believe that thinking in terms of number of events is somewhat futile. "Nobody" who deals with streaming data thinks in fixed numbers of events, but in TPS.
TPS will be one of the variables reported after the test execution.
- to be relevant, this fixed set of events would have to be large enough, typically in the millions, so that the run time would be long enough to have any meaning.
With the option to run N events, you actually let the benchmark user stress LS as hard as they want. This feature also goes hand in hand with the ability to mimic real live logs, I mean things like messages of different shapes, poisoned messages here and there, and much more. This is something critical to any benchmark.
- you really don't need a 10M-line log file to do a benchmark, a 100- or 1000-line file that you replay continuously is just fine and way easier to carry around; it can also be part of the project. Obviously that also depends on the input plugin that needs to be benchmarked, in which case the input format will depend on the plugin.
I disagree; replaying 10 lines over and over again is also not the best approach here.
- for the warmup, I am not sure I understand the concept of first class citizen and parametrized warmup time - in any case, the warmup phase cannot be dissociated from the execution phase. It has to be the same continuous process, and you have to figure out, within that total run time, which part is the warmup and which part is the benchmarkable run time; then you decide how you want to play with the numbers to account for the slower warmup part.
What I'm talking about here is having a command line like:
benchmark --warm [description] -process logfile
This way you see how the system behaves with different warm-ups. Obviously you cannot run the warm-up, stop LS, and then process the files; that makes no sense.
I disagree; replaying 10 lines over and over again is also not the best approach here.
- I didn't say 10; depending on the benchmark you do, a log sample set can actually be totally valid at 10, 100, 1000 or 10000 lines. For example, I would be very surprised if a TPS benchmark showed any significant difference between a 10M-line syslog or apache log file versus a 1k sample that you replay. Anyway, my point here is that smaller, statistically valid sample sets are a lot easier to manage, and you can actually include them in the project.
All-in-all great initiative! The scope is very large so let's make sure we iterate and target the most useful metrics and start collecting these to see trends and immediate performance improvements/regressions.
Also, let's see what can be reused from https://github.com/elastic/logstash-integration-testing
About the "synthetic log generator" idea, I know @rashidkpc had a neat one which could actually generate "realistic" or "interesting" log "shapes" for Kibana. Could be useful?
Hey fellows, my Logstash agent (client) can only read about 350 records (about 85M) per second, which is not even close to what we need. Any good ideas about how to optimize?
Hi, would you mind opening another issue to hold this discussion? There you can share your configuration so we can see if there is anything special that might make your ingestion rate slow.
@purbon Hi, no, never mind. Thanks for sharing my concern. New issue: https://github.com/elastic/logstash/issues/3549
Preface
This issue is intended to wrap up the benchmarking and profiling efforts around the Logstash project and its ecosystem. This is a very important and relevant topic that will serve as common ground for engineers, decision makers, developers and anyone interested in knowing what they can achieve with Logstash.
The internet is nowadays full of benchmarks done without technical people in mind; unlike those, in this effort we will take special care to be open by making the necessary tooling and data available, so you are actually able to perform the same analysis as we do and reach your own conclusions. Your feedback is going to be very valuable.
Objective
There is a set of objectives to be achieved with this effort; I'm going to summarize them here (for more details please check the related issues when available):
Methodology
To build these benchmarks we aim to combine different use cases with the intention of matching real-life usage. The initial bullet points are:
Event sizes: 0.5 KB, 1 KB, 2 KB, 5 KB and 10 KB
But we should not forget data formats; these vary a lot and can also be a subject of performance analysis. From our experience the most common formats are:
And last but not least, we should consider the kinds of configurations people are running Logstash in; for this we aim to test several different ones (including machines, virtualization, JVM and JRuby versions, etc.).
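To make this kind of run concrete, here is a minimal baseline pipeline sketch using only stock plugins and options; the file path and the data set itself are hypothetical placeholders, not an agreed benchmark configuration.

```
input {
  file {
    path           => "/data/benchmark/apache_access_10m.log"  # hypothetical fixed data set
    start_position => "beginning"
    sincedb_path   => "/dev/null"                               # always replay from the top
  }
}
filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
  date { match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ] }
}
output {
  stdout { codec => dots }   # cheap output; throughput is measured externally
}
```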
Tooling
To run this analysis we will need to build/integrate some tooling; here is the list of what we'll need:
Annex 1 (Profiling/Benchmarking grok)
It is well known that grok can be slow if not used properly, so we aim to also provide an annex benchmark on this filter to help people see how fast their grok expressions go.
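As an illustration of the kind of experiment this annex could cover (the patterns below are stock grok patterns, but the anchored-versus-unanchored comparison is only a suggestion, not an agreed benchmark):

```
filter {
  grok {
    # Unanchored pattern: on lines that do not match, the engine retries the
    # match at every offset before giving up, which is where grok gets slow.
    match => { "message" => "%{COMMONAPACHELOG}" }
    # Anchored variant to compare against; anchoring usually fails fast:
    # match => { "message" => "^%{COMMONAPACHELOG}$" }
    tag_on_failure => ["_grokparsefailure"]
  }
}
```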
Annex 2 (Most common plugins)
As seen, the most popular plugins are:
Annex 3 (Most commonly used HA setups)
We're going to keep updating this as the benchmarking initiative gets going, including related issues, numbers achieved, etc. Contributions are more than welcome; feel free to report your feedback, ideas, etc. here.
Happy benchmarking!!
Related issues: #3477