elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Logstash 2.3.4 vs Logstash 1.4.2 #5682

Closed AGirin closed 7 years ago

AGirin commented 8 years ago

We are working on a project to update our ELK stack to the latest version. So far we have mixed results with Logstash. Basically, we are running multiple server-side Logstash instances. Most of them are 1.4.2 and one is 2.3.4. If you look at the graph below you can see that the 2.3.4 instance has extremely low processing volume. No matter what we do, it never changes. We use exactly the same configs across all Logstash servers, and they all read from Redis. Somehow it feels like the new Logstash 2.3.4 cannot get data from Redis fast enough. We reproduced the same behavior in Staging as well. Any ideas are appreciated.

image

We also had the same result with Logstash 2.3.3

untergeek commented 8 years ago

This has to do with some changes to the pipeline architecture. Please see our new performance tuning guide, written specifically to address these changes: https://www.elastic.co/guide/en/logstash/current/performance-troubleshooting.html

There's also a blog post on this subject coming soon.

AGirin commented 8 years ago

Thank you for the link. The problem is, we run this on the same hardware with the same settings applied across the board. When we stop the 1.4.2 instances, 2.3.4 starts processing all messages just fine. We bring 1.4.2 back up and the volume for 2.3.4 goes down. It's like 1.4.2 does not allow 2.3.4 to get to Redis somehow.

And all the metrics - CPU, memory, disk access, etc. - are within the green area on all servers. We tried using the -w and -b parameters with no luck - 1.4.2 just "blocks" 2.3.4 from Redis.

Also, when we run all servers at 2.3.4 we see Redis queuing constantly - like 2-100 messages always sit waiting to be processed. With 1.4.2 it never happened - Redis had 0 items in the queue. Again, I think there is something wrong with the Redis input in 2.3.x.

From what I read, 2.3.x is supposed to be so much faster, but the result is actually the opposite. We even compared the same configs on 1.4.2 and 2.3.4 while reading thousands of lines a second, and 2.3.4 did not outperform 1.4.2 at all. They were at roughly the same numbers. I am now questioning the direction Logstash is going - it appears to me there are a lot of incompatibilities/gotchas/bugs that stop anybody from moving forward with a 2.3.x installation.

We have more than 600 servers running Logstash 1.4.2 and ES 1.7.2, and we spend a lot of time patching issues/bugs. With 2.3.x a lot of those issues (like the throttle filter not supporting multithreading) are still not addressed, and we have to rely on the community to fix them.

It would be great if somebody from Elastic wrote a blog post on how to upgrade ELK from older versions to new ones and gave some performance comparison data - and I mean not a 2-3 server installation but a big one - to show how "easy" the upgrade process is going to be. At least it would give us an idea of what to expect from the new ELK. So far the results are disappointing.

untergeek commented 8 years ago

I see. You're trying to get 1.4.2 and 2.3.4 to both access Redis at the same time. The 2.3.4 method is totally different, and the newer plugin allows batching, where 1.4.2 is probably blocking due to the polling method (hence the unavailability you're seeing in 2.3.4).

You'll probably have to switch all at once, or somehow parallelize your Redis instances (use different keys, for example) to migrate.
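For instance (the key names below are purely illustrative), the shippers could write to two separate Redis lists so that each Logstash version consumes its own list during the migration:

# 1.4.2 instances keep draining the old list
input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash-old"
  }
}

# 2.3.4 instances drain a new list fed by already-migrated shippers
input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash-new"
  }
}

That way the two versions never compete for the same list key.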

untergeek commented 8 years ago

Also, please watch this informative YouTube video (from Logstash creator, Jordan Sissel himself) about the changes to the Pipeline Architecture.

AGirin commented 8 years ago

Thanks, that makes sense. How does that explain 2.3.4 not processing all items in the Redis queue?

Example:

1.4.2 Redis queue: image

2.3.4: image

We use the same script to get stats from Redis and actually the same Redis instance. With 2.3.4 there is almost always something in the queue. Like Logstash does not get all items from the list somehow.

untergeek commented 8 years ago

If you compare the code in your 1.4.2 setup (lib/inputs/redis.rb or something like that) vs. what is now available at https://github.com/logstash-plugins/logstash-input-redis you'll see that many changes have been made. Logstash is no longer blocking on read, and it batches differently. A queue of 10 to 100 is not huge, and Logstash reads in batches of 100 by default. This is a tunable parameter.
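For example, a minimal illustration of changing the read batch size on the input (host and key here are placeholders):

input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
    batch_count => 250   # events fetched per round trip to Redis; tune this rather than adding threads
  }
}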

AGirin commented 8 years ago

Thanks for the video and explanation. So am I getting it right?

2.3.4

Redis input - threads 10 - means that I still have 10 workers connecting to Redis and pushing events to the filter stage (threads = number of processors by default) --> output (Elasticsearch, default workers is 1 but we set it to 2).

untergeek commented 8 years ago

threads => 10 will spawn multiple "instances", as it were, of the Redis input plugin, each with the settings you specify. You may have better mileage from fewer threads and a higher batch_size, which according to the code defaults to 125, though the documentation says it is only 1.

-w is "worker threads", and this is shared across the pipeline. You can increase the workers => 2 to the same number as -w, if you only have the one output block. The total number of workers => # across all outputs (and each has 1 by default) should not exceed the number assigned to -w.

Also, you can increase --pipeline-batch-size, or -b. The default is 125, but that is per worker thread. This means that if you have -w 10, you will be buffering up to 1250 lines (125 per thread, each sent as a batch). Logstash will issue warnings if you go over 10,000 buffered lines, but will continue to work. It is recommended to increase your heap size if you're buffering larger batch sizes, especially as you approach 10,000 total buffered.
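As a purely illustrative sketch of how these settings relate on an 8-core box (the host and paths are placeholders):

# 8 pipeline workers, 250 events per worker batch
bin/logstash -w 8 -b 250 -f /etc/logstash/conf.d/

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    workers => 2   # sum of workers across all outputs should not exceed -w
  }
}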

jdlindu commented 8 years ago

You need to increase --pipeline-batch-size or --pipeline-workers. Details here: https://www.elastic.co/guide/en/logstash/current/pipeline.html

AGirin commented 8 years ago

Well - batch_size does not really work. And we did test that before. Here is an example:

image

B - before the change: 3 servers, each with 20 Redis threads (-b 500 -w 16 on 8-core systems). A - one server with 1 Redis thread and a batch_count of 5000. You can clearly see that it lags in throughput.

So again - nothing has really improved as far as the new Logstash goes. I did see the material about the new pipeline - so far I cannot prove to myself that it is any faster/better than the 1.4.2 pipeline. All tests show the opposite.

untergeek commented 8 years ago

Have you tested throughput without using the Elasticsearch output? One of the biggest changes since 1.4.2 is that the Elasticsearch output now retries unsuccessful writes. I repeat, 1.4.2 does not retry unsuccessful writes, but newer versions do. Given the volume you're trying to send through Redis, it may be that the bottleneck is the Elasticsearch output. If you have debug logging enabled (or perhaps even just --verbose), you should be able to see if the Elasticsearch output is retrying writes.

AGirin commented 8 years ago

Now I am really confused: I set batch_count to 1 and threads to 1. This is what I got:

image

You can see that with batch_count 10,000 the server was not processing a lot. Now with the new setting of 1/1 it processes a lot more than the other two servers (which have 20 Redis threads). Hmmm.... Any ideas?

AGirin commented 8 years ago

@untergeek - we process a LOT more than I post here. These are just test results, and I have full confidence that we get all data into ES without delay. If you see my previous post - threads = 1 and batch_count = 1 is performing much better than threads = 20 and batch_count = 1. Now this is a puzzle.

untergeek commented 8 years ago

If you see my previous post - thread =1 and 1 batc_size = 1 is performing much better than threads = 1 and batch_size = 1

Are they supposed to be different? Hammering the current Redis input with 20 separate threads may not be necessary. I will try to run some tests against a local redis today.

guyboertje commented 8 years ago

@AGirin - LS 1.4 does not directly compare with LS 2.3 in terms of what you need to know to get the best performance from LS.

When making such a big jump in versions you need to throw out the old knowledge and build anew.

First, starting with the output: by default the outputs are not shared; one output is used sequentially by the workers (-w N, or the number of cores by default). This means that only one worker is actually transmitting to ES at any given time; however, this can be tuned. Also, LS 2.x <-> ES 2.x may have higher round-trip latencies because ES waits for a confirmation from all shards before the bulk_index request completes.

Next the filters: LS 2.3 will run multiple worker threads - each thread will grab a batch of events from the synchronous queue and process the batch through the filters. When done filtering, the thread tries to acquire the output - it must wait if it can't. Please note the significance of the synchronous queue - an input thread and a worker thread MUST both be at the queue, ready to exchange an event (the input thread puts it in and a worker thread takes it out). If the input threads are doing something else (like talking to Redis) then the worker thread must wait. Conversely, if the worker thread is doing something else (like talking to ES or waiting to talk to ES) then the input thread/s must wait. Each worker's round-trip time, from getting the last event for one batch to returning for the first event of its next batch, is a function of A) how long the filters take to process the batch (unpredictable, because each batch contains different kinds of events); B) how long it waits until access is given to the output; and C) how long it takes to serialize the batch and transmit it to ES.

Next the Redis input. Setting more than one Redis "thread" in fact creates a complete duplicate input and thread. Each input thread will wait until it can sync up with a worker thread to hand over an event - please note this handover is done one event at a time. Each Redis input (thread) gets 125 docs from Redis per communication (this can be changed).

Redis, being single threaded, will fulfill one redis input thread request at a time and the LS synchronous queue will fulfill one redis input thread put request at a time. Your 20 redis input threads are in effect fighting each other for access on either side.

In order to properly tune your installation I suggest that you add the stdout output with the dots codec and pipe Logstash through the Linux pv utility: output { stdout { codec => dots } } plus | pv -Wart > /dev/null

Comment out the ES output, establish a baseline with one worker (-w 1) and one Redis thread, and leave the batch size at the default. Then adjust only one setting at a time.

Once you have a good idea of how the variables 1) worker count, 2) pipeline batch size, 3) Redis batch size, and 4) Redis input thread count interact with each other, you can uncomment the ES output and tune 5) the ES output worker count, 6) flush size, and 7) idle_flush_time.

If you can, try to send exactly the same set of events through redis for each tuning run. Be systematic and post your results here.
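As a rough illustration of that baseline run (the host, key, and file name are placeholders, and the ES output stays commented out until the baseline is established):

input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
    threads => 1
  }
}
# output { elasticsearch { hosts => ["localhost:9200"] } }
output { stdout { codec => dots } }

bin/logstash -w 1 -f baseline.conf | pv -Wart > /dev/null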

AGirin commented 8 years ago

Well - I am getting somewhere.

I removed all threads and batch_count settings from all configs on one Logstash server. Throughput went down and the other two servers were processing a lot more data (45%/45%/10%, the last being the new-config server).

From here I added just one config line to "new config" server - batch_count => 1. Now this server is processing a lot more than the other two:

image

So somehow batch_count is not documented correctly - I think it is 125 and not 1 as documentation states. And setting batch_count => 1 explicitly made things a lot better and more promising.

Also, I moved the config of that "new config" server back to the default number of workers and the default batch size (so the logstash init script does not have -w or -b parameters). With these settings the "new config" server is processing ~63% of all messages. The other 37% are processed by the other two servers with the old configs.

As a side note, I had to increase timeout => 300 (from the default of 5) because I started to see a lot of these messages after I changed batch_count => 1:

{:timestamp=>"2016-07-25T12:43:47.231000-0500", :message=>"Redis connection problem", :exception=>#, :level=>:warn}

Timeout 300 fixed the issue.
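For reference, the relevant part of the "new config" input now looks roughly like this (host and key are placeholders):

input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
    batch_count => 1
    timeout => 300   # raised from the default of 5 after the "Redis connection problem" warnings
  }
}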

untergeek commented 8 years ago

I ran some tests on my laptop (9-month-old MacBook Pro). Each test was against 1,000,000 lines. There is no grok or processing of any kind. The output was measured with stdout { codec => dots } piped through pv -Wr > /dev/null

The following tests were run against Logstash 2.3.4 with Redis 3.0.6. Subsequent tests with 1.4.2 are in progress.

Testing Redis input (batch_count and threads changed for each test):

input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "redis_test"
    batch_count => 250
    threads => 8
  }
}
output { stdout { codec => dots } }

Testing Redis output (batch, batch_events, and workers changed each test):

input {
  generator {
    message => "This is a rather long message that is only a big string."
    count => 1000000
  }
}
output {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "redis_test"
    workers => 8
    batch => true
    batch_events => 250
  }
  stdout { codec => dots }
}

Thread Tuning only:

No tweaking of -b, batch_size, or batch_count was done in these tests. batch is false (the default) for the Redis output in this series of tests.

Batch Tuning - 125:

Note that this is the Logstash default batch size (no -b 125 necessary). Only the batch_size, batch_count and workers/threads were changed.

Batch Tuning - 250 (using the command-line flag -b 250):

Notice that larger batch sizes are actually a regression for the Redis input, and that a batch size of 125 is a slight performance increase over the input's default batch size. For outputs, increasing the batch size does improve performance, but the performance difference between a batch size of 125 and 250 is negligible.

I'm not sure which version of Redis you're using, but there is a PR for forcing a batch count of 1 when using older versions of Redis.

AGirin commented 8 years ago

@untergeek - I mistyped and corrected. Sorry about that.

If you see my previous post - threads = 1 and batch_count = 1 is performing much better than threads = 20 and batch_count = 1. Now this is a puzzle.

AGirin commented 8 years ago

@untergeek and @guyboertje - Thanks a lot for such valuable input! The threads setting came from our Logstash 1.4.2 installation, since that was the only way we could make our setup work and process all those millions of messages.

It looks like with 2.3.4 Redis threads are not needed anymore, but batch_count looks to be the culprit. At least in my tests, setting it to 1 (with everything else at the default) makes 2.3.4 a lot faster, and it beats our old configs by far.

I will continue testing and let you know my findings.

AGirin commented 8 years ago

@untergeek - we are using Redis 2.8.11

What Redis version do you guys consider old? 2.8.x? I know they have 3.x now.

untergeek commented 8 years ago

@AGirin If you look at my test output above, a batch_count of 125 shows a slight improvement over 1. I tested 125 because the current default is 125:

-b, --pipeline-batch-size SIZE Size of batches the pipeline is to work in.
                                   (default: 125)

I would recommend testing in between 1 and 125 to see if there's a peak and a fall off in between.

AGirin commented 8 years ago

@untergeek - I want to make sure it is clear - when I was saying threads I meant threads in Redis configuration (threads => 1) and NOT -b 1. In case somebody else is following this thread.

So in my tests I had -b set to 125 (default) and in Redis config I had only batch_count => 1 (using default threads number of 1).

AGirin commented 8 years ago

@untergeek - Were you changing -w in the logstash init script? Or by threads do you mean threads => xxx in the Logstash Redis input config?

untergeek commented 8 years ago

Logstash tries to detect the number of cores and sets -w to that number by default. In other words, I was always at -w 8. That said, my configuration was either changing threads or workers (depending on whether it was an input or output plugin, respectively) to a higher number.

guyboertje commented 8 years ago

@AGirin Redis 2.8.11 is fine.

ghost commented 8 years ago

Is my understanding correct? There's a lot of great information in this discussion and I want to make sure I got it right. @guyboertje

There are three distinct stages in the pipeline - input, filter, and output.

Usually, there is only one input thread per input. The threads => setting in the redis input will in fact run multiple input threads in parallel. This has nothing to do with the pipeline workers.

The -w setting comes into play in the filtering and output process. Say, if I have 10 workers but only one input thread, the workers somehow have to sync with that one input thread.

Lastly, only one worker can output to Elasticsearch at a time, which means that while output is being done by a single worker, it blocks the others.

guyboertje commented 8 years ago

@AxFx Yes. Correct. Tuning for maximum throughput is hard, because ideally you want an input thread to be arriving at the synch queue at about the same time as a worker thread does. The loop time of the input thread depends on the input data source latency and the size of the discrete event or data chunks. The loop time of a worker thread is dependent on filter complexity and output latency.

AGirin commented 8 years ago

I ran a lot of tests and the final is:

  1. Use multiple Redis threads for heavy data sources. One thread cannot process the amount of information we have. On the good side, we removed Redis threads from most of the Logstash server configs and enabled multiple threads only on some of them.
  2. In my tests -b 500 looks to be the winner. Anything above that was not giving much improvement, and at some point it starts hurting the system. The difference between -b 125 and -b 500 in our environment is ~5% more throughput. (A rough sketch of the resulting setup follows this list.)
  3. One troubling thing is this: https://github.com/logstash-plugins/logstash-filter-throttle/pull/5 We use the throttle filter a lot, and with 1.4.2 we had to use a custom fix to make it work with -w > 1. With 2.3.4 there is also a custom fix by @frapex, but what we see is that it crashes Logstash under very heavy throttling conditions. So far there is almost no activity on the throttle filter, and what is included with Logstash is NOT thread safe, so -w > 1 is not allowed, which in turn kills throughput to an unacceptable level. Not sure how the throttle filter became community supported (I believe it was not before), but it would be great if this is addressed in the near future. We are committed to the ELK stack, have a very large installation of it, and are willing to test any patches/fixes that might become available.
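As promised above, a rough sketch of the setup we settled on for the heavy data sources (host, key, and thread count are only indicative):

input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
    threads => 4   # multiple input threads only for the heavy sources
  }
}

bin/logstash -b 500 -f /etc/logstash/conf.d/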

Thanks again for all the great info!

guyboertje commented 8 years ago

@AGirin - reading back through the very earliest comments - you questioned whether 2.3.x is better than 1.4.2 for your case.

It would be really helpful for future readers if you could write an update, provided that you did find that 2.3.4 is better (given the correct tuning, of course). I don't feel that you need to include tons of detail.

AGirin commented 8 years ago

@guyboertje - I will.

guyboertje commented 7 years ago

Closing - no feedback (v2 must be better than v1.4)