elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash
Other
14.2k stars 3.5k forks source link

[meta] kv and grok filters timeout handling performance regression #11302

Open colinsurprenant opened 4 years ago

colinsurprenant commented 4 years ago

We have recently discovered a substantial performance regression introduced by #10976 related to the use of the JRuby Timeout.timeout which surfaces in both the kv and grok filters when enabling timeouts (which are enabled by default). Some information on this regression is available in logstash-plugins/logstash-filter-grok#152.

The performance regression affects the kv filter starting at v4.4.0 and the grok filter starting at v4.1.0. The performance regression is compounded when using multiple fallback patterns in grok in particular.

The best temporary solution is to disable the timeout handling by setting timeout_millis => 0 in both kv and grok. Downgrading to a previous version is not recommended because these were producing random crashes in logstash. We understand that this workaround is not ideal as it can also result in pathological patterns to execute for an unbounded period. Please read our guide about grok patterns https://www.elastic.co/blog/do-you-grok-grok which provides tips on how to avoid such slow or pathological patterns.

We are actively into solving this issue, we will report back with progress.

So far we believe the problem is with the JRuby Timeout.timeout implementation. @jsvd found the following:

I believe I found a way to replicate the slowdown. This is due to jruby using an [executor sized by number of processors]?(https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/ext/timeout/Timeout.java#L64) to kill threads (or just yield) when the thread finished before the timeout.

require 'timeout'

NUM_THREADS = java.lang.Runtime.getRuntime.availableProcessors() * 5

threads = NUM_THREADS.times.map do |tnum|
  Thread.new do
    loop do
      Timeout.timeout(0.1) { 1 } rescue Timeout::Error
    end
  end
end

threads.each(&:join)
jstack 58640 | grep Timeout.yieldWithTimeout | wc -l
40

Example thread stacktrace:

"Ruby-0-Thread-34: script.rb:1" #56 daemon prio=5 os_prio=31 cpu=1262.47ms elapsed=27.15s tid=0x00007fd108819000 nid=0x13103 waiting on condition  [0x000070001312f000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@11.0.1/Native Method)
    - parking to wait for  <0x0000000700b72e88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(java.base@11.0.1/LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.1/AbstractQueuedSynchronizer.java:885)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.1/AbstractQueuedSynchronizer.java:917)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.1/AbstractQueuedSynchronizer.java:1240)
    at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.1/ReentrantLock.java:267)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.remove(java.base@11.0.1/ScheduledThreadPoolExecutor.java:1046)
    at java.util.concurrent.ThreadPoolExecutor.remove(java.base@11.0.1/ThreadPoolExecutor.java:1751)
    at org.jruby.ext.timeout.Timeout.killTimeoutThread(Timeout.java:192)
    at org.jruby.ext.timeout.Timeout.yieldWithTimeout(Timeout.java:129)
    at org.jruby.ext.timeout.Timeout.timeout(Timeout.java:99)
    at org.jruby.ext.timeout.Timeout.timeout(Timeout.java:75)
    at org.jruby.ext.timeout.Timeout.timeout(Timeout.java:70)
    at org.jruby.ext.timeout.Timeout$INVOKER$s$timeout.call(Timeout$INVOKER$s$timeout.gen)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:182)
    at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:191)
    at script.invokeOther3:timeout(script.rb:8)
    at script.RUBY$block$\=script\,rb$2(script.rb:8)
colinsurprenant commented 4 years ago

In progress PR for grouped timeouts https://github.com/logstash-plugins/logstash-filter-grok/pull/153 which should help mitigate the problem.

colinsurprenant commented 4 years ago

not that v4.2.0 of the grok filter was published with the new timeout_scope option.