driskell / log-courier

The Log Courier Suite is a set of lightweight tools created to ship and process log files speedily and securely, with low resource usage, to Elasticsearch or Logstash instances.

log-courier and logstash > 7.4.0? #370

Closed sysmonk closed 4 years ago

sysmonk commented 4 years ago

Hi there!

Wanted to know: does anyone still use log-courier? Each time I want to upgrade, something is broken, which makes it sound like nobody is actually using it :(

I'm running Logstash 7.3.2 now and wanted to upgrade to the latest 7.5.1. When doing that, I don't get any errors, but it just stops working. I've got a bare-bones config like:

input {
  courier {
    port            => 9001
    ssl_certificate => "/home/logstash/config/log-courier.crt"
    ssl_key         => "/home/logstash/config/log-courier.key"
  }
}
output {
  stdout { }
}

My log-courier connects to Logstash, and sometimes it sends data without issues (it can be thousands of lines, with 256 in each batch), but sometimes it gets stuck right away:

Jan 21 14:30:23 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Acknowledgement received for payload ef04a54d4990203b3975327a7413684d sequence 256
Jan 21 14:30:23 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Average latency per event: 0.24 ms
Jan 21 14:30:23 logstash5 log-courier[28003]: Timeout timer reset - due at 2020-01-21 14:31:23.578455781 +0000 UTC
Jan 21 14:30:23 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Sending payload 3541ab8cc0cdfeea609e7a46288af69e (256 events)
Jan 21 14:30:23 logstash5 log-courier[28003]: Registrar received offsets for 256 log entries
Jan 21 14:30:23 logstash5 log-courier[28003]: Maximum pending payloads of 10 reached, holding 256 new events
Jan 21 14:30:23 logstash5 log-courier[28003]: Spooler flushing 256 events due to spool size reached
Jan 21 14:30:23 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Acknowledgement received for payload 1b65490503788f115cdfbc047c72c34c sequence 256
Jan 21 14:30:23 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Average latency per event: 0.24 ms
Jan 21 14:30:23 logstash5 log-courier[28003]: Timeout timer reset - due at 2020-01-21 14:31:23.639550107 +0000 UTC
Jan 21 14:30:23 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Sending payload 800d158b8f1a171bfb31143f1973fbc9 (256 events)
Jan 21 14:30:23 logstash5 log-courier[28003]: Registrar received offsets for 256 log entries
Jan 21 14:30:23 logstash5 log-courier[28003]: Maximum pending payloads of 10 reached, holding 256 new events
Jan 21 14:30:23 logstash5 log-courier[28003]: Spooler flushing 256 events due to spool size reached
Jan 21 14:30:28 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Acknowledgement received for payload f3c091c13568bf08ead707af9a40775f sequence 56
Jan 21 14:30:28 logstash5 log-courier[28003]: Timeout timer reset - due at 2020-01-21 14:31:28.732341931 +0000 UTC
Jan 21 14:30:28 logstash5 log-courier[28003]: Registrar received offsets for 56 log entries
Jan 21 14:30:33 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Acknowledgement received for payload f3c091c13568bf08ead707af9a40775f sequence 56
Jan 21 14:30:33 logstash5 log-courier[28003]: Timeout timer reset - due at 2020-01-21 14:31:33.735144834 +0000 UTC
Jan 21 14:30:38 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Acknowledgement received for payload f3c091c13568bf08ead707af9a40775f sequence 56
Jan 21 14:30:38 logstash5 log-courier[28003]: Timeout timer reset - due at 2020-01-21 14:31:38.736540276 +0000 UTC
Jan 21 14:30:43 logstash5 log-courier[28003]: [logstash5.XXX:YYY] Acknowledgement received for payload f3c091c13568bf08ead707af9a40775f sequence 56
Jan 21 14:30:43 logstash5 log-courier[28003]: Timeout timer reset - due at 2020-01-21 14:31:43.737618675 +0000 UTC

The Logstash server is idling during that time and only has this in its logs (with debug enabled):

[2020-01-21T14:31:48,559][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2020-01-21T14:31:48,757][DEBUG][logstash.inputs.courier  ] Partially acknowledging message {:nonce=>"f3c091c13568bf08ead707af9a40775f", :sequence=>56}
[2020-01-21T14:31:51,479][DEBUG][logstash.pipeline        ] Pushing flush onto pipeline {:pipeline_id=>"logs", :thread=>"#<Thread:0x35483590 sleep>"}
[2020-01-21T14:31:53,562][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2020-01-21T14:31:53,562][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2020-01-21T14:31:53,758][DEBUG][logstash.inputs.courier  ] Partially acknowledging message {:nonce=>"f3c091c13568bf08ead707af9a40775f", :sequence=>56}
[2020-01-21T14:31:56,479][DEBUG][logstash.pipeline        ] Pushing flush onto pipeline {:pipeline_id=>"logs", :thread=>"#<Thread:0x35483590 sleep>"}
[2020-01-21T14:31:58,566][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2020-01-21T14:31:58,566][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2020-01-21T14:31:58,759][DEBUG][logstash.inputs.courier  ] Partially acknowledging message {:nonce=>"f3c091c13568bf08ead707af9a40775f", :sequence=>56}
[2020-01-21T14:32:01,479][DEBUG][logstash.pipeline        ] Pushing flush onto pipeline {:pipeline_id=>"logs", :thread=>"#<Thread:0x35483590 sleep>"}
[2020-01-21T14:32:03,571][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2020-01-21T14:32:03,571][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}

Any ideas? :(

P.S. I tried all versions between 7.4.0 and 7.5.1; none of them work. When downgrading to 7.3.2, everything works.

driskell commented 4 years ago

Hi

I'll see if I can test this version locally soon. I have a pending upgrade of my own that isn't planned yet, so I'll end up here soon anyway.

Just to provide some information: when you see partial acknowledgements with the sequence unchanged, that usually means the Logstash pipeline is stuck and not accepting any more events. Since it looks easily reproducible, I should be able to sort it out soon.

I'm still maintaining the project with regard to bugs and compatibility, just with no commitment to any new features. I am still periodically experimenting with new things, but as things stand they're unlikely to ever be production ready due to other personal commitments.

Jason

sysmonk commented 4 years ago

Thanks, I understand the pipeline is stuck on the Logstash side; I'm just saying it's probably related to the logstash-input-courier plugin.

Hope you can reproduce it!

driskell commented 4 years ago

Running Logstash 7.6 with the following config, I was seeing 5000+ events per second on my laptop. I'll try the stdout method now. When you use stdout {}, are you seeing events appear in the output?

This is what I normally verify with:

input {
  courier {
    port => 12345
    ssl_certificate => '../selfsigned.crt'
    ssl_key => '../selfsigned.key'
  }
}
filter {
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}
output {
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  }
}

driskell commented 4 years ago

With stdout {}, 7.6 works too. Downloading 7.5.1 to test with that.

driskell commented 4 years ago

It would be worth ensuring the logstash-input-courier plugin is up to date, as I think I did a bug fix release previously, though that was related to failed connections rather than any fundamental issue.
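
For reference, assuming a standard Logstash install, something along these lines (run from the Logstash home directory) should show the installed plugin version and update it:

# Show the installed version of the courier input plugin
bin/logstash-plugin list --verbose logstash-input-courier
# Update it to the latest published release
bin/logstash-plugin update logstash-input-courier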

driskell commented 4 years ago

I've reproduced the issue with 7.5.1. After a certain amount of time outputting logs it just stops. Looking into it now.

driskell commented 4 years ago

OK, so I discovered it was my event_queue in the logstash-input-courier/log-courier Ruby gems. It uses ConditionVariables to coordinate between the connections pushing events in and the single reader pulling them out to give to Logstash as one stream.
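
Roughly, the pattern looks like the sketch below. This is a simplified illustration only, not the plugin's actual event_queue code, and the class and method names are made up for the example:

# Minimal sketch of a ConditionVariable-based producer/consumer queue.
# Connection threads push events in; a single reader thread pulls them out
# to hand to Logstash. This is the kind of wait/signal pattern that the
# JRuby 9.2.8.0 ConditionVariable bug can leave stuck.
require 'thread'

class EventQueue
  def initialize(max_size)
    @items     = []
    @max_size  = max_size
    @mutex     = Mutex.new
    @not_empty = ConditionVariable.new
    @not_full  = ConditionVariable.new
  end

  # Called from connection threads: block while the queue is full
  def push(event)
    @mutex.synchronize do
      @not_full.wait(@mutex) while @items.size >= @max_size
      @items << event
      @not_empty.signal
    end
  end

  # Called from the single reader thread: block while the queue is empty
  def pop
    @mutex.synchronize do
      @not_empty.wait(@mutex) while @items.empty?
      event = @items.shift
      @not_full.signal
      event
    end
  end
end

# Example usage: one producer thread, one consumer thread
queue    = EventQueue.new(10)
producer = Thread.new { 100.times { |i| queue.push("event #{i}") } }
consumer = Thread.new { 100.times { queue.pop } }
[producer, consumer].each(&:join)

If the wait/signal handoff misbehaves, the reader can stay asleep even though events are queued, which matches the stalled acknowledgements above.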

Logstash 7.4.0+ is likely using JRuby 9.2.8.0, which has a bug in ConditionVariable that causes exactly this behaviour. I've confirmed that 7.5.1 definitely ships the buggy 9.2.8.0, and that the fix is available in 7.6.0 as they updated to 9.2.9.0. Testing against 7.6.0, it works fine.

Upgrading to 7.6.0 should resolve the issue. It's not something fixable here; it's an issue in the underlying runtime.

For reference: https://github.com/jruby/jruby/issues/5863

driskell commented 4 years ago

The Logstash 7.4.0 release notes include the update to the buggy JRuby release: https://www.elastic.co/guide/en/logstash/current/logstash-7-4-0.html

Update jruby to 9.2.8.0

The 7.6.0 release notes include the update that fixes it: https://www.elastic.co/guide/en/logstash/current/logstash-7-6-0.html

Updated jruby to 9.2.9.0

JRuby 9.2.9.0 release notes: https://www.jruby.org/2019/10/30/jruby-9-2-9-0.html

Thread-safety Improvements
Lock interruptibility broken in JRuby 9.2.8 has now been fixed. (#5863, #5875, #5942)

sysmonk commented 4 years ago

Ah, thanks, I hadn't tried 7.6.0 as it wasn't released yet when I looked into this issue. Will try it again today.

driskell commented 4 years ago

If you have any problems, just drop them into a new issue. Things are fresh in my mind now, and I've got 7.6 in my test folder, so I should be able to pick up any issue much quicker this time around.