fluent-plugins-nursery / fluent-plugin-systemd

This is a fluentd input plugin. It reads logs from the systemd journal.
Apache License 2.0
153 stars 43 forks source link

Corrupted journal can still cause crashes #70

Closed negz closed 3 years ago

negz commented 6 years ago

Hello,

We noticed today that our fluentd was stuck in a crash loop due to a corrupted journal:

2018-08-10 13:56:42 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:00:24 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:02:41 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:04:26 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:07:33 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:09:20 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:19:15 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:24:31 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:29:28 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:39:27 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:44:28 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 14:59:15 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 15:09:25 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 15:21:36 +0000 [warn]: Error reading from Journal: Systemd::JournalError: Bad message
2018-08-10 15:29:15 +0000 [error]: Unexpected error raised. Stopping the timer. title=:in_systemd_emit_worker error_class=Systemd::JournalError error="Bad message"
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/systemd-journal-1.3.2/lib/systemd/journal/navigable.rb:44:in `move_next'
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/fluent-plugin-systemd-1.0.1/lib/fluent/plugin/in_systemd.rb:131:in `watch'
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/fluent-plugin-systemd-1.0.1/lib/fluent/plugin/in_systemd.rb:109:in `run'
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/fluentd-0.14.25/lib/fluent/plugin_helper/timer.rb:77:in `on_timer'
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/fluentd-0.14.25/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
  2018-08-10 15:29:15 +0000 [error]: /var/lib/gems/2.3.0/gems/fluentd-0.14.25/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2018-08-10 15:29:15 +0000 [error]: Timer detached. title=:in_systemd_emit_worker
2018-08-10 15:37:11 +0000 [info]: fluentd worker is now stopping worker=0
2018-08-10 15:37:11 +0000 [info]: shutting down fluentd worker worker=0
2018-08-10 15:37:11 +0000 [info]: shutting down input plugin type=:systemd plugin_id="object:3f90cf00897c"
2018-08-10 15:37:11 +0000 [info]: shutting down input plugin type=:systemd plugin_id="object:3f90d088561c"
2018-08-10 15:37:11 +0000 [info]: shutting down input plugin type=:systemd plugin_id="object:3f90cf114d5c"
2018-08-10 15:37:11 +0000 [info]: shutting down output plugin type=:rewrite_tag_filter plugin_id="object:3f90d2bb4f5c"
2018-08-10 15:37:11 +0000 [info]: shutting down output plugin type=:google_cloud plugin_id="object:3f90cf07dc40"
2018-08-10 15:37:11 +0000 [info]: shutting down filter plugin type=:kubernetes_metadata plugin_id="object:3f90d08c567c"
2018-08-10 15:37:11 +0000 [info]: shutting down filter plugin type=:record_transformer plugin_id="object:3f90d09b3cf0"
2018-08-10 15:37:11 +0000 [info]: shutting down filter plugin type=:parser plugin_id="object:3f90ce8b1804"
2018-08-10 15:37:11 +0000 [info]: shutting down output plugin type=:null plugin_id="object:3f90d27cbc38"
2018-08-10 15:37:11 +0000 [info]: shutting down output plugin type=:null plugin_id="object:3f90d2baa3a4
$ journalctl --verify
# ...many files redacted
PASS: /var/log/journal/2889e3617ecbc9faa794ad67c5d2e065/user-1000@0da0f119c85e4374a2e8e71427bc7fa0-0000000009ff5577-000573171be6b567.journal
75eff50: Invalid data object at hash entry 5544 of 233016                 
File corruption detected at /var/log/journal/2889e3617ecbc9faa794ad67c5d2e065/system.journal:75ed1b0 (of 125829120 bytes, 98%).
FAIL: /var/log/journal/2889e3617ecbc9faa794ad67c5d2e065/system.journal (Bad message)
PASS: /var/log/journal/2889e3617ecbc9faa794ad67c5d2e065/system@36b99449ae7042c2904d3eee68a72cf7-0000000009d630aa-000573152a476352.journal

I see that #16 fixed an issue like this - my guess is that there's another code path that can be affected by this issue but that does not catch the error. We're running version 1.0.1 with the following other moving pieces:

gem 'fluentd', '~>0.14.24'
gem 'fluent-plugin-systemd', '~>1.0.1'
gem 'fluent-plugin-google-cloud', '~>0.6.21'
gem 'fluent-plugin-kubernetes_metadata_filter', '~>2.1.2'
gem 'fluent-plugin-rewrite-tag-filter', '~>2.0.2'
JoshuaOliphant commented 5 years ago

This issue is happening for me too. In fluentd logs:

2019-05-30 01:52:12.729014141 +0000 fluent.error: {"title":"in_systemd_emit_worker","error":"#<Systemd::JournalError: Bad message>","message":"[journald] Unexpected error raised. Stopping the timer. title=:in_systemd_emit_worker error_class=Systemd::JournalError error=\"Bad message\""}

and

2019-05-30 01:52:12 +0000 [error]: #0 [journald] Unexpected error raised. Stopping the timer. title=:in_systemd_emit_worker error_class=Systemd::JournalError error="Bad message

Received these 2 logs a single time. The logs on this particular kubernetes pod stopped flowing. I deleted the pod, it came back up, and then the logs started flowing again.

I ran journalctl --verify from the node where the pod was running, but also nodes there the pod was NOT running, and saw the same FAIL messages. They came and went in both cases, so I don't think that was important.

mariusgrigoriu commented 5 years ago

Here's the full log related to this

2019-05-31 16:52:15 +0000 [warn]: #0 [journald] Error reading from Journal: Systemd::JournalError: Bad message
2019-05-31 16:52:15 +0000 [error]: #0 [journald] Unexpected error raised. Stopping the timer. title=:in_systemd_emit_worker error_class=Systemd::JournalError error="Bad message"
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/systemd-journal-1.3.3/lib/systemd/journal/navigable.rb:44:in `move_next'
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/fluent-plugin-systemd-1.0.2/lib/fluent/plugin/in_systemd.rb:131:in `watch'
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/fluent-plugin-systemd-1.0.2/lib/fluent/plugin/in_systemd.rb:109:in `run'
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/fluentd-1.4.0/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/fluentd-1.4.0/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2019-05-31 16:52:15 +0000 [error]: #0 /usr/local/bundle/gems/fluentd-1.4.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2019-05-31 16:52:15 +0000 [error]: #0 [journald] Timer detached. title=:in_systemd_emit_worker
2019-05-31 16:52:15.913211850 +0000 fluent.warn: {"message":"[journald] Error reading from Journal: Systemd::JournalError: Bad message"}
2019-05-31 16:52:15.916827918 +0000 fluent.error: {"title":"in_systemd_emit_worker","error":"#<Systemd::JournalError: Bad message>","message":"[journald] Unexpected error raised. Stopping the timer. title=:in_systemd_emit_worker error_class=Systemd::JournalError error=\"Bad message\""}
2019-05-31 16:52:15.917048234 +0000 fluent.error: {"title":"in_systemd_emit_worker","message":"[journald] Timer detached. title=:in_systemd_emit_worker"}
errm commented 3 years ago

Thanks to the contribution of @mariusgrigoriu this should now be fixed, a new version has been released https://github.com/fluent-plugin-systemd/fluent-plugin-systemd/releases/tag/v1.0.3