fluent-plugins-nursery / fluent-plugin-systemd

This is a fluentd input plugin. It reads logs from the systemd journal.
Apache License 2.0
153 stars 43 forks source link

Call get_fd() to avoid missing file rotations #65

Closed portante closed 6 years ago

portante commented 6 years ago

See https://github.com/systemd/systemd/issues/7998 and https://github.com/rsyslog/rsyslog/pull/2437 for more information and background.

errm commented 6 years ago

Thanks for the patch ...

Could you explain the issue you are seeing? Is there a way we can add a test to exercise the behaviour? I am not a big fan of merging bugfixes without a test or at the very least a detailed comment to explain the issue ... not pinning this down is a recipe for a regression later ...

Also looking at CI this breaks many tests ... I haven't looked closely, but it does seem that this change does indeed break the build.

I would also like to document the issue upstream with https://github.com/ledbettj/systemd-journal if we can

portante commented 6 years ago

@errm, here is the comment that describes what is happening in https://github.com/systemd/systemd/issues/7998#issuecomment-360895127.

If a journald client does not immediately call sd_journal_get_fd() (which sets up the internal-to-journal API inotify FD for all open journal files), any journal file rotation that happens before the first sd_journal_*() API that sets up the inotify FD will result in "leaked" journal files, marked for delete but still held open by clients. In certain configurations, this can lead to disk space problems on the volume where those files live (/run/log or /var/log).

To test this one has to:

  1. start a journald client running (which does not immediately call sd_journal_get_fd())
  2. start a load on the journal to cause journal file rotation
  3. use lsof -p <client pid> | grep -E "/(run|var)/log" | grep deleted to observe the deleted journal files
  4. stop the load on the journal
  5. observe that the deleted journal files are still there over time

With the fix in place, one will still periodically observe deleted journal files, but over time they'll be cleaned up.

Regarding the existing unit test failures, it seems odd that making such a call changes the behavior that way. It is likely causing journal to ignore the corrupt journal fixture, and that we'll need to "corrupt" them another way.

Regarding upstream, filing an issue is fine, but not sure how much that code base should change if the journald APIs haven't changed (they did fix journald to close and vacate journal files, see https://github.com/systemd/systemd/pull/8144).

errm commented 6 years ago

The error occurs because you are calling a method #get_fd that does not exist ...

#<NoMethodError: undefined method `get_fd' for #<Systemd::Journal:0x0000000273d690>>

In order to ensure we call the native get_fd we have two options ...

@ledbettj do you have a view about what we should do here?

ledbettj commented 6 years ago

I don't see any issue with exposing file_descriptor, I think I marked it private only because I didn't think there was any external use for it.

If there is an auto-magical fix for this that we can make we can do that as well. The linked issue is supposedly resolved by https://github.com/systemd/systemd/pull/8144 -- so is this not a problem in newer versions of libsystemd?

portante commented 6 years ago

@ledbettj, newer versions of systemd still leak FDs, but they won't take up diskspace on systems that let you "vacuum" out the contents of a file on disk. It is not a complete fix for all environments, unfortunately. By always calling sd_journal_get_fd() immediately following a sd_journal_open() gives the inotify setup a chance to avoid loosing track of a file rotation (there is still a small window between the open and get_fd calls, though).

Hiding this in the open method might be the best way to avoid having to force clients to perform this sequence properly.

reevoo-samuel commented 6 years ago

Er, hello?

This Pull Request is too damn old! Merge or close this, sucka.

errm commented 6 years ago

I opened a PR upstream that I think should solve these issues https://github.com/ledbettj/systemd-journal/pull/78

I think it will also mean that I can clean up a lot of the mess around dealing with rotations ...

errm commented 6 years ago

Thanks for your help @portante I just pushed 1.0.1 to rubygems ...