Closed portante closed 6 years ago
Thanks for the patch ...
Could you explain the issue you are seeing? Is there a way we can add a test to exercise the behaviour? I am not a big fan of merging bugfixes without a test or at the very least a detailed comment to explain the issue ... not pinning this down is a recipe for a regression later ...
Also looking at CI this breaks many tests ... I haven't looked closely, but it does seem that this change does indeed break the build.
I would also like to document the issue upstream with https://github.com/ledbettj/systemd-journal if we can
@errm, here is the comment that describes what is happening in https://github.com/systemd/systemd/issues/7998#issuecomment-360895127.
If a journald client does not immediately call sd_journal_get_fd()
(which sets up the internal-to-journal API inotify FD for all open journal files), any journal file rotation that happens before the first sd_journal_*()
API that sets up the inotify FD will result in "leaked" journal files, marked for delete but still held open by clients. In certain configurations, this can lead to disk space problems on the volume where those files live (/run/log
or /var/log
).
To test this one has to:
sd_journal_get_fd()
)lsof -p <client pid> | grep -E "/(run|var)/log" | grep deleted
to observe the deleted journal filesWith the fix in place, one will still periodically observe deleted journal files, but over time they'll be cleaned up.
Regarding the existing unit test failures, it seems odd that making such a call changes the behavior that way. It is likely causing journal to ignore the corrupt journal fixture, and that we'll need to "corrupt" them another way.
Regarding upstream, filing an issue is fine, but not sure how much that code base should change if the journald APIs haven't changed (they did fix journald to close and vacate journal files, see https://github.com/systemd/systemd/pull/8144).
The error occurs because you are calling a method #get_fd
that does not exist ...
#<NoMethodError: undefined method `get_fd' for #<Systemd::Journal:0x0000000273d690>>
In order to ensure we call the native get_fd we have two options ...
@journal.send(:file_descriptor)
- Not a fan of relying on a private method here though@jounal.wait(0)
on the following line to @jounal.wait(0, select: true)
@ledbettj do you have a view about what we should do here?
get_fd
file_descriptor
from waitable part of the public api of systemd-journal ...I don't see any issue with exposing file_descriptor
, I think I marked it private only because I didn't think there was any external use for it.
If there is an auto-magical fix for this that we can make we can do that as well. The linked issue is supposedly resolved by https://github.com/systemd/systemd/pull/8144 -- so is this not a problem in newer versions of libsystemd?
@ledbettj, newer versions of systemd still leak FDs, but they won't take up diskspace on systems that let you "vacuum" out the contents of a file on disk. It is not a complete fix for all environments, unfortunately. By always calling sd_journal_get_fd()
immediately following a sd_journal_open()
gives the inotify setup a chance to avoid loosing track of a file rotation (there is still a small window between the open and get_fd calls, though).
Hiding this in the open method might be the best way to avoid having to force clients to perform this sequence properly.
Er, hello?
This Pull Request is too damn old! Merge or close this, sucka.
I opened a PR upstream that I think should solve these issues https://github.com/ledbettj/systemd-journal/pull/78
I think it will also mean that I can clean up a lot of the mess around dealing with rotations ...
Thanks for your help @portante I just pushed 1.0.1 to rubygems ...
See https://github.com/systemd/systemd/issues/7998 and https://github.com/rsyslog/rsyslog/pull/2437 for more information and background.