fluent-plugins-nursery / fluent-plugin-systemd

This is a fluentd input plugin. It reads logs from the systemd journal.
Apache License 2.0
153 stars 43 forks source link

High CPU usage with v0.0.7 #32

Closed timstclair closed 7 years ago

timstclair commented 7 years ago

We observed a huge increase in CPU usage between version v0.0.6 and v0.0.7 using the following config:

    # Logs from systemd-journal for interesting services.
    <source>
      type systemd
      filters [{ "_SYSTEMD_UNIT": "docker.service" }]
      pos_file /var/log/gcp-journald-docker.pos
      read_from_head true
      tag docker
    </source>
    <source>
      type systemd
      filters [{ "_SYSTEMD_UNIT": "kubelet.service" }]
      pos_file /var/log/gcp-journald-kubelet.pos
      read_from_head true
      tag kubelet
    </source>

It looks like there was only a single commit between those versions: https://github.com/reevoo/fluent-plugin-systemd/commit/bcbd53ed42b85df5e8ec9b42410bf43a4d0c2683

See https://github.com/kubernetes/kubernetes/issues/42515 for more background on how the issue manifested in our setup.

I'm happy to provide more details or repro steps if it would help.

errm commented 7 years ago

Thanks for the report @timstclair I will take a look and hopefully get a fix out in a few hours

msonnabaum commented 7 years ago

Also found this today. Looking at bbf9333, wouldn't journal.wait(1_000_000) work just as well as the sleep 1?

errm commented 7 years ago

It would, but the issue is we are now monitoring the responses to journal.wait in order to get notifications of when the journal has been rotated. I spent a few hours trying to work out a solution that would:

a) fix the issue b) not cause the journal rotation bug to re-occur c) not cause the automated tests to become much more complicated

As we are using the timer helper in the mainline branch and don't have to deal with this stuff directly this seems like the simplest solution for now, that didn't involve huge changes to the code / tests.

Feel free to have a go at doing something better though, I am happy to accept a contribution :)

errm commented 7 years ago

@timstclair I have just pushed version 0.0.8, it would be great if you would able to test this version in your environment?

timstclair commented 7 years ago

I confirmed I'm not seeing the issue with 0.0.8. Thanks!

arianvp commented 7 years ago

I am experiencing this issue with 0.3.0. However, we do exports hundreds of log entries per second. Syslog exporter seems to cope just fine. journald-upload (which we are using currently for log aggregation) doesnt mind either. but this plugin ends up using 100% CPU

arianvp commented 7 years ago

I did a strace in one of our live systems. and 99% of the time it's calling the futex syscall

errm commented 7 years ago

Hi @arianvp this is not the same bug since we replaced the wait code with a fluentd timer a while ago.

Could you open a new issue with the exact versions of fluentd + systemd you are using...