flori opened this issue 2 years ago
Hey, thanks for reporting this! Can you provide some more detail about your environment and agent so that we can try to reproduce?
This is a single Compute Engine VM running in GCP's europe-west1-b zone. I cannot find out which Ruby it is running or which gems it is using, only that it is currently running `cos-stable-89-16108-470-16`. The leak still seems to be an issue; the fluentd ruby process currently takes up one third of the VM's memory. See this metrics graph I made of the VM's memory usage:
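For anyone wanting the same numbers without a metrics dashboard, resident memory and percent-of-RAM for a process can be read with standard procps fields. This is a generic sketch, not part of google-fluentd; the `pgrep` pattern in the comment is an assumption about the worker's command line:

```shell
#!/bin/sh
# Sketch: report resident memory (kB) and percent of RAM for a pid,
# using standard procps `ps` output format specifiers.
mem_of_pid() {
  ps -o rss=,pmem= -p "$1"
}

# e.g. for the google-fluentd ruby worker (hypothetical pattern):
#   mem_of_pid "$(pgrep -f 'ruby.*fluentd' | head -n1)"
mem_of_pid $$
```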
Just had a failure of the whole cluster because of google-fluentd ruby memory issues.
The infrastructure is completely logged with google-fluentd and Prometheus: Ubuntu/CentOS servers as Docker hosts, plus nginx load balancers and some custom apps. The error was noticed because the Docker hosts were no longer able to run some containers (OOM); nginx still managed to work with 2% free memory. I don't know when it started. I had a similar issue 2-3 months ago, where I thought Docker logs were the problem, and at the time a reboot fix was all I had time for.
For most servers a `sudo systemctl restart google-fluentd` fixed it; the ruby process restarted and the memory was freed.

Log of `/var/log/google-fluentd/google-fluentd.log` up to the restart, on a server where it worked:
```
2022-01-16 03:26:02 +0000 [info]: #0 flushing all buffer forcedly
2022-01-16 03:26:02 +0000 [info]: #0 detected rotation of /var/log/messages; waiting 5 seconds
2022-01-16 03:26:02 +0000 [info]: #0 following tail of /var/log/messages
2022-01-16 13:04:57 +0000 [info]: Received graceful stop
2022-01-16 13:04:58 +0000 [info]: #0 fluentd worker is now stopping worker=0
2022-01-16 13:04:58 +0000 [info]: #0 shutting down fluentd worker worker=0
```
On some servers the restart/stop took ages, and I just killed the google-fluentd service; the ruby process was still there and I needed SIGKILL (`kill -9 <pid>`) to free the memory. Without `-9` it worked on some servers, but not on all.
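The escalation described above (plain `kill`, then `kill -9` if the process hangs) can be sketched as a small helper. The function name and timeout are illustrative, not part of google-fluentd:

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM, wait up to $2 seconds, then
# escalate to SIGKILL if the process is still alive.
stop_with_escalation() {
  pid=$1
  timeout=${2:-10}
  kill "$pid" 2>/dev/null || return 0   # already gone
  i=0
  while kill -0 "$pid" 2>/dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$timeout" ]; then
      kill -9 "$pid" 2>/dev/null        # the kill -9 <pid> step above
      break
    fi
    sleep 1
  done
}
```

Note that SIGKILL bypasses fluentd's graceful shutdown and buffer flush, so any logs still held in memory are lost.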
The log on such a server shows only the latest stop (this nginx server's timezone is somehow Germany, not UTC like the other one); the value after `object:` is replaced, I don't know whether it might be sensitive:
```
2022-01-16 03:42:01 +0100 [info]: #0 flushing all buffer forcedly
2022-01-16 10:18:44 +0100 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=85.15702900663018 slow_flush_log_threshold=20.0 plugin_id="object:aaaaaaa"
2022-01-16 11:52:57 +0100 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=83.46101826429367 slow_flush_log_threshold=20.0 plugin_id="object:aaaaaaa"
2022-01-16 14:10:53 +0100 [info]: Received graceful stop
2022-01-16 14:10:54 +0100 [info]: #0 fluentd worker is now stopping worker=0
2022-01-16 14:10:54 +0100 [info]: #0 shutting down fluentd worker worker=0
```
Commands used:
```
sudo systemctl restart google-fluentd
top # ram was not free
sudo systemctl restart google-fluentd # took ages
sudo systemctl stop google-fluentd # took ages
sudo systemctl kill google-fluentd
sudo systemctl start google-fluentd
top # ram was not free
sudo kill -9 <pid-or-ruby>
top # ram was free
sudo reboot # just to be sure everything was/is running as it should
```
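Until the leak itself is fixed, a watchdog along these lines could automate the restart dance, e.g. from cron or a systemd timer. Everything here is a sketch: the threshold, the `pgrep` pattern, and the commented-out restart are assumptions about a typical google-fluentd install:

```shell
#!/bin/sh
# Hypothetical watchdog: restart google-fluentd when its ruby worker's
# resident memory exceeds a limit.
rss_kb_of_pid() {
  # VmRSS in /proc/<pid>/status is reported in kB on Linux.
  awk '/^VmRSS:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

check_and_restart() {
  pid=$1
  limit_kb=$2
  rss=$(rss_kb_of_pid "$pid")
  [ -z "$rss" ] && return 0             # process gone, nothing to do
  if [ "$rss" -gt "$limit_kb" ]; then
    echo "ruby worker at ${rss} kB (> ${limit_kb} kB), restarting"
    # systemctl restart google-fluentd  # uncomment on a real host
  fi
}

# Example invocation, with a 1 GiB limit (pattern is an assumption):
#   check_and_restart "$(pgrep -f 'ruby.*fluentd' | head -n1)" 1048576
```

A forced restart loses less than an OOM kill of the whole host, but it is still only a mitigation, not a fix.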
What further info would help, and how could I get it easily?
We are running into the same issue:

The agent is installed on a VM instance running COS 89 with the metadata tag `google-logging-enabled: true`. Restarting the stackdriver-logging-agent mitigates the issue, but it comes back after a while.

The specific image with the memory leak is `gcr.io/stackdriver-agents/stackdriver-logging-agent:1.8.4`. An older version of that image, `gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1`, does not have the same memory issue.
Is this the same as https://github.com/fluent/fluentd/issues/3401?
[EDIT] Actually, `fluent-plugin-systemd` was upgraded to 1.0.5 in 1.9.1, but https://github.com/GoogleCloudPlatform/google-fluentd/issues/368#issuecomment-1046983190 mentions 1.8.4 leaking as well.
[ANOTHER EDIT] Mystery potentially solved: the `fluent-plugin-systemd` misbehavior with `read_from_head` set, diagnosed in the referenced issue, was already present in version 1.0.2 (used by google-fluentd 1.8.4).
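For context, `read_from_head` is an option of the `fluent-plugin-systemd` source. A minimal source block of the kind affected might look like this (the path and tag are illustrative, not taken from the google-fluentd config):

```
<source>
  @type systemd
  path /var/log/journal
  # the misbehavior discussed above was tied to read_from_head being set
  read_from_head true
  tag systemd
</source>
```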
I had one of these killed recently by the Linux OOM killer, after 87 days running on Google Cloud:
There seems to be a memory leak somewhere.