GoogleCloudPlatform / google-fluentd

Packaging scripts for the Stackdriver logging agent (google-fluentd).
https://cloud.google.com/logging/docs/agent/
Apache License 2.0

Memory leak in ruby process? #368

Open flori opened 2 years ago

flori commented 2 years ago

I had one of these killed recently by the Linux OOM killer, after 87 days running on Google Cloud:

[7595615.399139] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=b9bd611bc9ee699608d0bfe919b90f987e94e924fbf57748f078433c8c4b1e20,mems_allowed=0,global_oom,task_memcg=/docker/b9bd611bc9ee699608d0bfe919b90f987e94e924fbf57748f078433c8c4b1e20,task=ruby,pid=1237,uid=0
[7595615.424623] Out of memory: Killed process 1237 (ruby) total-vm:1254608kB, anon-rss:614712kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1940kB oom_score_adj:0
[7595615.528767] oom_reaper: reaped process 1237 (ruby), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

There seems to be a memory leak somewhere.
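
For anyone who wants to check their own VMs for the same pattern: the OOM events above come from the kernel log, so something like the following should surface them (a rough sketch; exact grep patterns and tool availability vary by distro):

# kernel log with human-readable timestamps, filtered for OOM-killer events
dmesg -T | grep -iE 'out of memory|oom'
# or via the journal, limited to kernel messages
journalctl -k | grep -i oom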

jkschulz commented 2 years ago

Hey, thanks for reporting this! Can you provide some more detail about your environment and agent so that we can try to reproduce?

flori commented 2 years ago

This is a single Compute Engine VM running in GCP's europe-west1-b zone. I cannot find out which Ruby it is running or which gems it is using, only that it is currently running cos-stable-89-16108-470-16. The leak still seems to be an issue; the fluentd Ruby process already takes up one third of the VM's memory at present. See this metrics graph I made of the VM's memory usage:

[Screenshot of the VM memory-usage metrics graph, 2021-12-30 10:38:39]
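
For anyone trying to pin this down, one way that might show which Ruby and which gems the containerized agent uses is sketched below. The container name placeholder and the /opt/google-fluentd/embedded path are assumptions and may differ between COS/agent releases:

# find the logging-agent container (name and image vary by COS release)
docker ps --format '{{.Names}}  {{.Image}}' | grep -i stackdriver
# Ruby version and relevant gems inside it (embedded path is an assumption)
docker exec <container> /opt/google-fluentd/embedded/bin/ruby --version
docker exec <container> /opt/google-fluentd/embedded/bin/gem list | grep -E 'fluentd|fluent-plugin'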

elbakerino commented 2 years ago

Just had a failure of the whole cluster because of google-fluentd ruby memory issues.

The infrastructure is completely logged with google-fluentd and prometheus: ubuntu/centos servers as docker hosts, plus nginx load balancers and some custom apps. The error was recognized because the docker hosts were not able to run some containers (OOM); nginx still managed to work with 2% free memory. I don't know when it started; I had a similar issue 2-3 months ago where I thought docker logs were the problem, and at the time a reboot fix was all I had time for.

For most servers a sudo systemctl restart google-fluentd fixed it: the ruby process restarted and the memory was freed.

Log of /var/log/google-fluentd/google-fluentd.log until the restart, on a server where the restart worked:

2022-01-16 03:26:02 +0000 [info]: #0 flushing all buffer forcedly
2022-01-16 03:26:02 +0000 [info]: #0 detected rotation of /var/log/messages; waiting 5 seconds
2022-01-16 03:26:02 +0000 [info]: #0 following tail of /var/log/messages
2022-01-16 13:04:57 +0000 [info]: Received graceful stop
2022-01-16 13:04:58 +0000 [info]: #0 fluentd worker is now stopping worker=0
2022-01-16 13:04:58 +0000 [info]: #0 shutting down fluentd worker worker=0

On some servers the restart/stop took ages, so I just killed the google-fluentd service. The ruby process was still there, and I needed to kill it with SIGKILL (kill -9 <pid>) to free the memory; kill without -9 worked on some servers, but not on all.

The log on such a server only shows the latest stop (this nginx server's timezone is somehow Germany, not UTC like the other one); the value after object: is replaced, since I don't know whether it might be sensitive:

2022-01-16 03:42:01 +0100 [info]: #0 flushing all buffer forcedly
2022-01-16 10:18:44 +0100 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=85.15702900663018 slow_flush_log_threshold=20.0 plugin_id="object:aaaaaaa"
2022-01-16 11:52:57 +0100 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=83.46101826429367 slow_flush_log_threshold=20.0 plugin_id="object:aaaaaaa"
2022-01-16 14:10:53 +0100 [info]: Received graceful stop
2022-01-16 14:10:54 +0100 [info]: #0 fluentd worker is now stopping worker=0
2022-01-16 14:10:54 +0100 [info]: #0 shutting down fluentd worker worker=0

Commands used:

sudo systemctl restart google-fluentd
top # ram was not free
sudo systemctl restart google-fluentd # took ages
sudo systemctl stop google-fluentd # took ages
sudo systemctl kill google-fluentd
sudo systemctl start google-fluentd
top # ram was not free
sudo kill -9 <pid-of-ruby>
top # ram was free
sudo reboot # just to be sure everything was/is running as it should

What further info would help, and how could I get it easily?
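
Some data points that are usually easy to capture before restarting, as a sketch (this assumes the packaged fluentd still bundles the sigdump gem, which writes a thread/GC dump to /tmp/sigdump-<pid>.log on SIGCONT):

# resident memory and uptime of the ruby workers, sampled over time
ps -o pid,rss,etime,args -C ruby
# ask fluentd for an internal dump (sigdump gem, if bundled)
sudo kill -CONT <pid-of-ruby>
# if systemctl stop hangs, SIGKILL can also be sent through systemd
sudo systemctl kill -s SIGKILL google-fluentd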

EDeijl commented 2 years ago

We are running into the same issue:

It is installed on a VM instance running COS 89 with the metadata tag google-logging-enabled: true.

Restarting the stackdriver-logging-agent mitigates the issue, but it comes back after a while.

The specific image with the memory leak is gcr.io/stackdriver-agents/stackdriver-logging-agent:1.8.4.

An older version of that image, gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1, does not have the same memory issue.
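
For comparing the two image versions it can help to record which image is running and how much memory its container uses over time; a sketch, assuming the COS agent container is named stackdriver-logging-agent (the name is an assumption and may differ):

# which agent image is currently running
docker ps --format '{{.Names}}  {{.Image}}' | grep -i stackdriver
# point-in-time memory usage of the agent container
docker stats --no-stream stackdriver-logging-agent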

igorpeshansky commented 2 years ago

Is this the same as https://github.com/fluent/fluentd/issues/3401?

[EDIT] Actually, fluent-plugin-systemd was upgraded to 1.0.5 in 1.9.1, but https://github.com/GoogleCloudPlatform/google-fluentd/issues/368#issuecomment-1046983190 mentions 1.8.4 leaking as well.

[ANOTHER EDIT] Mystery potentially solved: the fluent-plugin-systemd misbehavior with read_from_head set, diagnosed in the referenced issue, was already present in version 1.0.2 (used by google-fluentd 1.8.4).
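
For anyone who wants to check whether their agent is exposed to that code path, one way is to look for a systemd source with read_from_head in the agent config and to check the installed plugin version; a sketch, with paths that are assumptions (the config and embedded Ruby usually live under /etc/google-fluentd and /opt/google-fluentd/embedded on a VM install, or inside the agent container on COS):

# look for read_from_head in the agent's fluentd config
grep -rn -B2 -A2 'read_from_head' /etc/google-fluentd/
# installed fluent-plugin-systemd version
/opt/google-fluentd/embedded/bin/gem list fluent-plugin-systemd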