deis / monitor

Monitoring for Deis Workflow
https://deis.com
MIT License

fluentd pod hogging memory after spike in traffic #166

Closed · gedimin45 closed this issue 7 years ago

gedimin45 commented 7 years ago

I deployed an app that outputs logs at a rapid pace (tens of messages per second) and the fluentd pod's memory consumption increased. That makes sense, but after I killed the app that was spamming logs, the memory consumption stayed high. It seems like some messages just stay in memory. I have also seen this in the logs:

2016-12-14T14:03:29.277996730Z 2016-12-14 14:03:29 +0000 [warn]: emit transaction failed: error_class=NoMethodError error="undefined method `[]' for nil:NilClass" tag="kubernetes.var.log.containers.folksy-vacation-web-3441316637-w3jqs_folksy-vacation_POD-6998fb92734c2116c49ecff689f6314b7581c6931caf24717e90c0c8fc3da505.log"
2016-12-14T14:03:29.278460840Z   2016-12-14 14:03:29 +0000 [warn]: suppressed same stacktrace
bacongobbler commented 7 years ago

So the undefined method '[]' for nil:NilClass error from fluentd is a known upstream bug that hasn't been root-caused yet: https://github.com/fluent/fluentd/issues/1248

As for the increased memory consumption: since fluentd and nsq need to process every message, I'd expect memory usage to stay high until the backlog has been worked through. Your output throughput is simply slower than your input rate. Is there a specific enhancement you'd like to see here?

gedimin45 commented 7 years ago

The memory consumption did not go down one bit, even a day after the spike. I think that when the undefined method error occurs the object is not released from memory, but that is just a guess. Would it be possible to set a memory limit for the fluentd pod? Perhaps a sane default in the chart?

bacongobbler commented 7 years ago

The memory limit is there in the chart, but it is commented out by default so that clusters can run the monitoring stack with unlimited resources. "Sane defaults" differ from cluster to cluster, so there's no silver bullet.
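For reference, a minimal sketch of what such a limit looks like, assuming the chart exposes a standard Kubernetes resources block for the fluentd DaemonSet. The exact keys and their location in the Workflow charts may differ, and the values below are illustrative rather than recommended:

resources:
  requests:
    memory: "128Mi"   # baseline working set under normal log volume
    cpu: "100m"
  limits:
    memory: "256Mi"   # pod is OOM-killed and restarted if it exceeds this
    cpu: "200m"

With a limit in place, the kubelet restarts the pod if it blows past the cap, which also works around a leak of the kind suspected above, at the cost of losing whatever was buffered in memory at the time.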

gedimin45 commented 7 years ago

Got it. I'm pretty sure this is a bug in fluentd, but I don't feel like debugging Ruby code that might only misbehave under high load 😄 Thanks for the suggestions!