Closed gedimin45 closed 7 years ago
Alright, I will take a look at this today/Monday.
I see the same issue. I set up a new cluster on CoreOS Beta / Docker 1.11.2 / kube-aws / k8s 1.4.3 / Workflow 2.7.0 on Friday, and the controller became unresponsive today because telegraf is eating ~700 MB of RAM; the machine only has 2 GB, which is usually fine for a small cluster controller.
The logs of the telegraf pod on the controller show lots of entries like this; not sure if they are normal or part of the issue:
ERROR: input [nsq] took longer to collect than collection interval (1s)
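For context, that error means an input plugin's collection run took longer than the agent's configured collection interval. The interval lives in telegraf's agent config; a minimal sketch (the 1s value matches the error above, but the actual deployed config may differ):

```toml
# telegraf.conf agent section (sketch, not the deployed config)
[agent]
  # How often all input plugins are polled; the error above fires
  # whenever an input's gather takes longer than this.
  interval = "1s"
```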
@ged15 Which kubernetes version are you running?
My K8s version:
Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.1", GitCommit:"33cf7b9acbb2cb7c9c72a10d6636321fb180b159", GitTreeState:"clean", BuildDate:"2016-10-10T18:19:49Z", GoVersion:"go1.7.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.3", GitCommit:"4957b090e9a4f6a68b4a40375408fdc74a212260", GitTreeState:"clean", BuildDate:"2016-10-16T06:20:04Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Running on GCloud.
The last entry in the logs of the Telegraf pod (no log entries afterwards):
2016/10/21 07:29:48 INF 1 [metrics/consumer] (10.19.246.231:4150) connecting to nsqd
Not sure if this is useful, but ERROR: input [nsq] took longer to collect than collection interval (1s) does not appear in my logs.
The timeouts in my case might be related to the controller running out of memory. I had to add a swapfile to be able to talk to the k8s api again.
I did some debugging with @jchauncey and it appears the telegraf pod is leaking TCP connections to cAdvisor (port 10255). You can check on a node by running netstat -tan | grep 10255 | wc -l; in my case there were more than 24k open connections.
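To show what that pipeline actually counts, here is the same check run against a fabricated netstat-style sample (the addresses are made up for illustration):

```shell
# On an affected node the real check is:
#   netstat -tan | grep 10255 | wc -l
# Fabricated netstat-style output, just to demonstrate the pipeline:
sample='tcp        0      0 10.0.0.5:43210   10.0.0.5:10255   ESTABLISHED
tcp        0      0 10.0.0.5:43211   10.0.0.5:10255   ESTABLISHED
tcp        0      0 10.0.0.5:43212   10.0.0.5:80      ESTABLISHED'

# Count connections touching cAdvisor's port 10255 (here: 2 of 3).
echo "$sample" | grep 10255 | wc -l
```

On a healthy node the count should stay small and stable; tens of thousands, as reported above, indicate the leak.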
@jchauncey There's currently no milestone assigned. Will the fix be included in Workflow v2.8.0?
Yeah, we are holding the release until we get this fixed.
I think he's getting that error because nsq is slow to respond due to the memory pressure.
Just updated to 2.8 and the problem seems gone. Thanks, people! Although now the graph looks weird, as it only displays data for the last 3 minutes. Is this expected?
Yeah, the graph resolution is my fault. You can change it in the top right corner. I'll fix it for the next release.
Thanks @jchauncey and @felixbuenemann!
Telegraf memory usage seems to grow steadily over time in my cluster:
At first I let it grow, expecting that the GC would kick in, but it never did. When consumption reached 1 GB, I just deleted the pods (which made the `DaemonSet` create new ones). When I removed `ENABLE_KUBERNETES` from the `DaemonSet` manifest (setting it to `false` did not have any effect), the memory consumption stayed at ~13 MB. Of course, then no metrics appear in Grafana.
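For reference, the toggle described above lives in the telegraf `DaemonSet`'s pod template. A minimal sketch of the relevant env entry (the container name and manifest layout here are assumptions, not copied from the actual deis/monitor chart):

```yaml
# Fragment of the telegraf DaemonSet pod template (sketch, not the full manifest)
spec:
  template:
    spec:
      containers:
        - name: deis-monitor-telegraf
          env:
            # Per the report above, deleting this entry entirely stopped
            # the memory growth; setting it to "false" had no effect.
            - name: ENABLE_KUBERNETES
              value: "true"
```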