Closed gedimin45 closed 7 years ago
Alright, I will take a look at this today/Monday.
I see the same issue. I set up a new cluster on CoreOS Beta / Docker 1.11.2 / kube-aws / k8s 1.4.3 / Workflow 2.7.0 on Friday, and the controller became unresponsive today because telegraf is eating ~700 MB of RAM; the machine only has 2 GB, which is usually fine for a small cluster controller.
The logs of the telegraf pod on the controller show lots of entries like this; not sure if they are normal or part of the issue:
ERROR: input [nsq] took longer to collect than collection interval (1s)
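For context, that error means an input plugin's collection run took longer than the agent's configured collection interval. The interval lives in telegraf's agent config; a minimal sketch (the 1s value matches the error above, but the actual deployed config may differ):

```toml
# telegraf.conf agent section (sketch, not the deployed config)
[agent]
  # How often all input plugins are polled; the error above fires
  # whenever an input's gather takes longer than this.
  interval = "1s"
```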
@ged15 Which kubernetes version are you running?
My K8s version:
Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.1", GitCommit:"33cf7b9acbb2cb7c9c72a10d6636321fb180b159", GitTreeState:"clean", BuildDate:"2016-10-10T18:19:49Z", GoVersion:"go1.7.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.3", GitCommit:"4957b090e9a4f6a68b4a40375408fdc74a212260", GitTreeState:"clean", BuildDate:"2016-10-16T06:20:04Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Running on GCloud.
The last entry in the logs of the Telegraf pod (no log entries afterwards):
2016/10/21 07:29:48 INF 1 [metrics/consumer] (10.19.246.231:4150) connecting to nsqd
Not sure if this is useful, but ERROR: input [nsq] took longer to collect than collection interval (1s) does not appear in my logs.
The timeouts in my case might be related to the controller running out of memory. I had to add a swapfile to be able to talk to the k8s api again.
I did some debugging with @jchauncey and it appears the telegraf pod is leaking TCP connections to cAdvisor (port 10255). You can check on a node by running netstat -tan | grep 10255 | wc -l; in my case there were more than 24k open connections.
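To show what that pipeline actually counts, here is the same check run against a fabricated netstat-style sample (the addresses are made up for illustration):

```shell
# On an affected node the real check is:
#   netstat -tan | grep 10255 | wc -l
# Fabricated netstat-style output, just to demonstrate the pipeline:
sample='tcp        0      0 10.0.0.5:43210   10.0.0.5:10255   ESTABLISHED
tcp        0      0 10.0.0.5:43211   10.0.0.5:10255   ESTABLISHED
tcp        0      0 10.0.0.5:43212   10.0.0.5:80      ESTABLISHED'

# Count connections touching cAdvisor's port 10255 (here: 2 of 3).
echo "$sample" | grep 10255 | wc -l
```

On a healthy node the count should stay small and stable; tens of thousands, as reported above, indicate the leak.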
@jchauncey There's currently no milestone assigned. Will the fix be included in Workflow v2.8.0?
Yeah, we are holding the release until we get this fixed.
I think he's getting that error because nsq is slow to respond due to the memory pressure.
Just updated to 2.8 and the problem seems gone. Thanks, people! Although now the graph looks weird, as it only displays data for the last 3 minutes. Is this expected?
Yeah, the graph resolution is my fault. You can change it in the top right corner. I'll fix it for the next release.
Thanks @jchauncey and @felixbuenemann!
Telegraf memory usage seems to grow steadily over time in my cluster:
At first I let it grow, expecting that the GC would kick in, but it never did. When consumption reached 1 GB, I just deleted the pods (which made the `DaemonSet` create new ones). When I removed `ENABLE_KUBERNETES` from the `DaemonSet` manifest (setting it to `false` did not have any effect), the memory consumption stayed at ~13 MB. Of course, then no metrics appear in Grafana.
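For reference, the toggle described above lives in the telegraf `DaemonSet`'s pod template. A minimal sketch of the relevant env entry (the container name and manifest layout here are assumptions, not copied from the actual deis/monitor chart):

```yaml
# Fragment of the telegraf DaemonSet pod template (sketch, not the full manifest)
spec:
  template:
    spec:
      containers:
        - name: deis-monitor-telegraf
          env:
            # Per the report above, deleting this entry entirely stopped
            # the memory growth; setting it to "false" had no effect.
            - name: ENABLE_KUBERNETES
              value: "true"
```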