deis / monitor

Monitoring for Deis Workflow
https://deis.com
MIT License

Telegraf unable to collect Kubernetes stats after cluster failure #152

Closed · gedimin45 closed this issue 7 years ago

gedimin45 commented 7 years ago

Hello,

I recently upgraded my K8s cluster running on GKE by just hitting the upgrade button on the node pool. Not the smartest thing to do, I know, since it took my cluster down for a few minutes. After the cluster was back up again, the Deis pods also seemed to be back up and my apps were working. A few days later I noticed that metrics no longer appeared in Grafana. The Telegraf pod log was filled with the following messages:

E! ERROR in input [kubernetes]: Errors encountered: [error making HTTP request to http://10.132.0.9:10255/stats/summary: dial tcp 10.132.0.9:10255: connect: cannot assign requested address]

I am no K8s expert, but I assume that after the cluster restart the address that K8s exposes metrics on changed, and that didn't get updated when the pod was recreated. This also resulted in the Telegraf pods consuming 1 GB of memory each. When I deleted the Telegraf pods and K8s recreated them, the metrics reappeared and memory consumption went down to ~10 MB. Could someone point me to the place where this could be fixed so Deis avoids this after a clumsy cluster restart (or total cluster failure)? I'd also be happy to contribute :)
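For what it's worth, a rough way to check whether the kubelet summary endpoint is reachable at all would be something like this from the affected node (the IP and read-only port are just the ones from the error above):

    # hit the kubelet read-only stats endpoint that Telegraf is polling
    curl -s http://10.132.0.9:10255/stats/summary | head

If that works from the node but Telegraf still can't connect, the problem is probably on the Telegraf side rather than a changed address, but I haven't verified that.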

jchauncey commented 7 years ago

Hrm, I'll try reproducing this and see what I find.

felixbuenemann commented 7 years ago

@ged15 I think the above error might be caused by #153: the memory leak there was caused by leaked TCP connections, and since each leaked connection holds on to a local port, your host likely just ran out of ephemeral ports.

You can check the allocated port range using cat /proc/sys/net/ipv4/ip_local_port_range on the host. I had one host affected by #153 with ~23,000 active connections (~730 MB RSS) and another one with ~26,000 active connections (~945 MB RSS); for comparison, my CoreOS hosts only have 28,231 ephemeral ports.

If your host was using similar defaults, it must have hit that ~28k connection limit at around 1 GB of memory usage.
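If you want to verify this on your own hosts, something along these lines should show both numbers (assuming ss is available on the node; 10255 is the kubelet read-only port from the error above):

    # ephemeral port range, e.g. "32768 61000" leaves roughly 28k usable ports
    cat /proc/sys/net/ipv4/ip_local_port_range
    # established connections to the kubelet read-only port (minus one header line)
    ss -tan state established '( dport = :10255 )' | wc -l

Once the number of established connections approaches the size of that range, new dials start failing with "cannot assign requested address", which matches the error above.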

@jchauncey is already working on a fix, and I'm running a fixed build that solved the issue on the two affected clusters I tried it on.

bacongobbler commented 7 years ago

@ged15 can you try upgrading to v2.8.0 and see if your issue is resolved on the latest release?

gedimin45 commented 7 years ago

The problem occurred after I accidentally killed my whole cluster by upgrading it to the newest K8s version. It might well have been Telegraf leaking connections. Anyway, with Workflow 2.8 it seems to be fixed, so closing this issue :) Thanks for the help, people!