deis / monitor

Monitoring for Deis Workflow
https://deis.com
MIT License

Monitor-Telegraf Pod is on CrashLoopBackOff state on Master node With K8s 1.4.0 #146

Closed felipejfc closed 7 years ago

felipejfc commented 7 years ago
Containers:
  deis-monitor-telegraf:
    Container ID:   docker://49183ac8c79d76792489bdc4314eae09bca2dddecb49e81f7a2be533295c7238
    Image:      quay.io/deis/telegraf:v2.4.0
    Image ID:       docker://sha256:90156d3ebc440f6b017dae901da5e096e5e92291ab2f2a345516d7416315236a
    Port:
    State:      Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    7
      Started:      Sun, 02 Oct 2016 15:35:26 -0300
      Finished:     Sun, 02 Oct 2016 15:35:29 -0300
    Ready:      False
    Restart Count:  7

It used to run fine on my 1.3.5 cluster, but now the pod scheduled on the master is in CrashLoopBackOff for some reason; the ones scheduled on the minions are normal, though.

jchauncey commented 7 years ago

What do the pod logs say?

felipejfc commented 7 years ago

Nothing helpful...

$ kube-stag logs deis-monitor-telegraf-hsrw3 --namespace deis
Creating topic with URL: http://100.70.57.61:4151/topic/create?topic=metrics
$ kube-stag logs -p deis-monitor-telegraf-hsrw3 --namespace deis
Creating topic with URL: http://100.70.57.61:4151/topic/create?topic=metrics

jchauncey commented 7 years ago

You should have a lot more output than that. That means something is wrong with the image. I did an install last night on a 1.4.0 cluster with 2.6.0 and everything came up fine.

jchauncey commented 7 years ago

Did you change anything in the chart configuration? Or is this a stock install? Have you tried deleting the pods and recreating them using the daemonset file in the manifest directory of the chart?
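
For reference, a rough sketch of that delete-and-recreate step (the daemonset name is inferred from the pod name, and the manifest path/filename are assumptions about the chart layout, not exact paths):

$ kubectl --namespace deis delete pod deis-monitor-telegraf-hsrw3   # let the daemonset reschedule it
$ # or recreate the whole daemonset from the chart manifest:
$ kubectl --namespace deis delete daemonset deis-monitor-telegraf
$ kubectl --namespace deis create -f <chart-dir>/manifests/deis-monitor-telegraf-daemonset.yaml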

felipejfc commented 7 years ago

It is a stock install... I tried deleting the pod (note that not all of them are in a crash loop, only the one on the master node).

It seems to be restarting roughly every 2 minutes.

felipejfc commented 7 years ago

Just upgraded Deis to 2.6; monitor-telegraf was bumped to version 2.5.1 and the problem is still happening... :/

jchauncey commented 7 years ago

Telegraf running on the master node is new behavior, I think. I noticed too that on my 1.4 cluster the master shows up in the list when you do kubectl get nodes; not sure why they are doing that.

What os are you using?

felipejfc commented 7 years ago

Debian with kernel 4.4

WillPlatnick commented 7 years ago

:+1: same issue, deployed 1.4 via kops

shulcsm commented 7 years ago

Same issue, fresh install.

jchauncey commented 7 years ago

@shulcsm kops too? @felipejfc are you also using kops?

shulcsm commented 7 years ago

kubeadm on Ubuntu 16.04

jchauncey commented 7 years ago

Ok there is a hunch going around that this may be related to some kubernetes 1.4 work where they made add-ons daemonsets (which is how we deploy telegraf). This is why we are seeing telegraf get scheduled onto the master node. I am working on a way to restrict that from happening.
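
One possible way to pin the daemonset to non-master nodes, sketched below, is a nodeSelector on a worker-only label; the kubernetes.io/role=node value is an assumption about how kops labels its workers, and depending on the 1.4 daemonset controller behavior the existing pod on the master may still need to be deleted by hand after patching:

$ kubectl --namespace deis patch daemonset deis-monitor-telegraf \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/role":"node"}}}}}'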

felipejfc commented 7 years ago

@jchauncey yes! I do use kops

jchauncey commented 7 years ago

This is related to this issue - https://github.com/kubernetes/kubernetes/issues/29178

jchauncey commented 7 years ago

I'm not entirely sure how to solve this problem yet, considering that 1.4 doesn't have a label to tell a daemonset not to schedule there. I'll keep thinking about other ways to solve this. But I would still like to know why telegraf is crashlooping.

shulcsm commented 7 years ago

I tainted the master (my cluster consists of one node) and everything is running now.

jchauncey commented 7 years ago

@shulcsm what taint did you apply?

shulcsm commented 7 years ago

kubectl taint nodes --all dedicated-
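
For context, the trailing "-" there removes the taint keyed dedicated from every node rather than adding one; on a single-node cluster that is what lets the workloads land on the master. A sketch of inspecting and then removing it:

$ kubectl describe node <master-node-name> | grep -i taints   # show what is currently set
$ kubectl taint nodes --all dedicated-                        # the trailing "-" removes the "dedicated" taint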

jchauncey commented 7 years ago

@felipejfc and @WillPlatnick if you two can see if the above command fixes your issue that would be great

felipejfc commented 7 years ago

Well, what would be the implications of tainting my master with dedicated-?

jchauncey commented 7 years ago

Afaik it should make it so nothing runs on it

jchauncey commented 7 years ago

@felipejfc is it possible for you to ssh into your master node and look at the kubelet configuration and see if you can find where it does the following: --pod-cidr=

Trying to see if we are also being affected by this problem - https://github.com/kubernetes/kops/issues/204

felipejfc commented 7 years ago

@jchauncey


admin@ip-172-21-124-39:~$ ps aux | grep kubelet
root       843  1.9  1.0 448284 89452 ?        Ssl  Oct02 110:46 /usr/local/bin/kubelet --allow-privileged=true --api-servers=http://127.0.0.1:8080 --babysit-daemons=true --cgroup-root=docker --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --config=/etc/kubernetes/manifests --configure-cbr0=true --enable-debugging-handlers=true --hostname-override=ip-172-21-124-39.ec2.internal --network-plugin-mtu=9001 --network-plugin=kubenet --node-labels=kubernetes.io/role=master --non-masquerade-cidr=100.64.0.0/10 --pod-cidr=10.123.45.0/29 --reconcile-cidr=true --register-schedulable=false --v=2

jchauncey commented 7 years ago

--pod-cidr=10.123.45.0/29 does not provide enough IPs for the number of pods we are trying to run on the master. It should probably be upped to a /28.
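
As a quick sanity check on those sizes (how many addresses kubenet reserves is an assumption, so treat the usable counts as approximate):

$ echo $(( 2 ** (32 - 29) ))   # a /29 is 8 addresses; minus network, broadcast and bridge, only ~5 are left for pods
8
$ echo $(( 2 ** (32 - 28) ))   # a /28 is 16 addresses, roughly 13 usable for pods
16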

WillPlatnick commented 7 years ago

Spoke to kops maintainer @justinsb - He's going to put in a PR in kops to raise it, but he requests that Deis put in a PR for this with kube-up too so a discussion can be had there.

bacongobbler commented 7 years ago

Just to confirm it's a v1.4.0 issue, can you try running this on Kubernetes v1.3.8? From what I'm reading in kubernetes/kubernetes, kube-up with GCE uses a /30 pod CIDR on v1.3.8 and a /29 on v1.4.0. Not sure if that's what is making the difference here, but https://github.com/kubernetes/kubernetes/pull/32886 is the PR in question. Just thought I'd report on what upstream's pod CIDR ranges are.

WillPlatnick commented 7 years ago

kops merged in a default /28 for us. Updated the cluster, verified the kubelet is running with a /28, and the issue is still occurring. Nothing in the logs other than the "Creating topic" line.

jchauncey commented 7 years ago

OK, let me think of some other things that might help us debug this problem.

felixbuenemann commented 7 years ago

The deis monitor was already a daemon set before 1.4.x and it runs on all nodes on 1.3.x as well.

felipejfc commented 7 years ago

Yes, but on 1.3.0 it was not getting stuck in a crash loop restart.

felixbuenemann commented 7 years ago

It is working for me on 1.4.3/1.4.4 on CoreOS beta with a podCIDR of 10.2.0.0/16 and 1.3.8/1.3.9 with same podCIDR on CoreOS stable.

If the container crashes and the only log message is "Creating topic with URL …" then the curl request must fail. So my guess would be a connectivity issue to nsqd. A modified deis-monitor-telegraf image which uses "curl -v -s" should be helpful to see what's going on.

See https://github.com/deis/monitor/blob/master/telegraf/rootfs/start-telegraf#L17
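
For anyone poking at this locally, the topic-creation step presumably looks something like the sketch below, with curl switched from silent to verbose (the variable name is a placeholder, not the actual script; see the link above for the real thing). Worth noting that curl itself exits with code 7 when it cannot connect, which matches the container's exit code in the original report.

# hypothetical excerpt of start-telegraf, with verbose curl for debugging
NSQD_URL="http://${NSQD_HOST}:4151"   # NSQD_HOST stands in for however the script resolves nsqd
echo "Creating topic with URL: ${NSQD_URL}/topic/create?topic=metrics"
curl -v -s -X POST "${NSQD_URL}/topic/create?topic=metrics"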

felixbuenemann commented 7 years ago

I've done some debugging with @WillPlatnick, and it seems connectivity from pods on the controller to the service network is not working, while it works on the workers. This seems to be specific to kops.
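
A rough way to reproduce that check (the service name, port and node name below are assumptions): run a throwaway pod pinned to the master and try to reach nsqd over the service network, then repeat from a worker.

$ kubectl --namespace deis run nettest --rm -it --restart=Never --image=busybox \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"<master-node-name>"}}' \
    -- wget -qO- http://deis-nsqd.deis.svc.cluster.local:4151/ping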

jchauncey commented 7 years ago

Is there any way to get enough debug information so we can open an issue with kops?

felixbuenemann commented 7 years ago

I think @WillPlatnick is already working on opening an issue with kops.

WillPlatnick commented 7 years ago

The base issue is a kubernetes one apparently. They tried to fix it yesterday, but it didn't go too well and had to be reverted.

https://github.com/kubernetes/kubernetes/pull/35526 is the active PR to fix this. Hopefully will be in 1.5.

justinsb commented 7 years ago

I think the problem is specific to configurations where the master is registered as a node, when running kubenet. Hopefully we'll get it fixed upstream.

bacongobbler commented 7 years ago

kubernetes/kubernetes#35526 has since been merged and is available upstream in k8s v1.5.0+. Closing.