cncf / demo

Demo of CNCF technologies
https://cncf.io
Apache License 2.0

CPU spike when pulling big containers can kill nodes & the whole cluster #163

Open namliz opened 7 years ago

namliz commented 7 years ago

A Very Large Container can cause a huge CPU spike.

This is hard to pinpoint exactly; it could be just docker pull working very hard, a kubelet bug, or something else.
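A minimal way to watch it happen, as a sketch (the image name below is a placeholder for the actual container):

# Placeholder name for the actual Very Large Container image.
IMAGE=registry.example.com/very-large-image:latest

# Sample system-wide CPU once per second in the background while pulling.
vmstat 1 > vmstat-during-pull.txt &
VMSTAT_PID=$!

# Time the pull itself so it can be lined up against the vmstat samples.
time docker pull "$IMAGE"

# Stop the background sampler.
kill "$VMSTAT_PID"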

[screenshot: cpu-spike]

CloudWatch doesn't quite capture how bad this is: nodes freeze up to the point where you can't SSH into them, and everything becomes totally unresponsive. Eventually (after 7 minutes in this case) it finally revs down and recovers, except the Weave pods don't. Now the cluster is shot.

[screenshot: nodes-overloaded]

Running kubectl delete -f https://git.io/weave-kube followed by kubectl apply -f https://git.io/weave-kube does not help.

kubectl logs weave-net-sbbsm --namespace=kube-system weave-npc

..
time="2016-11-17T04:16:44Z" level=fatal msg="add pod: ipset [add weave-k?Z;25^M}|1s7P3|H9i;*;MhG 10.40.0.2] failed: ipset v6.29: Element cannot be added to the set: it's already added\n: exit status 1"
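Not sure yet whether this helps, but a minimal sketch (assuming shell access to the affected node) for inspecting the ipset state that weave-npc is complaining about, and for bouncing just the stuck pod:

# Show the ipsets weave-npc manages on this node; the set named in the
# error above should appear here with 10.40.0.2 already present.
sudo ipset list | grep -A 8 'Name: weave-'

# Deleting the stuck pod (name taken from the log command above) lets the
# DaemonSet recreate it instead of re-applying the whole manifest.
kubectl delete pod weave-net-sbbsm --namespace=kube-system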

To be fair, the nodes are t2.micro and have handled everything so far. Perhaps this is their natural limit; retrying with larger instances.

namliz commented 7 years ago

Bigger instance types (m4) for the nodes did appear to help at first, but one node still froze up (albeit faster, after 2 minutes) and brought down the Weave DaemonSet. This seems to be a pathological container image.

I'll try redoing with a FROM debian:wheezy image. It is quite interesting that a container can freeze a node!

namliz commented 7 years ago

A smaller image (210 MB) worked just fine. I'm not at all sure this has to do with size; something was causing very bad CPU spikes on the nodes when they pulled the problematic image, and quite possibly a kernel panic on the hosts.

I'm going to set it aside and see if anybody is interested in exploring a pathological image. It could be a Docker thing or a Kubernetes thing.
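For whoever wants to poke at it, a rough sketch for comparing the pathological image against the 210 MB one that worked (the image names here are placeholders):

# Overall image sizes (placeholder tags).
docker images pathological-image:latest
docker images small-image:latest

# Per-layer breakdown; a single enormous or highly compressed layer would be
# doing most of the work during pull and extraction.
docker history --no-trunc pathological-image:latest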

jeremyeder commented 7 years ago

It's probably just resource contention, but perhaps we can narrow it down. On the problematic machine, can you share dmesg and the journal content from around the time of the pull? Also run perf record -a -g -F 100 while the issue is occurring; let it run for perhaps 10-20 seconds, then ctrl+c, and then do perf report > out.txt so we can see what the CPUs are doing. What versions of docker and kube do you have? What storage graph driver are you using for docker?
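Roughly, as a sketch to run on the problematic node (output file names are arbitrary):

# Kernel messages and the journal from around the time of the pull.
sudo dmesg > dmesg.txt
sudo journalctl --since "1 hour ago" > journal.txt

# Sample all CPUs with call graphs at 100 Hz while the spike is happening;
# stop with ctrl+c after 10-20 seconds, then dump the report.
sudo perf record -a -g -F 100
sudo perf report > out.txt

# Versions and the storage graph driver docker is using.
docker version
docker info | grep -i 'storage driver'
kubectl version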