namliz opened 7 years ago
Bigger instance types (m4) for the nodes did appear to help at first, but one node still froze (albeit faster, after 2 minutes) and brought down the Weave DaemonSet. It seems to be a pathological container image. I'll try rebuilding from a `FROM debian:wheezy` base image. It is quite interesting that a container can freeze a node!
A smaller image (210 MB) worked just fine. I'm not at all sure this has to do with size; something was causing very bad CPU spikes on the nodes when they pulled the image, and quite possibly a kernel panic on the hosts.
I'm going to set the image aside and see if anybody is interested in exploring a pathological image. It could be a Docker thing or a Kubernetes thing.
It's probably just resource contention, but perhaps we can narrow it down. On the problematic machine, can you share `dmesg` and the journal content from around the time of the pull? Also run `perf record -a -g -F 100` while the issue is occurring; let it run for perhaps 10-20 seconds, then Ctrl+C. Then run `perf report > out.txt`. Let's see what the CPUs are doing. What versions of Docker and Kubernetes do you have? Which storage graph driver are you using for Docker?
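The collection steps above could be sketched as a small script; the file names, journal window, and 20-second sampling duration are my assumptions, not from the thread:

```shell
#!/bin/sh
# Hedged sketch: gather diagnostics around the time of the bad pull.
ts=$(date +%Y%m%d-%H%M%S)

# Kernel ring buffer -- look for OOM kills, hung tasks, or panic traces
dmesg -T > "dmesg-$ts.txt"

# System journal from the last 30 minutes (window is an assumption)
journalctl --since "-30 min" > "journal-$ts.txt"

# Sample all CPUs at 100 Hz with call graphs for ~20 s, then report
perf record -a -g -F 100 -o "perf-$ts.data" -- sleep 20
perf report -i "perf-$ts.data" --stdio > out.txt
```

Using `-- sleep 20` makes `perf record` stop on its own instead of needing Ctrl+C, which is handy when the box is barely responsive over SSH.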
A very large container image can cause a huge CPU spike. This is hard to pinpoint exactly; it could be just `docker pull` working very hard, a kubelet bug, or something else.
CloudWatch doesn't quite capture how bad this is: nodes freeze up to the point where you can't SSH into them, and everything becomes totally unresponsive. Eventually (after 7 minutes in this case) the node finally revs down and recovers, except for the Weave pods. Now the cluster is shot.
Running `kubectl delete -f https://git.io/weave-kube` followed by `kubectl apply -f https://git.io/weave-kube` does not help. To be fair, the nodes are t2.micro and have handled everything so far. Perhaps this is their natural limit; retrying with larger instances.