Open tarasglek opened 7 years ago
Note: the same applies to the terraform kubelet.
cc @dghubble
Kubelets will log error messages about being unable to reach a kube-apiserver until an apiserver is bootstrapped. You'll see logs like these before Kubernetes is successfully bootstrapped; they are normal. In a single-master cluster, if you restart or kill the kubelet (or reboot the node, etc.), the kubelet will restart and show these same errors for a time because, indeed, there is no apiserver pod. Checkpointer pods will bring back the control plane, but this may take a minute or so; it is complex. At step 5, can you tail the kubelet logs and report back if it doesn't recover after a few minutes?
Even with a single master, our Kubernetes clusters are designed to tolerate the temporary loss of the master (reboot, restart kubelet, kill pid) and recover without intervention, as enabled Container Linux auto-updates will naturally reboot nodes over time anyway.
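To make "tail the kubelet logs" concrete, here's a minimal sketch of spotting the benign errors described above. The sample lines are illustrative stand-ins, not verbatim kubelet output; on a live node you would pipe `journalctl` instead:

```shell
# Count "cannot reach apiserver" errors. On a real node:
#   journalctl -u kubelet | grep -c 'connection refused'
# The here-doc below is a hypothetical sample of such log lines.
n=$(grep -c 'connection refused' <<'EOF'
Failed to list *v1.Node: Get https://node1.example.com:443/...: dial tcp: connection refused
Failed to list *v1.Pod: Get https://node1.example.com:443/...: dial tcp: connection refused
Successfully registered node node1
EOF
)
echo "$n error lines mention connection refused"
```

If errors of this shape keep appearing after several minutes, the checkpointer has likely not brought the apiserver back.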
Sorry, you are right. I filed the bug report wrong. Indeed, killing just the master kubelet seems safe.
If I kill just the master kubelet, it seems to recover. However, if I kill all kubelets, the cluster doesn't recover (waited 10 minutes). This is 100% repeatable.
```shell
#!/bin/sh
set -x -e
for node in node1 node2 node3; do
  ssh "core@${node}.example.com" sudo pkill -f kubelet
done
```
Luckily bootkube tickles it into coming back
ssh core@node1.example.com sudo systemctl start bootkube
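After restarting bootkube, one way to check recovery is to poll until the apiserver answers rather than eyeballing it. A minimal sketch; the helper is generic shell, and the probe in the comment (node name taken from this thread) is only what you might run on a real cluster:

```shell
# Retry a probe until it succeeds or we give up.
# On a real master the probe might be, e.g.:
#   ssh core@node1.example.com kubectl get nodes
poll_until() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Demo with a trivially succeeding probe:
poll_until 5 true && echo "control plane is back"
```

If the probe never succeeds within a few minutes, that matches the non-recovery behavior reported here.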
I can retry your specific example soon. Clusters are expected to tolerate all nodes being rebooted, since complete power loss, or simply turning a cluster off for a while (e.g. when load is periodic), is OK. We usually test this with shutdowns rather than process killing, so maybe there's something there.
I can reproduce this; it does seem to be an issue. You can shut down any/all nodes, you can restart any/all kubelets, you can restart any/all docker daemons - all recover.
If the kubelet process is killed, however, kubelet.service starts again, but ~doesn't seem to trigger checkpoint recovery~ the apiserver remains inaccessible. This can be worked around by rebooting the controller node, which recovers the control plane on a fresh boot, but that isn't ideal.
Should be discussed with https://github.com/kubernetes-incubator/bootkube
Killing the kubelet shouldn't affect running workloads at all. It should just come back up and re-inspect the existing state from docker. So there shouldn't be any checkpoint recovery coming into this at all (you're not killing docker containers) - unless I'm missing some part of the reproduction besides killing the kubelet process.
Might need a bit more info here (or we can try and reproduce as well)
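The point above can be illustrated without a cluster. A toy sketch in which plain processes stand in for the kubelet and a container; everything here is a stand-in, and nothing talks to docker or Kubernetes:

```shell
# "Container": a long-running process the supervisor does not own.
sleep 30 &
workload=$!

# "Kubelet": a supervisor-ish loop we will kill, as in the repro.
sh -c 'while :; do sleep 1; done' &
supervisor=$!

kill "$supervisor"   # pkill -f kubelet, in miniature

# Killing the supervisor leaves the workload untouched:
if kill -0 "$workload" 2>/dev/null; then survived=yes; else survived=no; fi
echo "workload survived: $survived"

kill "$workload" 2>/dev/null   # clean up the stand-in container
```

That's why killing only the kubelet shouldn't require checkpoint recovery at all: the containers it was managing keep running.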
I've reproduced this on QEMU/KVM nodes. Running pkill -f kubelet on the master makes the apiserver inaccessible to users, as the OP described. You can tail the apiserver logs, and they stop immediately when the kubelet is killed. No useful messages at verbosity 8.
Issue Report
Bug
I tried the sample Kubernetes deployment in the ignition repo. It seems overly fragile compared to what happens on Ubuntu, where the kubelet is able to recover after dying.
Container Linux Version
Environment
What hardware/cloud provider/hypervisor is being used to run Container Linux? kvm
Expected Behavior
master kubelet is able to restart successfully
Actual Behavior
Reproduction Steps
Other Information