Closed rothgar closed 7 years ago
This also frequently happens while waiting for bootkube services to start. Because it takes so long I often get sent back to the "Power On" step 4-5 times. Each time I either have to wait for "bootstrap Kubernetes" or reboot the controller.
Once the node comes up the first time, do you see further restarts of the kubelet in logs? That spinner is based on the on-host kubelet. Alternately, maybe the connection is flaky - I think if a single check fails it reverts to not-ready and maybe it should allow a little wiggle room. This happens waiting before you've starting the bootkube steps on Conect Node? Repeatedly?
Its possible (though unlikely) a node can legitimately reboot due to an auto-update. But it should come back pretty quickly and only happen once.
You should always be safe just waiting, rather than rebooting the boxes again. If a node doesn't come back for a while (relative to your expectations of your hardware), mind checking the journald logs of kubelet? The on-host kubelet is supposed to be quite resilient and will retry to start in the event of etcd or flannel failures as well.
I have left it in this state waiting to connect to the kubelet for 5+ minutes and it never reconnects. I also tried manually restarting the kubelet but the installer still will not connect. I haven't yet figured out why a reboot works but so far that seems to be the most reliable.
I have checked logs and the kubelet is still running during this time and the box doesn't reboot (I'm SSHd into it).
I suspect the problem is a less reliable connection. This happened previously when I had nodes connected over a wireless bridge and happens now via a powerline network adapter. In both cases I could get ping drops of ~1/20. Even though my SSH connections don't drop. The installer health checks seem far too rigid.
Going back to this screen happens before starting bootkube as well as after (which is additionally confusing). Because it happens after I've already started bootkube I was unsure if bootkube worked and would do the correct thing (from what I can tell it did outside of #25 ). In trying to provision my cluster last night I provisioned the systems from scratch 3 times and was set back to this screen >20 times in 4 hours of trying to get the cluster running.
cc @dghubble @kbrwn
@rothgar were you able to resolve this issue with the latest releases of Tectonic?
Yes, this problem did not happen in my resent tests with the installer.
Issue Report Template
Tectonic Version
1.5.1
Environment
What hardware/cloud provider/hypervisor is being used with Tectonic? Bare Metal
Expected Behavior
Once the nodes are up and connected (green checks in Power on step) I would expect to be able to contiune to the next step without going back.
Actual Behavior
When going to the next step "Connect Nodes" the services often fail on the nodes and cause the installer to automatically go back to the "Power On" step. If I start bootkube fast enough I'm OK but often I need a few extra minutes to download the assets and let bootkube run. I have yet had the power on step continue without needing to reboot all of the nodes to get services to restart properly.
Reproduction Steps
Other Information
This happened with 1.4.7 and 1.5.1. I thought one of my workers had a bad nic but even after swapping out the machine this problem still happens. When the nodes fail often times it's the kubelet service that no longer has the green checkmark (etcd is usually fine).