Cert signing issues with controller node and kubernetes api

rothgar commented 7 years ago

Issue Report Template

Tectonic Version

1.5.1

Environment

What hardware/cloud provider/hypervisor is being used with Tectonic? Bare Metal

Expected Behavior

Cert signing for API server should be less prone to errors.

Actual Behavior

It's easy to end up with an api cert not signed with proper DNS aliases

Reproduction Steps

When going through the Tectonic installer you are asked for the name of a worker/controller, and later asked for the MAC address and name of each worker/controller.
If you specify the short hostname for the controller in the first step but the FQDN in the later step, the Kubernetes API cert is signed with the hostname as an alias while the nodes try to connect to the FQDN. That connection fails with an error and the nodes never connect.

Other Information

Users should only be prompted for machine names once to avoid this problem. When asked later for the mac address the hostname should already be filled out from the previous step.

dghubble commented 7 years ago

I suspect we need to clarify in the Installer. If I understand correctly, you're referring to the two steps in screenshots below.

screenshot from 2017-01-18 00-11-45

I'd describe the purpose of these values as:

Controller DNS - a DNS entry which resolves to the controller (or controllers via round-robin in future). This is the only value added to the SANs (there are also other defaults) in bootkube certificates for Kubernetes.
Tectonic DNS - a DNS entry which resolves to any worker(s) (possibly via round robin) since Ingress is run on all workers.
Individual worker names - these DNS names determine the name by which worker kubelets register themselves. They're also used by the installer to check the state of the workers (for green spinny's).

Reasonable values (in order) might be:

kubernetes.example.com - Kubernetes API will live here
tectonic.example.com - Tectonic dashboard will live here, via ingress
node1.example.com, node2.example.com, etc.

apiserver.crt

X509v3 Subject Alternative Name: 
                DNS:kubernetes.example.com, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, IP Address:10.3.0.1
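
To make the SAN list above concrete, here is a minimal Python sketch that splits a SAN line as printed by openssl into DNS names and IP addresses, then checks whether a given name is covered. The function names are hypothetical, and the check is a plain exact match (no wildcard handling), so treat it as an illustration rather than how a TLS client really verifies a cert.

```python
# Sketch only: exact-match lookup against a SAN line copied from openssl output.
# parse_sans and name_in_sans are hypothetical helper names.

SAN_LINE = (
    "DNS:kubernetes.example.com, DNS:kubernetes, DNS:kubernetes.default, "
    "DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, "
    "IP Address:10.3.0.1"
)


def parse_sans(san_text):
    """Split a SAN line into (dns_names, ip_addrs)."""
    dns_names, ip_addrs = [], []
    for entry in san_text.split(","):
        # Each entry looks like "DNS:name" or "IP Address:1.2.3.4".
        kind, _, value = entry.strip().partition(":")
        if kind == "DNS":
            dns_names.append(value.lower())
        elif kind == "IP Address":
            ip_addrs.append(value)
    return dns_names, ip_addrs


def name_in_sans(name, san_text):
    """Return True if name appears verbatim among the DNS or IP SANs."""
    dns_names, ip_addrs = parse_sans(san_text)
    return name.lower() in dns_names or name in ip_addrs
```

With the SAN line above, `name_in_sans("kubernetes.example.com", SAN_LINE)` is true, while a name that was never entered as Controller DNS, such as `node1.example.com`, is not covered.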

Theory

It sounds like a hostname was provided for Controller DNS, which gets put in the apiserver cert SANs. The workers will try to register with the controller via that same ControllerDNS (plus https) so I'm not sure where the mismatch occurs there. Are workers able to resolve that hostname? Have an example of some hostname you might use and the corresponding server entry in a worker's /etc/kubernetes/kubeconfig?
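
A rough way to run that comparison, sketched in Python: pull the host out of each server entry in a kubeconfig and list the hosts that are missing from the cert's DNS SANs. The sample kubeconfig and the function names are hypothetical, and the file is scanned with a regex rather than a real YAML parser, so this assumes the server entry sits on a line of its own.

```python
import re
from urllib.parse import urlsplit

# Hypothetical kubeconfig contents, for illustration only.
SAMPLE_KUBECONFIG = """\
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://kubernetes.example.com:443
"""


def kubeconfig_server_hosts(kubeconfig_text):
    """Return the hostnames found in 'server:' lines of a kubeconfig."""
    hosts = []
    pattern = re.compile(r"^[ \t]*server:[ \t]*(\S+)", re.MULTILINE)
    for match in pattern.finditer(kubeconfig_text):
        url = match.group(1).strip("'\"")
        host = urlsplit(url).hostname  # drops the scheme and port
        if host:
            hosts.append(host)
    return hosts


def hosts_missing_from_cert(kubeconfig_text, cert_dns_names):
    """Return server hosts that do not appear in the cert's DNS SANs."""
    covered = {name.lower() for name in cert_dns_names}
    return [h for h in kubeconfig_server_hosts(kubeconfig_text) if h not in covered]
```

If the cert was signed with only the short name, `hosts_missing_from_cert(SAMPLE_KUBECONFIG, ["kubernetes"])` reports `kubernetes.example.com` as uncovered, which is the kind of mismatch described in the report.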

Followup

Concerning merging the two names, they serve different purposes. This is quite unclear when creating a single controller, single worker cluster. But take the example above for instance. One would not want to register the Controller DNS entry as node1.example.com and the Tectonic DNS entry as node2.example.com, even if those were indeed the controller and worker (though it's valid). It would be more prudent to use two separate DNS records for that purpose. If more controllers are added, one could add A records for their static IPs behind the controller DNS name. If fanout to workers for ingress is insufficient or needs to tolerate more worker outages, there is flexibility to change those mappings.

It's on us to make the Installer instructions super clear, validate more inputs, and find ways to avoid possible mistakes. Feel free to share if you find some phrasing would have been more helpful here. More than anything, I want the Installer to educate about what it's doing so it's not magic, so all feedback is appreciated.

Oh gosh this is long. Sorry.

rothgar commented 7 years ago

You are correct about the steps that caused the problems. Fortunately I have used the scripts manually in the past, so I knew what was being done, but it was still unclear what the certs were being used for. I assumed that in the define workers step the MAC address and hostname were only being used for bootcfg group matching, but forgot that the kubelets manually create certs in this step.

I think most helpful of all would be moving the nodes to use the bootstrapping API like kubeadm uses, so certificates wouldn't need to be defined per node prior to provisioning.

I think the 2nd thing that is confusing is the fact that tectonic currently only allows 1 controller which means setting up a load balancer (with CNAME) in front of the controller wouldn't be helpful.

One thing I think could be helpful (although possibly less secure) is to enforce an FQDN when creating the Kubernetes certificate, but to also automatically sign the certificate with the short hostname/CNAME. So if the user enters k8s.example.com, the cert would be signed with k8s.example.com as well as k8s automatically.
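
A minimal Python sketch of that idea, assuming the installer has a validation step where it could run: reject anything that is not an FQDN, then return both the FQDN and its first label as the DNS names to sign. The function name is hypothetical and this is not how the installer currently behaves.

```python
# Sketch only: sans_for_controller is a hypothetical helper, not installer code.

def sans_for_controller(name):
    """Require an FQDN and return [fqdn, short_hostname] for the cert SANs."""
    fqdn = name.strip().rstrip(".").lower()
    labels = fqdn.split(".")
    # An FQDN needs at least two non-empty labels, e.g. "k8s.example.com".
    if len(labels) < 2 or not all(labels):
        raise ValueError("expected a fully qualified domain name, got %r" % name)
    return [fqdn, labels[0]]
```

For k8s.example.com this yields k8s.example.com and k8s, while a bare k8s is rejected with a `ValueError` instead of silently producing a cert the nodes cannot use.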

mfburnett commented 7 years ago

hey @rothgar - thanks for your suggestions here and filing this issue. we've noted your suggestions and I'll make sure to bring them up at our weekly planning meeting! installing HA controller nodes is on our immediate roadmap, so that should help address your second point.

tactical-drone commented 7 years ago

Relates to #50
