coreos / tectonic-installer

Install a Kubernetes cluster the CoreOS Tectonic Way: HA, self-hosted, RBAC, etcd Operator, and more
Apache License 2.0

Tectonic-installer not finishing on baremetal #2657

Open plamer opened 6 years ago

plamer commented 6 years ago

What keywords did you search in tectonic-installer issues before filing this one?

"etcd-wrapper[9500]: run: stat of host path /etc/ssl/etcd: stat /etc/ssl/etcd: no such file or directory" "Error: failed to run Kubelet: unable to load client CA file" "Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d"

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

What happened?

It installs CoreOS on the machines and after the reboot tectonic-installer is stuck on null_resource.etcd_secrets: Still creating...

What you expected to happen?

To finish the provisioning/installation of the cluster

How to reproduce it (as minimally and precisely as possible)?

I followed this guide https://coreos.com/tectonic/docs/latest/install/bare-metal/index.html

Anything else we need to know?

After it had been stuck for more than 30 minutes, I logged in to my to-be-master node over SSH and saw that there was no /etc/ssl/etcd directory, so etcd fails to start:

etcd-wrapper[9500]: run: stat of host path /etc/ssl/etcd: stat /etc/ssl/etcd: no such file or directory

I decided to manually copy the certs from my matchbox host. etcd then started, but I got another error:

Jan 04 11:36:26 node09.example.net kubelet-wrapper[19787]: Error: failed to run Kubelet: unable to load client CA file /etc/kubernetes/ca.crt: error reading /etc/kubernetes/ca.crt: data does not contain any valid

Then I noticed that there is no /etc/kubernetes/kubeconfig, and even if I provide ca.crt manually, it gets overwritten by kubelet.service:

ExecStartPre=/usr/bin/bash -c "grep 'certificate-authority-data' /etc/kubernetes/kubeconfig | awk '{print $2}' | base64 -d > /etc/kubernetes/ca.crt"
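
For reference, here is a rough Python equivalent of that ExecStartPre pipeline. It is only a sketch of what the shell one-liner does, not code from the installer, and the function name `extract_ca` is made up for illustration. It shows that when the kubeconfig is missing or has no `certificate-authority-data` line, the decoded result is empty, which would leave an empty ca.crt and is consistent with the kubelet error above.

```python
import base64


def extract_ca(kubeconfig_text: str) -> bytes:
    """Rough equivalent of:
    grep 'certificate-authority-data' | awk '{print $2}' | base64 -d
    """
    encoded = []
    for line in kubeconfig_text.splitlines():
        if "certificate-authority-data" in line:
            fields = line.split()
            if len(fields) >= 2:  # awk's $2 is empty when the field is absent
                encoded.append(fields[1])
    # With no matching line there is nothing to decode, so the result is empty.
    return base64.b64decode("".join(encoded))


if __name__ == "__main__":
    # Placeholder PEM text, not a real certificate.
    pem = b"-----BEGIN CERTIFICATE-----\n(placeholder)\n-----END CERTIFICATE-----\n"
    sample = (
        "clusters:\n- cluster:\n    certificate-authority-data: "
        + base64.b64encode(pem).decode()
        + "\n"
    )
    print(extract_ca(sample) == pem)  # a populated kubeconfig round-trips
    print(extract_ca("") == b"")      # a missing/empty kubeconfig yields nothing
```

So hand-placing ca.crt cannot stick while this line runs without a valid kubeconfig: the redirect truncates the file on every start.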

After commenting out this line and adding ca.crt manually, I got the following error:

kubelet-wrapper[20868]: W0104 11:42:08.587000 20868 cni.go:196] Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d

Why is this happening? What am I missing when setting up my cluster? Is it a bug, or am I not getting something right?

sym3tri commented 6 years ago

Hi @plamer, can you confirm if you made sure your ssh-agent is running as documented?

https://coreos.com/tectonic/docs/latest/install/bare-metal/requirements.html#ssh-agent
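
One way to check this mechanically is to compare the output of `ssh-add -L` with the public key given to the installer. The sketch below is an illustration only: the helper names are hypothetical, and it compares just the key type and base64 blob, since the trailing comment on a public key line may differ between the two copies.

```python
import subprocess


def key_identity(public_key_line: str):
    """Return (type, base64 blob) of an OpenSSH public key line, ignoring the comment."""
    fields = public_key_line.split()
    if len(fields) < 2:
        return None
    return fields[0], fields[1]


def agent_has_key(ssh_add_output: str, configured_key: str) -> bool:
    """True if any line of `ssh-add -L` output carries the configured key."""
    wanted = key_identity(configured_key)
    if wanted is None:
        return False
    return any(key_identity(line) == wanted for line in ssh_add_output.splitlines())


def read_agent_keys() -> str:
    """Run `ssh-add -L` and return its stdout (empty if the agent is unreachable)."""
    result = subprocess.run(["ssh-add", "-L"], capture_output=True, text=True)
    return result.stdout
```

On the installer host this would be called as `agent_has_key(read_agent_keys(), configured_key)`, where `configured_key` is the public key string supplied to tectonic-installer.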

plamer commented 6 years ago

@sym3tri Yes, checked ssh-add -L on my matchbox/tectonic host and the key matches the one provided in the tectonic-installer. Here are some logs from one of the nodes:

systemd[1]: Started OpenSSH per-connection server daemon (192.168.11.11:57170)
sshd[22384]: Connection closed by 192.168.11.11 port 57170 [preauth]

fforootd commented 6 years ago

We had something similar.

In our case we needed to start all the servers that are planned with terraform, not just the master nodes.

Otherwise you will be stuck in this step.
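
A quick way to see which planned machines are not up yet is to probe each one on the SSH port. This is a minimal sketch under an assumption the thread does not confirm: that the etcd_secrets step is waiting to reach the nodes over SSH, as the ssh-agent requirement suggests. The function name and the host names in the usage note are placeholders.

```python
import socket


def unreachable_nodes(hosts, port=22, timeout=3.0):
    """Return the hosts that do not accept a TCP connection on `port`."""
    failed = []
    for host in hosts:
        try:
            # Open and immediately close the connection; only reachability matters.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts, and name resolution failures.
            failed.append(host)
    return failed
```

For example, `unreachable_nodes(["node1.example.com", "node2.example.com"])` returns the subset that is not answering on port 22. An open port only shows that sshd is listening; it says nothing about whether key authentication will succeed.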

plamer commented 6 years ago

Okay, but why don't the kubeconfig and the certificates get copied to the nodes? They are in /opt/tectonic/generated but not in /etc/ssl/etcd (for the certs) or /etc/kubernetes/kubeconfig (for k8s). And I have both my machines (master node and worker node) running: CoreOS gets installed, they reboot, and then etcd fails to start on the master and kubelet fails to start on the worker.
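
The two locations named above can be checked on a node with a few lines of Python. This is a sketch only: the paths are the ones mentioned in this thread, the function name is hypothetical, and the `root` parameter exists purely so the check can be tried against a scratch directory.

```python
import os
import tempfile


def missing_paths(root="/"):
    """Return the expected paths (as named in this thread) that are absent under `root`."""
    missing = []
    if not os.path.isdir(os.path.join(root, "etc/ssl/etcd")):
        missing.append("/etc/ssl/etcd")
    kubeconfig = os.path.join(root, "etc/kubernetes/kubeconfig")
    # An empty kubeconfig is as unhelpful as a missing one.
    if not os.path.isfile(kubeconfig) or os.path.getsize(kubeconfig) == 0:
        missing.append("/etc/kubernetes/kubeconfig")
    return missing


if __name__ == "__main__":
    # Demo against a throwaway tree so the sketch runs anywhere;
    # on a node, call missing_paths() with the default root.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "etc/ssl/etcd"))
        print(missing_paths(root))
```

An empty list means both locations exist; it does not verify that the certificates or the kubeconfig are valid.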

fbarbeira commented 6 years ago

I'm hitting the same error on a baremetal installation.

The directory "/etc/ssl/etcd" is not present:

Jan 10 08:04:23 c1-k8s.server.com etcd-wrapper[1925]: + exec /usr/bin/rkt run --volume etcd-ssl,kind=host,source=/etc/ssl/etcd --mount volume=etcd-ssl,target=/etc/ssl/etcd --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume coreos-systemd-dir,kind=host,source=/run/systemd/system,readOnly=true --mount volume=coreos-notify,target=/run/systemd/notify --volume coreos-notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --volume coreos-data-dir,kind=host,source=/var/lib/etcd,readOnly=false --volume coreos-etc-ssl-certs,kind=host,source=/etc/ssl/certs,readOnly=true --volume coreos-usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume coreos-etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume coreos-etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=coreos-data-dir,target=/var/lib/etcd --mount volume=coreos-etc-ssl-certs,target=/etc/ssl/certs --mount volume=coreos-usr-share-certs,target=/usr/share/ca-certificates --mount volume=coreos-etc-hosts,target=/etc/hosts --mount volume=coreos-etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/etcd:v3.1.8 --user=232 -- --name=dh-k8s-etcd-0 --advertise-client-urls=https://c1-k8s.server.com:2379 --cert-file=/etc/ssl/etcd/server.crt --key-file=/etc/ssl/etcd/server.key --peer-cert-file=/etc/ssl/etcd/peer.crt --peer-key-file=/etc/ssl/etcd/peer.key --peer-trusted-ca-file=/etc/ssl/etcd/ca.crt --peer-client-cert-auth=true --initial-cluster=dh-k8s-etcd-0=https://c1-k8s.server.com:2380 --initial-advertise-peer-urls=https://c1-k8s.server.com:2380 --listen-client-urls=https://0.0.0.0:2379 --listen-peer-urls=https://0.0.0.0:2380
Jan 10 08:04:24 c1-k8s.server.com etcd-wrapper[1925]: run: stat of host path /etc/ssl/etcd: stat /etc/ssl/etcd: no such file or directory
Jan 10 08:04:24 c1-k8s.server.com systemd[1]: etcd-member.service: Main process exited, code=exited, status=254/n/a
Jan 10 08:04:24 c1-k8s.server.com systemd[1]: Failed to start etcd (System Application Container).
Jan 10 08:04:24 c1-k8s.server.com systemd[1]: etcd-member.service: Unit entered failed state.
Jan 10 08:04:24 c1-k8s.server.com systemd[1]: etcd-member.service: Failed with result 'exit-code'.

fbarbeira commented 6 years ago

False alarm, in my case it was a misconfiguration of our dhcpd server. It's working now.

pmb311 commented 6 years ago

I am seeing this as well on version tectonic_1.8.4-tectonic.3. I had to do this weird dance with terraform during the pxeboot, with the following steps (this was a re-install):

1) Make sure there is nothing lingering around from previous coreos/tectonic installs, e.g. zero out the drives
2) terraform destroy && rm tf.state*
3) Shut down the cluster nodes
4) Run terraform apply such that it hangs on the etcd_secrets part
5) Pxeboot the nodes

The provisioning routine will eventually complete and the etcd/kubeconfig/bootstrap config files will land and the k8s bootstrapping will commence.

waterdrops commented 6 years ago

I hit the same error.

tectonic info: 1.9.6 latest
CoreOS info: 1800.7.0
Terraform v0.10.7

Sep 05 05:22:29 node1.example.com etcd-wrapper[3379]: Downloading ACI:  4.51 MB/12.5 MB
Sep 05 05:22:30 node1.example.com etcd-wrapper[3379]: Downloading ACI:  6.89 MB/12.5 MB
Sep 05 05:22:31 node1.example.com etcd-wrapper[3379]: Downloading ACI:  9.25 MB/12.5 MB
Sep 05 05:22:32 node1.example.com etcd-wrapper[3379]: Downloading ACI:  11.6 MB/12.5 MB
Sep 05 05:22:33 node1.example.com etcd-wrapper[3379]: Downloading ACI:  12.5 MB/12.5 MB
Sep 05 05:22:33 node1.example.com etcd-wrapper[3379]: image: signature verified:
Sep 05 05:22:33 node1.example.com etcd-wrapper[3379]:   Quay.io ACI Converter (ACI conversion signing key) <support@quay.io>
Sep 05 05:22:34 node1.example.com etcd-wrapper[3379]: run: stat of host path /etc/ssl/etcd: stat /etc/ssl/etcd: no such file or directory
Sep 05 05:22:34 node1.example.com systemd[1]: etcd-member.service: Main process exited, code=exited, status=254/n/a
Sep 05 05:22:34 node1.example.com systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Sep 05 05:22:34 node1.example.com systemd[1]: Failed to start etcd (System Application Container).
Sep 05 05:22:34 node1.example.com docker[3282]: time="2018-09-05T05:22:34Z" level=debug msg="Opened keyring with 1 keys"
Sep 05 05:22:34 node1.example.com docker[3282]: time="2018-09-05T05:22:34Z" level=debug msg="good signature from 50BDD3E0FC8A365E"
Sep 05 05:22:34 node1.example.com docker[3282]: time="2018-09-05T05:22:34Z" level=info msg="Installing docker:17.03 for os versions [1800.7.0]"
Sep 05 05:22:34 node1.example.com docker[3282]: time="2018-09-05T05:22:34Z" level=debug msg="executing: /usr/sbin/torcx [image list -n 1800.7.0 docker]"
Sep 05 05:22:34 node1.example.com docker[3282]: time="2018-09-05T05:22:34Z" level=info msg="fetching addon at https://tectonic-torcx.release.core-os.net/pkgs/amd64-usr/docker/ce>
Sep 05 05:22:34 node1.example.com docker[3282]: time="2018-09-05T05:22:34Z" level=debug msg="GET https://tectonic-torcx.release.core-os.net/pkgs/amd64-usr/docker/cee3bab0933f56b>
Sep 05 05:22:44 node1.example.com systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Sep 05 05:22:44 node1.example.com systemd[1]: etcd-member.service: Scheduled restart job, restart counter is at 1.
Sep 05 05:22:44 node1.example.com systemd[1]: Stopped etcd (System Application Container).
Sep 05 05:22:44 node1.example.com systemd[1]: Starting etcd (System Application Container)...
Sep 05 05:22:44 node1.example.com rkt[3650]: rm: unable to resolve UUID from file: open /var/lib/coreos/etcd-member-wrapper.uuid: no such file or directory
Sep 05 05:22:44 node1.example.com rkt[3650]: rm: failed to remove one or more pods
Sep 05 05:22:44 node1.example.com etcd-wrapper[3658]: ++ id -u etcd
Sep 05 05:22:44 node1.example.com etcd-wrapper[3658]: + exec /usr/bin/rkt run --volume etcd-ssl,kind=host,source=/etc/ssl/etcd --mount volume=etcd-ssl,target=/etc/ssl/etcd --tru>
Sep 05 05:22:44 node1.example.com update_engine[3275]: I0905 05:22:44.845259  3275 update_attempter.cc:483] Already updated boot flags. Skipping.
Sep 05 05:22:45 node1.example.com etcd-wrapper[3658]: run: stat of host path /etc/ssl/etcd: stat /etc/ssl/etcd: no such file or directory
Sep 05 05:22:45 node1.example.com systemd[1]: etcd-member.service: Main process exited, code=exited, status=254/n/a
Sep 05 05:22:45 node1.example.com systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Sep 05 05:22:45 node1.example.com systemd[1]: Failed to start etcd (System Application Container).
Sep 05 05:22:55 node1.example.com systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Sep 05 05:22:55 node1.example.com systemd[1]: etcd-member.service: Scheduled restart job, restart counter is at 2.
Sep 05 05:22:55 node1.example.com systemd[1]: Stopped etcd (System Application Container).
Sep 05 05:22:55 node1.example.com systemd[1]: Starting etcd (System Application Container)...
Sep 05 05:22:55 node1.example.com rkt[3676]: rm: unable to resolve UUID from file: open /var/lib/coreos/etcd-member-wrapper.uuid: no such file or directory
Sep 05 05:22:55 node1.example.com rkt[3676]: rm: failed to remove one or more pods

I figured out that Tectonic uses FQDNs to deploy all components. After I added records for the nodes to DNS, the issue was resolved.
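
A small resolution check can catch this before starting the install. The sketch below is illustrative only: the function name is hypothetical and the host names in the usage note are placeholders. The `resolve` parameter defaults to the system resolver and can be swapped out for testing.

```python
import socket


def unresolved_hosts(fqdns, resolve=socket.getaddrinfo):
    """Return the names from `fqdns` that the resolver cannot look up."""
    failed = []
    for name in fqdns:
        try:
            resolve(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed
```

For example, `unresolved_hosts(["node1.example.com", "node2.example.com"])` lists the node names that still need DNS records. Run it from the installer host and, if possible, from a node as well, since the two may use different resolvers.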

waterdrops commented 6 years ago

@plamer Have you resolved your issue?

khanattlab commented 5 years ago

I have the same issue. Has anyone figured out the problem? @plamer - did you fix it or just stop trying?

plamer commented 5 years ago

Sorry, I missed the previous comment. I haven't tried again; I went with a plain Kubernetes bare-metal install using kubeadm.