coreos / tectonic-installer

Install a Kubernetes cluster the CoreOS Tectonic Way: HA, self-hosted, RBAC, etcd Operator, and more
Apache License 2.0

Tectonics cannot create ETCD cluster or bootstrap Kubernetes cluster #1379

Closed roffe closed 6 years ago

roffe commented 7 years ago

What keywords did you search in tectonic-installer issues before filing this one?

null_resource.etcd_secrets: Still creating

Is this a BUG REPORT or FEATURE REQUEST?

Choose one: BUG REPORT

Versions

What happened?

Followed your instructions regarding network, PXE boot, DHCP etc. Tectonic fails to create a cluster or even bring up an etcd node.

PXE boot and everything works. The machines get installed, rebooted, and get their hostnames entered as master & worker. Then the scripts start failing.

What you expected to happen?

Tectonic's magic 15 minutes to Kubernetes to kick in.

How to reproduce it (as minimally and precisely as possible)?

Follow your guides: https://coreos.com/tectonic/docs/latest/install/bare-metal/ https://coreos.com/matchbox/docs/0.6.1/network-setup.html https://coreos.com/matchbox/docs/latest/deployment.html#coreos

Anything else we need to know?

PXE boot, DHCP, DNS: everything works. The Terraform manifest is created on the server and pushed to the machines on power-on, and they get "installed". After that, everything that looks like it's supposed to be the Kubernetes bootstrap fails.

Log from tectonics: tectonic-agi-cph.txt

Output from failing master node:

Jul 13 07:31:33 master01.roffe.nu systemd[1]: etcd-member.service: Service hold-off time over, scheduling restart.
Jul 13 07:31:33 master01.roffe.nu systemd[1]: Stopped etcd (System Application Container).
-- Subject: Unit etcd-member.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd-member.service has finished shutting down.
Jul 13 07:31:33 master01.roffe.nu systemd[1]: Starting etcd (System Application Container)...
-- Subject: Unit etcd-member.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd-member.service has begun starting up.
Jul 13 07:31:33 master01.roffe.nu rkt[1985]: "d2b5e1f5-904b-4086-bd38-09e7a827b99f"
Jul 13 07:31:33 master01.roffe.nu etcd-wrapper[1997]: ++ id -u etcd
Jul 13 07:31:33 master01.roffe.nu etcd-wrapper[1997]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=systemd-dir,target=/run/systemd/system --volume systemd-dir,kind=host,source=/run/systemd/system,readOnly=true --mount volume=notify,target=/run/systemd/notify --volume notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --volume data-dir,kind=host,source=/var/lib/etcd,readOnly=false --volume etc-ssl-certs,kind=host,source=/etc/ssl/etcd,readOnly=true --volume usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=data-dir,target=/var/lib/etcd --mount volume=etc-ssl-certs,target=/etc/ssl/certs --mount volume=usr-share-certs,target=/usr/share/ca-certificates --mount volume=etc-hosts,target=/etc/hosts --mount volume=etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/etcd:v3.1.8 --user=232 --
Jul 13 07:31:34 master01.roffe.nu etcd-wrapper[1997]: run: stat of host path /etc/ssl/etcd: stat /etc/ssl/etcd: no such file or directory
Jul 13 07:31:34 master01.roffe.nu systemd[1]: etcd-member.service: Main process exited, code=exited, status=254/n/a
Jul 13 07:31:34 master01.roffe.nu systemd[1]: Failed to start etcd (System Application Container).
-- Subject: Unit etcd-member.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd-member.service has failed.
--
-- The result is failed.
Jul 13 07:31:34 master01.roffe.nu systemd[1]: etcd-member.service: Unit entered failed state.
Jul 13 07:31:34 master01.roffe.nu systemd[1]: etcd-member.service: Failed with result 'exit-code'.

Output from failing controller node install:

Jul 13 07:33:41 node01.roffe.nu systemd[1]: kubelet-env.service: Service hold-off time over, scheduling restart.
Jul 13 07:33:41 node01.roffe.nu systemd[1]: Stopped Determine the Kubelet Image Version.
-- Subject: Unit kubelet-env.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit kubelet-env.service has finished shutting down.
Jul 13 07:33:41 node01.roffe.nu systemd[1]: Starting Determine the Kubelet Image Version...
-- Subject: Unit kubelet-env.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit kubelet-env.service has begun starting up.
Jul 13 07:33:41 node01.roffe.nu kernel: docker0: port 1(veth28d42e5) entered blocking state
Jul 13 07:33:41 node01.roffe.nu kernel: docker0: port 1(veth28d42e5) entered disabled state
Jul 13 07:33:41 node01.roffe.nu kernel: device veth28d42e5 entered promiscuous mode
Jul 13 07:33:41 node01.roffe.nu kernel: IPv6: ADDRCONF(NETDEV_UP): veth28d42e5: link is not ready
Jul 13 07:33:41 node01.roffe.nu systemd-udevd[4262]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Network configuration changed, trying to establish connection.
Jul 13 07:33:41 node01.roffe.nu systemd-udevd[4262]: Could not generate persistent MAC address for vethc7731f5: No such file or directory
Jul 13 07:33:41 node01.roffe.nu systemd-udevd[4263]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 13 07:33:41 node01.roffe.nu systemd-udevd[4263]: Could not generate persistent MAC address for veth28d42e5: No such file or directory
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Synchronized to time server 78.156.100.202:123 (0.coreos.pool.ntp.org).
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Network configuration changed, trying to establish connection.
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Synchronized to time server 78.156.100.202:123 (0.coreos.pool.ntp.org).
Jul 13 07:33:41 node01.roffe.nu kernel: SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Network configuration changed, trying to establish connection.
Jul 13 07:33:41 node01.roffe.nu kernel: eth0: renamed from vethc7731f5
Jul 13 07:33:41 node01.roffe.nu kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth28d42e5: link becomes ready
Jul 13 07:33:41 node01.roffe.nu kernel: docker0: port 1(veth28d42e5) entered blocking state
Jul 13 07:33:41 node01.roffe.nu kernel: docker0: port 1(veth28d42e5) entered forwarding state
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Synchronized to time server 78.156.100.202:123 (0.coreos.pool.ntp.org).
Jul 13 07:33:41 node01.roffe.nu systemd-networkd[710]: veth28d42e5: Gained carrier
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Network configuration changed, trying to establish connection.
Jul 13 07:33:41 node01.roffe.nu systemd-networkd[710]: docker0: Gained carrier
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Synchronized to time server 78.156.100.202:123 (0.coreos.pool.ntp.org).
Jul 13 07:33:41 node01.roffe.nu bash[4252]: F0713 07:33:41.310182       1 main.go:29] Failed to build config: stat /etc/kubernetes/kubeconfig: no such file or directory
Jul 13 07:33:41 node01.roffe.nu kernel: vethc7731f5: renamed from eth0
Jul 13 07:33:41 node01.roffe.nu systemd-networkd[710]: veth28d42e5: Lost carrier
Jul 13 07:33:41 node01.roffe.nu kernel: docker0: port 1(veth28d42e5) entered disabled state
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Network configuration changed, trying to establish connection.
Jul 13 07:33:41 node01.roffe.nu systemd-udevd[4312]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Synchronized to time server 78.156.100.202:123 (0.coreos.pool.ntp.org).
Jul 13 07:33:41 node01.roffe.nu kernel: docker0: port 1(veth28d42e5) entered disabled state
Jul 13 07:33:41 node01.roffe.nu kernel: device veth28d42e5 left promiscuous mode
Jul 13 07:33:41 node01.roffe.nu kernel: docker0: port 1(veth28d42e5) entered disabled state
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Network configuration changed, trying to establish connection.
Jul 13 07:33:41 node01.roffe.nu systemd-timesyncd[683]: Synchronized to time server 78.156.100.202:123 (0.coreos.pool.ntp.org).
Jul 13 07:33:41 node01.roffe.nu systemd[1]: kubelet-env.service: Control process exited, code=exited status=255
Jul 13 07:33:41 node01.roffe.nu systemd[1]: Failed to start Determine the Kubelet Image Version.
-- Subject: Unit kubelet-env.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit kubelet-env.service has failed.
--
-- The result is failed.
Jul 13 07:33:41 node01.roffe.nu systemd[1]: kubelet-env.service: Unit entered failed state.
Jul 13 07:33:41 node01.roffe.nu systemd[1]: kubelet-env.service: Failed with result 'exit-code'.
Jul 13 07:33:42 node01.roffe.nu systemd-networkd[710]: docker0: Lost carrier
Jul 13 07:33:42 node01.roffe.nu systemd-timesyncd[683]: Network configuration changed, trying to establish connection.
Jul 13 07:33:42 node01.roffe.nu systemd-timesyncd[683]: Synchronized to time server 78.156.100.202:123 (0.coreos.pool.ntp.org).

roffe commented 7 years ago

I have found the issue, which is stupidity on my side as well as, imho, poor documentation.

The whole environment is started with PXE boot over DHCP, but the Tectonic installer requires a MAC address and a hostname for each node. It uses this hostname to connect to the nodes to provision them, but the hostnames cannot be provisioned before the machines are deployed, as their addresses are DHCP-assigned.

I guess this is a catch-22:

One cannot provide a complete hostname with IP until the machines are booted and one knows what random IP the DHCP server assigned them, and Tectonic cannot provision the hosts without fixed hostnames, which is going to be hard when the machines' IPs are assigned by DHCP.

Which leads to the next question: Am I supposed to run DHCP with fixed leases and fixed hostnames pre-tied to IP addresses to get this to work?
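For anyone landing here with the same question, fixed leases can be sketched roughly as below. This assumes dnsmasq is the DHCP/DNS server; the function name `static_leases` is hypothetical, and the MAC and IP values in the usage note are made up. The function only turns "MAC IP HOSTNAME" lines into dnsmasq `dhcp-host=` and `host-record=` entries, and it is not something the installer itself provides.

```shell
#!/usr/bin/env bash
# Sketch only: assumes dnsmasq serves DHCP and DNS for the cluster network.
# Reads "MAC IP HOSTNAME" triples from stdin and prints dnsmasq config lines
# that pin each MAC address to a fixed IP and hostname.
static_leases() {
  local mac ip host
  while read -r mac ip host; do
    # Skip blank lines and comment lines in the input.
    [ -z "$mac" ] && continue
    case "$mac" in
      \#*) continue ;;
    esac
    # Fixed DHCP lease: this MAC always gets this IP and hostname.
    printf 'dhcp-host=%s,%s,%s\n' "$mac" "$ip" "$host"
    # Matching DNS record so the hostname resolves before the node boots.
    printf 'host-record=%s,%s\n' "$host" "$ip"
  done
}
```

Feeding it a line such as `52:54:00:aa:bb:01 192.168.1.21 master01.roffe.nu` (hypothetical MAC and IP) prints one `dhcp-host=` and one `host-record=` entry, which could then be appended to the dnsmasq configuration before the nodes are powered on.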

roffe commented 7 years ago

I got around it by adding the IPs the nodes got to the hostnames after the initial bootstrap. After that the cluster creation could continue.

rushins commented 7 years ago

Hello roffe, I got the same issue, but I set up DNS entries for the IPs the nodes booted with from the Tectonic installer. Can you give me a hint? I get this error: Failed to start Determine the Kubelet Image Version. -- Subject: Unit kubelet-env.service has failed. I see the kubelet-env service is failing with a failed start condition.

============
sudo systemctl status kubelet-env.service
● kubelet-env.service - Determine the Kubelet Image Version
   Loaded: loaded (/etc/systemd/system/kubelet-env.service; enabled; vendor preset: enabled)
   Active: inactive (dead) (Result: exit-code) since Wed 2017-08-16 05:21:22 UTC; 12min ago
Condition: start condition failed at Wed 2017-08-16 05:21:25 UTC; 12min ago
           └─ ConditionPathExists=!/etc/kubernetes/kubelet.env was not met
  Process: 10542 ExecStartPre=/usr/bin/bash -c docker run --rm -v /etc/kubernetes/kubeconfig:/kubeconfig quay.io/coreos/kube-version:0.1.0 --kubeconfig=/kubeconfig > /etc/kubernetes/kube.ve
  Process: 10535 ExecStartPre=/bin/mkdir -p /etc/kubernetes (code=exited, status=0/SUCCESS)
      CPU: 38ms

Aug 16 05:21:11 lvpal3013.pal.sa.corp systemd[1]: kubelet-env.service: Unit entered failed state.
Aug 16 05:21:11 lvpal3013.pal.sa.corp systemd[1]: kubelet-env.service: Failed with result 'exit-code'.
Aug 16 05:21:22 lvpal3013.pal.sa.corp systemd[1]: kubelet-env.service: Service hold-off time over, scheduling restart.
Aug 16 05:21:22 lvpal3013.pal.sa.corp systemd[1]: Stopped Determine the Kubelet Image Version.

=================

Do you have any idea how to fix this?

polonel commented 7 years ago

This is happening to me. I put the MAC addresses into DHCP so the nodes and master would pull the correct IPs during bootup, and I have verified this, but the cluster will not form. I just get constant spam about the docker port entering a disabled state.

alexandrufulop commented 6 years ago

Issue: The etcd certificates are missing. It seems that there is no /etc/ssl/etcd directory.
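A quick way to confirm this symptom on a node is to check for the two paths the logs above complain about, /etc/ssl/etcd and /etc/kubernetes/kubeconfig. The helper below is a minimal sketch; the function name `report_missing` is hypothetical and not part of any Tectonic tooling.

```shell
#!/usr/bin/env bash
# Sketch only: prints every given path that does not exist on this machine.
# Returns 1 if at least one path is missing, 0 otherwise.
report_missing() {
  local status=0 path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      printf 'missing: %s\n' "$path"
      status=1
    fi
  done
  return "$status"
}
```

On a master or worker this could be run as `report_missing /etc/ssl/etcd /etc/kubernetes/kubeconfig`. If both are reported missing, the installer most likely never managed to copy its assets to the node.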

<< Solution >>

The Tectonic installer cannot reach your newly created nodes. The installer will ssh into the nodes using the private key of the ssh key you inserted in the Tectonic cluster creation wizard. Before beginning the setup you will need to add your ssh private key to the ssh agent running on the machine which is running the Tectonic installer.

From other users: "Make sure the SSH key specified in the tfvars file is also added to the SSH agent on the machine running terraform. Without this, terraform is not able to SSH copy the assets and start bootkube. Also make sure that the SSH known_hosts file doesn't have old records of the API DNS name (fingerprints will not match)."

"Tectonic installer will add the installer machine's public SSH key to all machines in the cluster. The key must be on the installing machine's ssh-agent, and it is used to configure nodes. Check if a key already exists in the ssh-agent using ssh-add -l. If a key must be added to the agent, use ssh-add Path/ToYour/KeyFile. Note that on OSX it may be necessary to re-add keys from your keyring to the agent on each login."

Steps to take:

On the machine that runs the tectonic installer (in my case I ran the tectonic installer on the matchbox machine) do this:

1) Make sure ssh-agent is running:
   $ eval $(ssh-agent -s)
2) Check your currently attached keys (none by default):
   $ ssh-add -L
3) Add the ssh private key that corresponds to the ssh-rsa key you will be using in the installer:
   $ ssh-add ~/.ssh/id_rsa

If you don't have any keys, you can generate one like this:
   $ ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
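The steps above can be wrapped in a small script so the key is only added when the agent does not already hold it. This is a rough sketch, not part of the installer: the function names are hypothetical, KEY_FILE defaults to ~/.ssh/id_rsa as in the steps, and it assumes the matching public key sits next to it as KEY_FILE.pub and that an agent is already running in the current shell.

```shell
#!/usr/bin/env bash
# Sketch only: KEY_FILE is an assumed path; use the private key whose public
# half was given to the Tectonic installer.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# Succeeds if the fingerprint ($1) is the second field of any line in the
# `ssh-add -l` style listing read from stdin.
listing_has_fingerprint() {
  awk -v fp="$1" '$2 == fp { found = 1 } END { exit !found }'
}

# Adds the key to the running agent unless its fingerprint is already listed.
ensure_key_loaded() {
  local key_file="$1" fingerprint
  # `ssh-keygen -lf` prints "<bits> <fingerprint> <comment> (<type>)".
  fingerprint="$(ssh-keygen -lf "${key_file}.pub" | awk '{ print $2 }')"
  [ -n "$fingerprint" ] || return 1
  if ssh-add -l 2>/dev/null | listing_has_fingerprint "$fingerprint"; then
    echo "key already loaded: $fingerprint"
  else
    ssh-add "$key_file"
  fi
}
```

After `eval $(ssh-agent -s)`, calling `ensure_key_loaded "$KEY_FILE"` in the same shell that will run the installer should leave the key in the agent. Comparing fingerprints rather than file names avoids adding the same key twice under a different path.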

Good luck! Alexandru

On Fri, Oct 13, 2017 at 8:20 PM, tiagoaquino notifications@github.com wrote:

I'm having the same problem "/etc/ssl/etcd: stat /etc/ssl/etcd: no such file or directory" installing on bare metal.

ps: PXE boot, DHCP, DNS, matchbox... everything works fine.

Any solution please?? Thanks


tiagoaquino commented 6 years ago

Thanks @alexandrufulop !

Solved my problem!!! :)