coreos / tectonic-installer

Install a Kubernetes cluster the CoreOS Tectonic Way: HA, self-hosted, RBAC, etcd Operator, and more
Apache License 2.0

Installer stuck in reboot loop on bare metal #306

Closed tnsasse closed 7 years ago

tnsasse commented 7 years ago

Hi guys,

I am trying to set up a Tectonic cluster with 3 nodes (1 controller and 2 workers) on bare metal. To be precise, the "bare metal" machines are XenServer virtual machines.

Problem: The machines are stuck in a loop: they install CoreOS Container Linux, reboot, and then start the installation all over again.

What I did:

What am I missing? Thanks for any advice! Tobi

tnsasse commented 7 years ago

I am attaching the log file from one of the nodes (the controller). Seeing the partitioning errors in it, maybe the install didn't go through as successfully as I thought...

core@localhost ~ $ sudo journalctl -f
-- Logs begin at Mon 2017-04-24 07:21:44 UTC. --
Apr 24 07:22:04 localhost systemd[1072]: Reached target Paths.
Apr 24 07:22:04 localhost systemd[1072]: Reached target Timers.
Apr 24 07:22:04 localhost systemd[1072]: Reached target Sockets.
Apr 24 07:22:04 localhost systemd[1072]: Reached target Basic System.
Apr 24 07:22:04 localhost systemd[1072]: Reached target Default.
Apr 24 07:22:04 localhost systemd[1072]: Startup finished in 20ms.
Apr 24 07:22:04 localhost systemd[1]: Started User Manager for UID 500.
Apr 24 07:22:05 localhost sudo[1102]:     core : TTY=pts/0 ; PWD=/home/core ; USER=root ; COMMAND=/bin/journalctl -f
Apr 24 07:22:05 localhost sudo[1102]: pam_unix(sudo:session): session opened for user root by core(uid=0)
Apr 24 07:22:05 localhost sudo[1102]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Apr 24 07:22:14 localhost systemd-networkd[926]: eth0: Configured
Apr 24 07:22:14 localhost systemd-timesyncd[832]: Network configuration changed, trying to establish connection.
Apr 24 07:22:14 localhost systemd-networkd-wait-online[1024]: ignoring: lo
Apr 24 07:22:14 localhost systemd[1]: Started Wait for Network to be Configured.
Apr 24 07:22:14 localhost systemd[1]: Reached target Network is Online.
Apr 24 07:22:14 localhost systemd[1]: Started installer.service.
Apr 24 07:22:14 localhost systemd[1]: Reached target Multi-User System.
Apr 24 07:22:14 localhost systemd[1]: Startup finished in 3.868s (kernel) + 11.554s (initrd) + 19.368s (userspace) = 34.790s.
Apr 24 07:22:14 localhost systemd-timesyncd[832]: Synchronized to time server 85.220.190.246:123 (0.coreos.pool.ntp.org).
Apr 24 07:22:14 localhost installer[1124]: + curl 'http://matchbox.g.hw.byte23.net:8080/ignition?uuid=255d7eab-7f7c-72b4-4b0c-4c52180babb6&mac=5e-74-39-8e-2b-42&os=installed' -o ignition.json
Apr 24 07:22:14 localhost installer[1124]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Apr 24 07:22:14 localhost installer[1124]:                                  Dload  Upload   Total   Spent    Left  Speed
Apr 24 07:22:14 localhost installer[1124]: [158B blob data]
Apr 24 07:22:14 localhost installer[1124]: + coreos-install -d /dev/xvda -C stable -V 1298.7.0 -i ignition.json -b http://matchbox.g.hw.byte23.net:8080/assets/coreos
Apr 24 07:22:15 localhost installer[1124]: Downloading the signature for http://matchbox.g.hw.byte23.net:8080/assets/coreos/1298.7.0/coreos_production_image.bin.bz2...
Apr 24 07:22:15 localhost installer[1124]: 2017-04-24 07:22:15 URL:http://matchbox.g.hw.byte23.net:8080/assets/coreos/1298.7.0/coreos_production_image.bin.bz2.sig [564/564] -> "/tmp/coreos-install.JxpPC4zs6y/coreos_production_image.bin.bz2.sig" [1]
Apr 24 07:22:15 localhost installer[1124]: Downloading, writing and verifying coreos_production_image.bin.bz2...
Apr 24 07:22:15 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Apr 24 07:22:15 localhost kernel: GPT:9289727 != 52428799
Apr 24 07:22:15 localhost kernel: GPT:Alternate GPT header not at the end of the disk.
Apr 24 07:22:15 localhost kernel: GPT:9289727 != 52428799
Apr 24 07:22:15 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Apr 24 07:22:15 localhost kernel:  xvda: xvda1 xvda2 xvda3 xvda4 xvda6 xvda7 xvda9
Apr 24 07:23:10 localhost installer[1124]: 2017-04-24 07:23:10 URL:http://matchbox.g.hw.byte23.net:8080/assets/coreos/1298.7.0/coreos_production_image.bin.bz2 [275389698/275389698] -> "-" [1]
Apr 24 07:23:16 localhost installer[1124]: gpg: Signature made Fri Mar 31 02:26:07 2017 UTC
Apr 24 07:23:16 localhost installer[1124]: gpg:                using RSA key 48F9B96A2E16137F
Apr 24 07:23:16 localhost installer[1124]: gpg:                issuer "buildbot@coreos.com"
Apr 24 07:23:16 localhost installer[1124]: gpg: key 50E0885593D2DCB4 marked as ultimately trusted
Apr 24 07:23:16 localhost installer[1124]: gpg: checking the trustdb
Apr 24 07:23:16 localhost installer[1124]: gpg: marginals needed: 3  completes needed: 1  trust model: pgp
Apr 24 07:23:16 localhost installer[1124]: gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
Apr 24 07:23:16 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Apr 24 07:23:16 localhost kernel: GPT:9289727 != 52428799
Apr 24 07:23:16 localhost kernel: GPT:Alternate GPT header not at the end of the disk.
Apr 24 07:23:16 localhost installer[1124]: gpg: Good signature from "CoreOS Buildbot (Offical Builds) <buildbot@coreos.com>" [ultimate]
Apr 24 07:23:16 localhost kernel: GPT:9289727 != 52428799
Apr 24 07:23:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Apr 24 07:23:16 localhost kernel:  xvda: xvda1 xvda2 xvda3 xvda4 xvda6 xvda7 xvda9
Apr 24 07:23:16 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Apr 24 07:23:16 localhost kernel: GPT:9289727 != 52428799
Apr 24 07:23:16 localhost kernel: GPT:Alternate GPT header not at the end of the disk.
Apr 24 07:23:16 localhost kernel: GPT:9289727 != 52428799
Apr 24 07:23:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Apr 24 07:23:16 localhost kernel:  xvda: xvda1 xvda2 xvda3 xvda4 xvda6 xvda7 xvda9
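
A note on the GPT lines in the log above: they are probably not a sign of a failed install. The kernel prints two sector numbers, the location the primary GPT header records for the backup header (9289727) and the last sector of /dev/xvda (52428799). A mismatch like this is what one would expect when a small disk image is written to a larger disk. The sketch below converts both numbers to sizes; the 512-byte sector size is an assumption, not something the log states.

```shell
#!/bin/sh
# Values copied from the kernel line "GPT:9289727 != 52428799".
alt_header_lba=9289727   # where the primary header says the backup header lives
disk_last_lba=52428799   # last addressable sector of /dev/xvda
sector_bytes=512         # ASSUMPTION: 512-byte logical sectors

# LBAs are zero-based, so the sector count is the last LBA plus one.
image_mib=$(( (alt_header_lba + 1) * sector_bytes / 1024 / 1024 ))
disk_gib=$(( (disk_last_lba + 1) * sector_bytes / 1024 / 1024 / 1024 ))

echo "image layout ends at ${image_mib} MiB"
echo "disk size is ${disk_gib} GiB"
```

Under that assumption the written image layout ends at 4536 MiB while the virtual disk is 25 GiB, which matches the picture of a roughly 4.4 GiB image sitting at the start of a larger disk.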

tnsasse commented 7 years ago

Issue solved. I realized that the boot loop results from the boot priority being iPXE over HDD, so it is no wonder the system never booted into its installation. To fix this, I removed my DHCP line pointing to matchbox and waited for the next reboot to go through. Maybe this helps other people.

dghubble commented 7 years ago

I'm not sure if this was published when you were testing, but Bare-Metal Requirements (under Machines) now lists the requirement that machines be configured to favor boot from disk and that IPMI or some other remote management be used to request a PXE boot when a fresh re-provision is desired. This may be improved in the future.
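
As a rough illustration of "request a PXE boot when a fresh re-provision is desired", the sketch below uses `ipmitool` to set a one-time PXE boot and power cycle a machine, leaving the normal disk-first boot order in place. The BMC host name, user name, and the `DRY_RUN` wrapper are hypothetical and not part of Tectonic or Matchbox; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical BMC connection details; replace with real values.
BMC_HOST="${BMC_HOST:-bmc.example.com}"
BMC_USER="${BMC_USER:-admin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

request_pxe_reprovision() {
  # -E makes ipmitool read the password from the IPMI_PASSWORD variable.
  # Without options=persistent, bootdev applies to the next boot only.
  run ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E chassis bootdev pxe
  run ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E chassis power cycle
}

request_pxe_reprovision
```

To run it for real, export `IPMI_PASSWORD` and set `DRY_RUN=0`. Virtual machines such as the XenServer guests in this issue usually have no BMC, so there the boot order would have to be changed through the hypervisor instead.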

Upstream Matchbox has some examples of clusters which PXE boot every time, and it's possible to PXE boot worker instances diskless, but for now Tectonic supports and recommends simple disk installs.