Closed saschagrunert closed 3 months ago
I ran into different problems with the bootstrap node on reboot, https://github.com/equinix/terraform-metal-openshift-on-baremetal/issues/10 - these were not OS boot related.
What device plan were you using, @saschagrunert ?
@displague do you mean the machine type? I recently tried c3.small as well as c3.medium (Dell R6515) and in both cases the nodes encounter a black screen followed by a reboot after the coreos boot screen selector. The iPXE installation exits with a success indicator.
So I assume it’s something in the kernel boot parameters. 🤔
I'm now encountering the very same issue while trying to spin up an OCP 4.9 cluster on Equinix Metal in the DC13 facility. The Out-of-Band console shows a black screen for all servers (except the lb), or is stuck at:
$ ssh d186e620-c6ba-44c1-88e2-b6de75031825@sos.dc13.platformequinix.com
[SOS Session Ready. Use ~? for help.]
[Note: You may need to press RETURN or Ctrl+L to get a prompt.]
This is the bootstrap.ipxe configuration that is being used:
#!ipxe
set release 4.9
set zstream 0
set arch x86_64
set coreos-url http://147.28.129.183:8080
set coreos-img ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-rootfs.${arch}.img
set console console=ttyS1,115200n8
kernel ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-kernel-${arch} initrd=main coreos.live.rootfs_url=${coreos-img} coreos.inst.install_dev=sda coreos.inst.ignition_url=http://147.28.129.183:8080/bootstrap-append.ign ${console} console=tty0 console=ttyS0,115200n8 ip=dhcp
initrd --name main ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-initramfs.${arch}.img
boot
Please advise, Thanks
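For comparing the script against what the web server actually holds, the variable substitution in bootstrap.ipxe can be reproduced by hand. The following is a minimal sketch in shell; the function name `rhcos_artifacts` is hypothetical, and it only mirrors the naming pattern visible in the `kernel`, `initrd` and `coreos-img` lines above.

```shell
#!/bin/sh
# Print the three RHCOS live artifact names that bootstrap.ipxe requests,
# following the pattern used by its kernel/initrd/coreos-img lines.
# rhcos_artifacts is a hypothetical helper, not part of the Terraform module.
rhcos_artifacts() {
    release=$1   # e.g. 4.9
    zstream=$2   # e.g. 0
    arch=$3      # e.g. x86_64
    prefix="rhcos-${release}.${zstream}-${arch}-live"
    printf '%s\n' \
        "${prefix}-kernel-${arch}" \
        "${prefix}-initramfs.${arch}.img" \
        "${prefix}-rootfs.${arch}.img"
}
```

Calling `rhcos_artifacts 4.9 0 x86_64` lists the names to look for under the `coreos-url` address; each one can be appended to that URL and probed with `curl -I` to see whether the server returns it.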
I see that console is repeated in the kernel args there, on both ttyS0 and ttyS1. That doesn't sound right, but I don't know whether it should cause a problem (other than logs from the "getty").
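To make the repetition concrete: once `${console}` expands, the kernel line carries three `console=` arguments. As far as the kernel's serial-console documentation describes it, the last one listed is the device behind /dev/console, which here would be ttyS0 rather than ttyS1. Whether that explains the black screen is not established in this thread. A small sketch (hypothetical helper names) for listing them:

```shell
#!/bin/sh
# Print every console= value found in a kernel command line, in order.
# Relies on word splitting of the unquoted argument, which is fine for
# kernel arguments that contain no glob characters.
list_consoles() {
    for arg in $1; do
        case $arg in
            console=*) printf '%s\n' "${arg#console=}" ;;
        esac
    done
}

# Print only the last console= value on the command line.
last_console() {
    list_consoles "$1" | tail -n 1
}
```

On a booted node the same functions can be fed the live command line, for example `last_console "$(cat /proc/cmdline)"`.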
I'm also getting black screens on the SoS console for the control plane and worker nodes. SSH is not responsive either.
I see that the control plane nodes are configured with an iPXE script URL of http://{lb-0 address}:8080/master.ipxe, and this URL returns a 404.
The iPXE scripts are not found on the bastion node (/usr/share/nginx/html/*.ipxe).
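A quick way to confirm which files are absent before the nodes try to boot is to check the web root on the bastion directly. This is a minimal sketch; `missing_files` is a hypothetical helper, and the list of names to pass in is whatever the deployment is expected to serve.

```shell
#!/bin/sh
# Print each expected file that is absent from the given directory.
# Returns non-zero if at least one file is missing.
missing_files() {
    dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -f "${dir}/${name}" ]; then
            printf '%s\n' "$name"
            status=1
        fi
    done
    return $status
}
```

For example, `missing_files /usr/share/nginx/html bootstrap.ipxe master.ipxe worker.ipxe` prints nothing and returns 0 only when all three scripts are in place.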
From an empty state, I applied the following individually:
terraform apply -target 'module.bastion.null_resource.ignition_append_files["master"]'
terraform apply -target 'module.bastion.null_resource.ipxe_files'
terraform apply -target 'module.bastion.null_resource.ocp_install_ignition'
With this approach, /usr/share/nginx/html/ contained the files that were not present in the previous pass. I didn't target all of the files that are supposed to be in this directory.
I then ran a full terraform apply and observed the following warnings or errors:
module.prepare_openshift.null_resource.ocp_installer: Creating...
module.prepare_openshift.null_resource.ocp_installer: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_installer: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_installer (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_installer (remote-exec): gzip: stdin: not in gzip format
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Child returned status 1
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Error is not recoverable: exiting now
module.prepare_openshift.null_resource.ocp_installer (remote-exec): gzip: stdin: not in gzip format
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Child returned status 1
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Error is not recoverable: exiting now
module.prepare_openshift.null_resource.ocp_installer (remote-exec): cp: cannot stat ‘oc’: No such file or directory
module.prepare_openshift.null_resource.ocp_installer: Creation complete after 2s [id=894906645089249242]
module.prepare_openshift.null_resource.ocp_pullsecret: Creating...
module.prepare_openshift.null_resource.ocp_pullsecret: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_pullsecret: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_pullsecret (remote-exec): (output suppressed due to sensitive value in config)
module.prepare_openshift.null_resource.ocp_pullsecret (remote-exec): (output suppressed due to sensitive value in config)
module.prepare_openshift.null_resource.ocp_pullsecret: Creation complete after 3s [id=2213855489334895954]
module.prepare_openshift.data.template_file.installer_config: Reading...
module.prepare_openshift.data.template_file.installer_config: Read complete after 0s [id=a974cd42f9ea73442c095c53a67da5fd85a58726cde2a4f1aaa6a6d05324e2d6]
module.prepare_openshift.null_resource.ocp_install_config: Creating...
module.prepare_openshift.null_resource.ocp_install_config: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_install_config: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_install_config (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_install_config (remote-exec): Connected!
module.prepare_openshift.null_resource.ocp_install_config: Creation complete after 2s [id=2694546928241472463]
module.prepare_openshift.null_resource.ocp_install_manifests: Creating...
module.prepare_openshift.null_resource.ocp_install_manifests: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): Connected!
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): /tmp/terraform_1825922558.sh: line 6: /tmp/artifacts/openshift-install: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): sed: can't read /tmp/artifacts/install/manifests/cluster-scheduler-02-config.yml: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): /tmp/terraform_1825922558.sh: line 8: /tmp/artifacts/openshift-install: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): cp: cannot stat ‘/tmp/artifacts/install/*.ign’: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): 506 Cannot talk to daemon
module.prepare_openshift.null_resource.ocp_install_manifests: Still creating... [10s elapsed]
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): 200 OK
module.prepare_openshift.null_resource.ocp_install_manifests: Creation complete after 12s [id=5925476240016676298]
null_resource.get_kubeconfig: Creating...
null_resource.get_kubeconfig: Provisioning with 'local-exec'...
null_resource.get_kubeconfig (local-exec): Executing: ["/bin/sh" "-c" "mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /Users/marques/.ssh/id_rsa_mos-v8nqj root@139.178.84.39:/tmp/artifacts/install/auth/* ./auth/"]
module.openshift_controlplane.metal_device.node[2]: Creating...
module.openshift_workers.metal_device.node[0]: Creating...
module.openshift_workers.metal_device.node[1]: Creating...
module.openshift_controlplane.metal_device.node[1]: Creating...
module.openshift_controlplane.metal_device.node[0]: Creating...
module.openshift_bootstrap.metal_device.node[0]: Creating...
null_resource.get_kubeconfig (local-exec): Warning: Permanently added '139.178.84.39' (ED25519) to the list of known hosts.
null_resource.get_kubeconfig (local-exec): scp: /tmp/artifacts/install/auth/*: No such file or directory
module.openshift_controlplane.metal_device.node[2]: Still creating... [10s elapsed]
The control plane nodes do not seem to be accessible again.
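The "gzip: stdin: not in gzip format" lines in the log above typically mean that the file handed to tar is not a gzip archive at all, for instance an error page saved under the tarball's name. The log does not show what was actually downloaded, so this is only one plausible reading. A hedged sketch of a guard that verifies an archive before extracting it (both function names are hypothetical and not part of the module):

```shell
#!/bin/sh
# Return 0 only if the file passes gzip's built-in integrity test.
is_valid_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Extract a .tar.gz into a destination directory, but only after
# verifying it; otherwise report the problem on stderr and fail.
safe_extract() {
    archive=$1
    dest=$2
    if ! is_valid_gzip "$archive"; then
        printf 'not a valid gzip archive: %s\n' "$archive" >&2
        return 1
    fi
    tar -xzf "$archive" -C "$dest"
}
```

Failing early at this point would stop the run before the later steps report a missing openshift-install binary and a missing /tmp/artifacts/install/auth directory.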
When I ran terraform with:
ocp_version=4.9
ocp_version_zstream=0
or:
ocp_version=4.8
ocp_version_zstream=14
which correspond to RHCOS 4.9.0 and 4.8.14 respectively, the /usr/share/nginx/html folder on the bastion/lb host was populated with these files:
[root@lb-0 ~]# ll /usr/share/nginx/html/
total 1000852
-rwxr-xr-x. 1 root root 3971 Oct 7 2019 404.html
-rwxr-xr-x. 1 root root 4020 Oct 7 2019 50x.html
-rwxr-xr-x. 1 root root 1175 Jan 11 11:41 bootstrap-append.ign
-rwxr-xr-x. 1 root root 606 Jan 11 11:41 bootstrap.ipxe
-rwxr-xr-x. 1 root root 1172 Jan 11 11:41 master-append.ign
-rwxr-xr-x. 1 root root 603 Jan 11 11:41 master.ipxe
-rwxr-xr-x. 1 root root 89362572 Jan 11 11:41 rhcos-4.8.14-x86_64-live-initramfs.x86_64.img
-rwxr-xr-x. 1 root root 10030448 Jan 11 11:41 rhcos-4.8.14-x86_64-live-kernel-x86_64
-rwxr-xr-x. 1 root root 925434368 Jan 11 11:41 rhcos-4.8.14-x86_64-live-rootfs.x86_64.img
-rwxr-xr-x. 1 root root 1172 Jan 11 11:41 worker-append.ign
-rwxr-xr-x. 1 root root 603 Jan 11 11:41 worker.ipxe
as expected, and the files are indeed accessible, e.g. http://147.28.129.183:8080/bootstrap.ipxe
I've also seen the following error when copying from /tmp/artifacts/install/auth/* to the local host. What should populate that folder?
Error: Error running command 'mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /home/ocohen/.ssh/id_rsa_metal-4cfi5 root@147.28.129.183:/tmp/artifacts/install/auth/* ./auth/': exit status 1. Output: Warning: Permanently added '147.28.129.183' (ECDSA) to the list of known hosts.
scp: /tmp/artifacts/install/auth/*: No such file or directory
@displague I had 404s when running terraform apply sequentially without a git clean -fdx in between. I think it does not download the image correctly when run multiple times. From a clean state it always downloads the image, and I never encountered 404 errors. :shrug:
It looks like #20 will be addressing some of these concerns.
A number of the problems discussed here have been previously resolved in #20.
Error: Error running command 'mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /home/ocohen/.ssh/id_rsa_metal-4cfi5 root@147.28.129.183:/tmp/artifacts/install/auth/* ./auth/': exit status 1. Output: Warning: Permanently added '147.28.129.183' (ECDSA) to the list of known hosts. scp: /tmp/artifacts/install/auth/*: No such file or directory
This was experienced and fixed in #31
Hey, I have the issue that the bootstrap node is no longer bootable. The installation seems to be fine, then the machine restarts from the iPXE process. The reboot then gets stuck with:
I tried provisioning in multiple facilities (da11, ams6, fra1) without any success.
It also happens that the CoreOS boot screen of GRUB appears, but then the screen turns black on the out-of-band console. The machine responds to ping, but no service such as SSH is reachable.