Bootstrap issues with 4.9.0 image

saschagrunert commented 2 years ago

Hey, I have the issue that the bootstrap node is not bootable any more. The installation seems to be fine, and the machine then restarts from the iPXE process. The reboot then gets stuck with:

Booting from Hard drive C:
..
error: ../../grub-core/disk/i386/pc/biosdisk.c:498:failure reading sector 0x0
from `cd'.

I tried provisioning multiple facilities (da11, ams6, fra1) without any success.

Sometimes the CoreOS GRUB boot screen does appear, but the screen then turns black in the out-of-band console. The machine responds to ping, but no service such as SSH is reachable.

displague commented 2 years ago

I ran into different problems with the bootstrap node on reboot, https://github.com/equinix/terraform-metal-openshift-on-baremetal/issues/10 - these were not OS boot related.

What device plan were you using, @saschagrunert ?

saschagrunert commented 2 years ago

@displague do you mean the machine type? I recently tried c3.small as well as c3.medium (Dell R6515) and in both cases the nodes encounter a black screen followed by a reboot after the coreos boot screen selector. The iPXE installation exits with a success indicator.

So I assume it’s something in the kernel boot parameters. 🤔

orenc1 commented 2 years ago

I'm now encountering the very same issue while trying to spin up an OCP 4.9 cluster on Equinix Metal in the DC13 facility. The out-of-band console shows a black screen for all servers (except the lb), or gets stuck at:

$ ssh d186e620-c6ba-44c1-88e2-b6de75031825@sos.dc13.platformequinix.com
[SOS Session Ready. Use ~? for help.]
[Note: You may need to press RETURN or Ctrl+L to get a prompt.]

This is the bootstrap.ipxe configuration that is being used:

#!ipxe

set release 4.9
set zstream 0
set arch x86_64
set coreos-url http://147.28.129.183:8080
set coreos-img ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-rootfs.${arch}.img
set console console=ttyS1,115200n8

kernel ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-kernel-${arch} initrd=main coreos.live.rootfs_url=${coreos-img} coreos.inst.install_dev=sda coreos.inst.ignition_url=http://147.28.129.183:8080/bootstrap-append.ign ${console} console=tty0 console=ttyS0,115200n8 ip=dhcp
initrd --name main ${coreos-url}/rhcos-${release}.${zstream}-${arch}-live-initramfs.${arch}.img
boot

Please advise, Thanks

displague commented 2 years ago

I see the console is repeated in the kernel args there - on both ttyS0 and ttyS1. That doesn't sound right, but I don't know if that should create a problem (other than logs from the "getty")
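
To see exactly which console arguments end up on the kernel command line, a small helper like the one below may be useful. It is only a sketch: `list_console_args` is a hypothetical name, it reads an iPXE script on stdin, and it expands only simple `set NAME VALUE` variables that appear as a whole argument (such as `${console}`). For the script above it should report three entries (ttyS1, tty0, ttyS0). As far as I know, the kernel accepts several `console=` arguments and uses the last one as `/dev/console`, so the order matters.

```shell
# Print every console= argument on the kernel line of an iPXE script (stdin).
# Assumption: variables are defined with "set NAME VALUE" before the kernel line.
list_console_args() {
  awk '
    $1 == "set" {
      name = $2
      $1 = ""; $2 = ""
      sub(/^ +/, "")          # drop the blanks left by the two cleared fields
      vars[name] = $0
      next
    }
    $1 == "kernel" {
      for (i = 2; i <= NF; i++) {
        arg = $i
        # Expand an argument of the form ${name} if the name was set earlier.
        if (substr(arg, 1, 2) == "${" && substr(arg, length(arg)) == "}") {
          key = substr(arg, 3, length(arg) - 3)
          if (key in vars) arg = vars[key]
        }
        if (arg ~ /^console=/) print arg
      }
    }
  '
}
```

Usage: `list_console_args < /usr/share/nginx/html/bootstrap.ipxe`.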

displague commented 2 years ago

I'm also getting black screens on the SoS console for the control plane and worker nodes. SSH is not responsive either.

I see that the control plane nodes are configured with an IPXE Script URL of http://{lb-0 address}:8080/master.ipxe - this URL 404s

displague commented 2 years ago

The ipxe scripts are not found on the bastion node (/usr/share/nginx/html/*.ipxe)
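
A quick way to confirm this on the bastion is to check for the per-role scripts directly. The sketch below assumes the three roles seen elsewhere in this thread (bootstrap, master, worker) and the nginx document root mentioned above; `missing_ipxe_files` is a hypothetical helper name, not part of the Terraform module.

```shell
# Print the name of each expected iPXE script that is absent from the web root.
# Assumption: one <role>.ipxe file per role, served from the nginx html directory.
missing_ipxe_files() {
  dir=${1:-/usr/share/nginx/html}
  for role in bootstrap master worker; do
    [ -f "$dir/$role.ipxe" ] || echo "$role.ipxe"
  done
}
```

Empty output means all three scripts are present; any name printed would explain a 404 for that role's iPXE Script URL.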

displague commented 2 years ago

From an empty state, I applied the following individually

terraform apply -target 'module.bastion.null_resource.ignition_append_files["master"]'
terraform apply -target 'module.bastion.null_resource.ipxe_files'
terraform apply -target 'module.bastion.null_resource.ocp_install_ignition'

With this approach, /usr/share/nginx/html/ contained the files that were not present in the previous pass. I didn't target all of the files that are supposed to be in this directory.

I then ran a full terraform apply and observed the following warnings or errors:

module.prepare_openshift.null_resource.ocp_installer: Creating...
module.prepare_openshift.null_resource.ocp_installer: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_installer: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_installer (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_installer (remote-exec): gzip: stdin: not in gzip format
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Child returned status 1
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Error is not recoverable: exiting now

module.prepare_openshift.null_resource.ocp_installer (remote-exec): gzip: stdin: not in gzip format
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Child returned status 1
module.prepare_openshift.null_resource.ocp_installer (remote-exec): tar: Error is not recoverable: exiting now
module.prepare_openshift.null_resource.ocp_installer (remote-exec): cp: cannot stat ‘oc’: No such file or directory
module.prepare_openshift.null_resource.ocp_installer: Creation complete after 2s [id=894906645089249242]
module.prepare_openshift.null_resource.ocp_pullsecret: Creating...
module.prepare_openshift.null_resource.ocp_pullsecret: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_pullsecret: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_pullsecret (remote-exec): (output suppressed due to sensitive value in config)
module.prepare_openshift.null_resource.ocp_pullsecret (remote-exec): (output suppressed due to sensitive value in config)
module.prepare_openshift.null_resource.ocp_pullsecret: Creation complete after 3s [id=2213855489334895954]
module.prepare_openshift.data.template_file.installer_config: Reading...
module.prepare_openshift.data.template_file.installer_config: Read complete after 0s [id=a974cd42f9ea73442c095c53a67da5fd85a58726cde2a4f1aaa6a6d05324e2d6]
module.prepare_openshift.null_resource.ocp_install_config: Creating...
module.prepare_openshift.null_resource.ocp_install_config: Provisioning with 'file'...
module.prepare_openshift.null_resource.ocp_install_config: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_install_config (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_install_config (remote-exec): Connected!
module.prepare_openshift.null_resource.ocp_install_config: Creation complete after 2s [id=2694546928241472463]
module.prepare_openshift.null_resource.ocp_install_manifests: Creating...
module.prepare_openshift.null_resource.ocp_install_manifests: Provisioning with 'remote-exec'...
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): Connecting to remote host via SSH...
...
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): Connected!
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): /tmp/terraform_1825922558.sh: line 6: /tmp/artifacts/openshift-install: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): sed: can't read /tmp/artifacts/install/manifests/cluster-scheduler-02-config.yml: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): /tmp/terraform_1825922558.sh: line 8: /tmp/artifacts/openshift-install: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): cp: cannot stat ‘/tmp/artifacts/install/*.ign’: No such file or directory
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): 506 Cannot talk to daemon
module.prepare_openshift.null_resource.ocp_install_manifests: Still creating... [10s elapsed]
module.prepare_openshift.null_resource.ocp_install_manifests (remote-exec): 200 OK
module.prepare_openshift.null_resource.ocp_install_manifests: Creation complete after 12s [id=5925476240016676298]
null_resource.get_kubeconfig: Creating...
null_resource.get_kubeconfig: Provisioning with 'local-exec'...
null_resource.get_kubeconfig (local-exec): Executing: ["/bin/sh" "-c" "mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /Users/marques/.ssh/id_rsa_mos-v8nqj root@139.178.84.39:/tmp/artifacts/install/auth/* ./auth/"]
module.openshift_controlplane.metal_device.node[2]: Creating...
module.openshift_workers.metal_device.node[0]: Creating...
module.openshift_workers.metal_device.node[1]: Creating...
module.openshift_controlplane.metal_device.node[1]: Creating...
module.openshift_controlplane.metal_device.node[0]: Creating...
module.openshift_bootstrap.metal_device.node[0]: Creating...
null_resource.get_kubeconfig (local-exec): Warning: Permanently added '139.178.84.39' (ED25519) to the list of known hosts.
null_resource.get_kubeconfig (local-exec): scp: /tmp/artifacts/install/auth/*: No such file or directory
module.openshift_controlplane.metal_device.node[2]: Still creating... [10s elapsed]

Once again, the control plane nodes do not seem to be accessible.
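
The `gzip: stdin: not in gzip format` lines in the log above suggest that whatever was downloaded for the installer and client archives was not a gzip file. One plausible cause, which is an assumption and not confirmed in this thread, is that the download returned an error page instead of the tarball. A minimal sketch for checking a downloaded file before extracting it (the helper name `is_gzip` is hypothetical):

```shell
# Succeed only if the file starts with the gzip magic bytes 0x1f 0x8b.
is_gzip() {
  magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b" ]
}
```

Usage: `is_gzip archive.tar.gz || echo "not a gzip archive" >&2`, where `archive.tar.gz` stands for whichever file the provisioner fetched.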

orenc1 commented 2 years ago

When I ran terraform with:

ocp_version=4.9
ocp_version_zstream=0

or:

ocp_version=4.8
ocp_version_zstream=14

which correspond to RHCOS 4.9.0 and 4.8.14 respectively, the /usr/share/nginx/html folder on the bastion/lb host was populated with these files:

[root@lb-0 ~]# ll /usr/share/nginx/html/
total 1000852
-rwxr-xr-x. 1 root root      3971 Oct  7  2019 404.html
-rwxr-xr-x. 1 root root      4020 Oct  7  2019 50x.html
-rwxr-xr-x. 1 root root      1175 Jan 11 11:41 bootstrap-append.ign
-rwxr-xr-x. 1 root root       606 Jan 11 11:41 bootstrap.ipxe
-rwxr-xr-x. 1 root root      1172 Jan 11 11:41 master-append.ign
-rwxr-xr-x. 1 root root       603 Jan 11 11:41 master.ipxe
-rwxr-xr-x. 1 root root  89362572 Jan 11 11:41 rhcos-4.8.14-x86_64-live-initramfs.x86_64.img
-rwxr-xr-x. 1 root root  10030448 Jan 11 11:41 rhcos-4.8.14-x86_64-live-kernel-x86_64
-rwxr-xr-x. 1 root root 925434368 Jan 11 11:41 rhcos-4.8.14-x86_64-live-rootfs.x86_64.img
-rwxr-xr-x. 1 root root      1172 Jan 11 11:41 worker-append.ign
-rwxr-xr-x. 1 root root       603 Jan 11 11:41 worker.ipxe

as expected, and the files are indeed accessible, e.g. http://147.28.129.183:8080/bootstrap.ipxe
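
To compare a listing like this against what the iPXE script will request, the expected artifact names can be derived from the version variables. This is a sketch based only on the naming pattern visible in the bootstrap.ipxe script and the listing in this thread; `rhcos_artifacts` is a hypothetical helper, and the default architecture is an assumption.

```shell
# Print the kernel, initramfs and rootfs file names for a given RHCOS version.
# Arguments: release (e.g. 4.8), zstream (e.g. 14), arch (defaults to x86_64).
rhcos_artifacts() {
  release=$1
  zstream=$2
  arch=${3:-x86_64}
  prefix="rhcos-$release.$zstream-$arch-live"
  printf '%s\n' \
    "$prefix-kernel-$arch" \
    "$prefix-initramfs.$arch.img" \
    "$prefix-rootfs.$arch.img"
}
```

For example, `rhcos_artifacts 4.9 0` prints the three names that a script with `set release 4.9` and `set zstream 0` would fetch, which can then be checked against the directory contents.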

I've also seen the following error on the local host when copying from /tmp/artifacts/install/auth/*. What should populate that folder?

Error: Error running command 'mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /home/ocohen/.ssh/id_rsa_metal-4cfi5 root@147.28.129.183:/tmp/artifacts/install/auth/* ./auth/': exit status 1. Output: Warning: Permanently added '147.28.129.183' (ECDSA) to the list of known hosts.
scp: /tmp/artifacts/install/auth/*: No such file or directory

saschagrunert commented 2 years ago

@displague I had 404's when running terraform apply sequentially without a git clean -fdx in between. I think it does not download the image correctly when running multiple times. From a clean state it always downloads the image and I never encountered 404 errors. :shrug:

displague commented 1 year ago

It looks like #20 will be addressing some of these concerns.

displague commented 3 months ago

A number of the problems discussed here have been previously resolved in #20.

Error: Error running command 'mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /home/ocohen/.ssh/id_rsa_metal-4cfi5 root@147.28.129.183:/tmp/artifacts/install/auth/* ./auth/': exit status 1. Output: Warning: Permanently added '147.28.129.183' (ECDSA) to the list of known hosts.
scp: /tmp/artifacts/install/auth/*: No such file or directory

This was experienced and fixed in #31

equinix / terraform-equinix-metal-openshift-on-baremetal

Bootstrap issues with 4.9.0 image #12