coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Latest next VMWare OVA Fails To Boot #1802

Open fifofonix opened 1 week ago

fifofonix commented 1 week ago

Describe the bug

When launching the Fedora 41 next OVA (without an Ignition config) in VMWare Workstation on Windows, the VM fails to boot with the message "The firmware encountered an unexpected exception. The virtual machine cannot boot." When using the testing Fedora 40 OVA, the VM boots to a login prompt without issue.

Separately, CI/CD scripts that deploy the same OVAs via OpenTofu to server-side VMWare vSphere infrastructure also fail, although without such a message. In the server deployment case the VMs are listed in vSphere but remain in an 'off' status, and any power-on attempt leaves them 'off'. No console or error messages are produced. Again, the same projects deploy just fine with testing.

Reproduction steps

  1. Download OVA (e.g. via coreos-installer, as sketched after this list)
  2. Attempt to launch via VMWare Workstation
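
For step 1, one way to fetch the OVA is coreos-installer's download command. This is just a sketch; the stream/platform/format flags below assume the standard FCOS artifact layout for the next stream.

# Download (and signature-verify) the next-stream VMWare OVA; add -C <dir> to choose a destination.
coreos-installer download --stream next --platform vmware --format ova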

Expected behavior

VM should boot to login as it does for prior FCOS versions

Actual behavior

As described above.

System details

Butane or Ignition config

None

Additional information

[Screenshot: VMWare Workstation dialog showing the firmware exception error described above]

dustymabe commented 1 week ago

So there are no messages at all on the console of the VMWare machines? Does the VM even attempt to boot, or is something happening at the VMWare level that prevents it from starting at all?

What happens if you boot a testing machine, but rebase it to next?
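
For reference, the rebase being suggested here is the standard stream switch. A minimal sketch; the ref matches the rpm-ostree status output shown later in this thread.

# On a machine booted from the testing image, move the deployment to the next stream.
sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next
# Reboot into the rebased deployment.
sudo systemctl reboot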

fifofonix commented 6 days ago

No console/boot messages at all, so it seems like there is something wrong with the OVA.

For all intents and purposes the VM in vSphere looks the same as a testing one, i.e. the same VMWare virtual machine version number.

Rebasing a testing machine to next works fine.

Also, I have re-confirmed today that the OVA deployment issue exists with the very latest next, i.e. 41.20240922.1.0.

dustymabe commented 5 days ago

Also, I have re-confirmed today that the OVA deployment issue exists with the very latest next, i.e. 41.20240922.1.0.

Can you also confirm it DOES NOT exist with the latest testing: 40.20240920.2.0?

fifofonix commented 5 days ago

Confirmed. The overnight testing CI/CD run deployed a canary VM without issues.

dustymabe commented 5 days ago

We use the exact same build container to build testing and next so there should be no difference in how the OVA is constructed. That would indicate to me there is a problem inside the OS (i.e. kernel, grub, or something), but rebasing from testing to next would test that theory and you said that rebasing works too.

I'm really not sure. I would expect something to come across the console that we could use to investigate, but you say there is nothing there either :(

dustymabe commented 5 days ago

That would indicate to me there is a problem inside the OS (i.e. kernel, grub, or something), but rebasing from testing to next would test that theory and you said that rebasing works too.

Ahh, rebasing from testing to next wouldn't update the bootloader that's installed.

Can you run sudo bootupctl update on that rebased system and then reboot to see if it then fails to boot?
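
Spelled out, the check being requested is roughly the following. A sketch using only the bootupctl subcommands that appear later in this thread.

# Show installed vs. available bootloader versions on the rebased system.
sudo bootupctl status
# Write the newer GRUB/shim payloads to the BIOS and EFI locations on disk.
sudo bootupctl update
# Reboot to see whether the machine still comes up with the updated bootloader.
sudo systemctl reboot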

fifofonix commented 4 days ago

This replicates the issue: the node fails to reboot, and also fails to come up when a manual power-on signal is given via the vSphere console. For the record, this was the output I got when applying bootupctl update. Hopefully this means you can narrow in on what the issue is?

me@t-canary-vm:~$ sudo bootupctl update
Running as unit: bootupd.service
Previous BIOS: grub2-tools-1:2.06-123.fc40.x86_64
Updated BIOS: grub2-tools-1:2.12-4.fc41.x86_64
Previous EFI: grub2-efi-x64-1:2.06-123.fc40.x86_64,shim-x64-15.8-3.x86_64
Updated EFI: grub2-efi-x64-1:2.12-4.fc41.x86_64,shim-x64-15.8-3.x86_64

dustymabe commented 1 day ago

Thanks @fifofonix. I've got a few more questions (sorry!).

I've had at least one person report that installing Fedora Server 41 beta seems to work OK, so maybe it's not GRUB and instead the way we've created the disk image itself (in the OVA). Is there a way you could try the "bare metal install" workflow using our ISO image (or PXE)? This would isolate whether the problem is the specific package set (i.e. GRUB 2.12, which we previously suspected) or the built disk image.
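
If it helps, a minimal sketch of that install-to-disk flow from a booted live ISO; the target device /dev/sda is a placeholder for your environment.

# Install FCOS to the target disk; optionally pass --ignition-file <path> to provision users.
sudo coreos-installer install /dev/sda
# Reboot into the installed system.
sudo systemctl reboot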

hrismarin commented 11 hours ago

Is there a way you could try the "bare metal install" workflow using our ISO image (or PXE)?

At least on my side, a bare metal install using the ISO image works.

$ sudo rpm-ostree status 
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Tue 2024-10-01 07:18:06 UTC)
Deployments:
● fedora:fedora/x86_64/coreos/next
                  Version: 41.20240922.1.0 (2024-09-23T17:19:23Z)
                   Commit: 9193342bf66c4b38fbf49d1d59af8a4e3f0c8ca4cb9d674ad3ba9713eea798c9
             GPGSignature: Valid signature by 466CF2D8B60BC3057AA9453ED0622462E99D6AD1

bootupd also seems to work, and the system boots after the following commands.

core@fcos-next:~$ sudo bootupctl -vvvvvvv status
[TRACE bootupd] executing cli
Running as unit: bootupd.service
[TRACE bootupd] executing cli
[TRACE bootupd::bootupd] Gathering status for installed component: BIOS
[TRACE bootupd::bootupd] Gathering status for installed component: EFI
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::bootupd] Remaining known components: 0
Component BIOS
  Installed: grub2-tools-1:2.12-4.fc41.x86_64
  Update: At latest version
Component EFI
  Installed: grub2-efi-x64-1:2.12-4.fc41.x86_64,shim-x64-15.8-3.x86_64
  Update: At latest version
No components are adoptable.
CoreOS aleph version: 41.20240922.1.0
Boot method: BIOS
core@fcos-next:~$ sudo bootupctl -vvvvvvv update
[TRACE bootupd] executing cli
Running as unit: bootupd.service
[TRACE bootupd] executing cli
[TRACE bootupd::bootupd] Gathering status for installed component: BIOS
[TRACE bootupd::bootupd] Gathering status for installed component: EFI
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::bootupd] Remaining known components: 0
No update available for any component.
core@fcos-next:~$ sudo bootupctl -vvvvvvv validate
[TRACE bootupd] executing cli
Running as unit: bootupd.service
[TRACE bootupd] executing cli
[TRACE bootupd::bootupd] Gathering status for installed component: BIOS
[TRACE bootupd::bootupd] Gathering status for installed component: EFI
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::bootupd] Remaining known components: 0
Skipped: BIOS
[DEBUG bootupd::efi] Mounted at "/boot/efi"
[DEBUG bootupd::efi] Unmounting
[TRACE bootupd::efi] Unmounted
Validated: EFI

dustymabe commented 5 hours ago

At least on my side bare metal install using ISO image works.

Are you on VMWare?

fifofonix commented 53 minutes ago

Booting the aarch64 live ISO on VMWare Fusion shows the GRUB prompt and goes through to the live bash prompt. Is this sufficient to prove that GRUB is not the issue, or do I need to install to disk to complete this test?

Note this is slightly different from the original issue, which was reported for x86. Do I need to find an old Mac to test the x86 live ISO too?

dustymabe commented 44 minutes ago

Booting the aarch64 live ISO on VMWare Fusion shows the GRUB prompt and goes through to the live bash prompt. Is this sufficient to prove that GRUB is not the issue, or do I need to install to disk to complete this test?

Note this is slightly different from the original issue, which was reported for x86. Do I need to find an old Mac to test the x86 live ISO too?

Yeah, it would be nice not to switch out the architecture. Sorry, I just thought you had VMWare infra (other than your laptop) where you could run a test. It would be nice if we could run the test on the same architecture and same infra where you hit the original failures. I think that would be x86_64, and yes, preferably a full install to disk + reboot.

fifofonix commented 2 minutes ago

Had a colleague run the x86 ISO, install to disk, and reboot on VMWare Workstation, and everything went well. This is an environment where deploying the OVA fails.