coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/
263 stars 60 forks source link

Kola aarch64 UEFI sometimes fails during test reboot #1014

Closed jlebon closed 2 years ago

jlebon commented 2 years ago

rpmostree.install-uninstall.zip

We have some kind of UEFI debug logging going on in the tests, so there's lots of output even in normal boots. We should figure out why that is.

Anyway, on the reboot where we failed, I see this bit is different:

ValidateFvHeader: No Firmware Volume header present
NorFlashFvbInitialize: The FVB Header is not valid.
NorFlashFvbInitialize: Installing a correct one for this volume.
InstallProtocolInterface: D1A86E3F-0707-4C35-83CD-DC2C29C891A3 0
InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B BF69ED08
InstallProtocolInterface: 964E5B21-6459-11D2-8E39-00A0C969723B BF69EC48
InstallProtocolInterface: 8F644FA9-E850-4DB1-9CE2-0B44698E8DA4 BF69ECC0
Ftw: FtwWorkSpaceLba - 0x1, WorkBlockSize  - 0x40000, FtwWorkSpaceBase - 0x0
Ftw: FtwSpareLba     - 0x2, SpareBlockSize - 0x40000
Ftw: NumberOfWorkBlock - 0x1, FtwWorkBlockLba - 0x1
Ftw: WorkSpaceLbaInSpare - 0x0, WorkSpaceBaseInSpare - 0x0
Ftw: Remaining work space size - 3FFE0
Ftw: Work block header check mismatch
Ftw: Work block header check mismatch
Ftw: Both working and spare blocks are invalid, init workspace
Ftw: start to reclaim work space
Ftw: reclaim work space successfully

Also this bit:

ConnectDevicesFromQemu: 1 OpenFirmware device path(s) connected
FSOpen: Open '\EFI\fedora\shimaa64.efi' Success
[Bds] Expand HD(2,GPT,19C6F85F-EBDB-4F6D-9244-A7355FD46348,0x1000,0x3F800)/\EFI\fedora\shimaa64.efi -> PciRoot(0x0)/Pci(0x2,0x0)/HD(2,GPT,19C6F85F-EBDB-4F6D-9244-A7355FD46348,0x1000,0x3F800)/\EFI\fedora\shimaa64.efi
SetBootOrderFromQemu: setting BootOrder: success
[Bds]OsIndication: 0000000000000000
[Bds]=============Begin Load Options Dumping ...=============
  Driver Options:
  SysPrep Options:
  Boot Options:
    Boot0004: Fedora         0x0001
    Boot0002: UEFI Misc Device 2         0x0001
    Boot0000: UiApp          0x0109
    Boot0001: UEFI Misc Device       0x0001
    Boot0003: EFI Internal Shell         0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery       0x0001
[Bds]=============End Load Options Dumping=============

Which on a normal successful boot was:

ConnectDevicesFromQemu: 2 OpenFirmware device path(s) connected
SetBootOrderFromQemu: setting BootOrder: success
[Bds]OsIndication: 0000000000000000
[Bds]=============Begin Load Options Dumping ...=============
  Driver Options:
  SysPrep Options:
  Boot Options:
    Boot0002: UEFI Misc Device 2         0x0001
    Boot0000: UiApp          0x0109
    Boot0001: UEFI Misc Device       0x0001
    Boot0003: EFI Internal Shell         0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery       0x0001
[Bds]=============End Load Options Dumping=============

So there's a new "Fedora" boot entry Boot0004 on the broken boot.

On a normal successful boot, we select Boot0002:

BdsDxe: starting Boot0002 "UEFI Misc Device 2" from PciRoot(0x0)/Pci(0x2,0x0)
[Variable]: Rewritten a preexisting variable(0x00000006) with different attributes(0x00000007) - 605DAB50-E046-4300-ABB6-3DD810DD8B23:MokListRT
[Variable]: Rewritten a preexisting variable(0x00000006) with different attributes(0x00000007) - 605DAB50-E046-4300-ABB6-3DD810DD8B23:MokListXRT
[Variable]: Rewritten a preexisting variable(0x00000006) with different attributes(0x00000007) - 605DAB50-E046-4300-ABB6-3DD810DD8B23:SbatLevelRT
InstallProtocolInterface: 605DAB50-E046-4300-ABB6-3DD810DD8B23 BBC4FA38
FSOpen: Open '\EFI\BOOT\fbaa64.efi' Success
FSOpen: Open '\EFI\BOOT\fbaa64.efi' Success
FSOpen: Open 'EFI' Success
FSOpen: Open 'fedora' Success
FSOpen: Open 'BOOTAA64.CSV' Success
FSOpen: Open '\EFI\fedora\BOOTAA64.CSV' Success
FSOpen: Open '\EFI\fedora\shimaa64.efi' Success
[Security] 3rd party image[0] can be loaded after EndOfDxe: PciRoot(0x0)/Pci(0x2,0x0)/HD(2,GPT,19C6F85F-EBDB-4F6D-9244-A7355FD46348,0x1000,0x3F800)/\EFI\fedora\shimaa64.efi.
InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B BDD3C7C0
Loading driver at 0x000BBAE2000 EntryPoint=0x000BBAE3000
Loading driver at 0x000BBAE2000 EntryPoint=0x000BBAE3000 
InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF BE394918
ProtectUefiImageCommon - 0xBDD3C7C0
  - 0x00000000BBAE2000 - 0x00000000000CA000
SetUefiImageMemoryAttributes - 0x00000000BBAE2000 - 0x0000000000001000 (0x0000000000004008)
SetUefiImageMemoryAttributes - 0x00000000BBAE3000 - 0x0000000000065000 (0x0000000000020008)
SetUefiImageMemoryAttributes - 0x00000000BBB48000 - 0x0000000000064000 (0x0000000000004008)
[Variable]: Rewritten a preexisting variable(0x00000006) with different attributes(0x00000007) - 605DAB50-E046-4300-ABB6-3DD810DD8B23:MokListRT
[Variable]: Rewritten a preexisting variable(0x00000006) with different attributes(0x00000007) - 605DAB50-E046-4300-ABB6-3DD810DD8B23:MokListXRT
[Variable]: Rewritten a preexisting variable(0x00000006) with different attributes(0x00000007) - 605DAB50-E046-4300-ABB6-3DD810DD8B23:SbatLevelRT
InstallProtocolInterface: 605DAB50-E046-4300-ABB6-3DD810DD8B23 BBB5BA38
FSOpen: Open '\EFI\fedora\grubaa64.efi' Success
error: ../../grub-core/term/serial.c:217:serial port `com0' isn't found.

error: ../../grub-core/commands/terminal.c:138:terminal `serial' isn't found.

error: ../../grub-core/commands/terminal.c:138:terminal `serial' isn't found.

Use the ^ and v keys to change the selection.                       

      Press 'e' to edit the selected item, or 'c' for a command prompt.   

      Press Escape to return to the previous menu.                               Fedora CoreOS 35.20211029.2.0 (ostree:0)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       The selected entry will be started automatically in 1s.                        The selected entry will be started automatically in 0s.                     EFI stub: Booting Linux Kernel...
...

Whereas in the broken reboot case we boot into Boot0004:

BdsDxe: starting Boot0004 "Fedora" from HD(2,GPT,19C6F85F-EBDB-4F6D-9244-A7355FD46348,0x1000,0x3F800)/\EFI\fedora\shimaa64.efi

Synchronous Exception at 0x00000000BBC3B398

Synchronous Exception at 0x00000000BBC3B398
PC 0x0000BBC3B398
PC 0x0000BBBD7024
PC 0x0000BF6E1650 (0x0000BF6DA000+0x00007650) [ 1] DxeCore.dll
PC 0x0000BF37BF50 (0x0000BF375000+0x00006F50) [ 2] BdsDxe.dll
PC 0x0000BF37F30C (0x0000BF375000+0x0000A30C) [ 2] BdsDxe.dll
PC 0x0000BF6E4D78 (0x0000BF6DA000+0x0000AD78) [ 3] DxeCore.dll
[ 1] /builddir/build/BUILD/edk2-edk2-stable202008/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll
[ 2] /builddir/build/BUILD/edk2-edk2-stable202008/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Universal/BdsDxe/BdsDxe/DEBUG/BdsDxe.dll
[ 3] /builddir/build/BUILD/edk2-edk2-stable202008/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll

  X0 0x00000000BBBD6000   X1 0x00000000BBBE27F8   X2 0x00000000BBC9F388   X3 0x0000000000000018
  X4 0x0000000000000C78   X5 0x000000000002D488   X6 0x0000000000000000   X7 0x00000000BE31714C
  X8 0x0000000000000000   X9 0x0000000000000018  X10 0x0000000000000002  X11 0x0000000000000018
 X12 0x0000000000000002  X13 0x0000000000000018  X14 0x0000000000000001  X15 0x0000000000000000
 X16 0x00000000BF6D98C0  X17 0x00000000FFFFA6AB  X18 0x0000000000000000  X19 0x0000000000000000
 X20 0x0000000000000000  X21 0x00000000BF38E000  X22 0x0000000000000000  X23 0x0000000000000001
 X24 0x00000000BF38E1A8  X25 0x00000000BF38A712  X26 0x0000000000000001  X27 0x0000000000000000
 X28 0x00000000BF38A85C   FP 0x00000000BF6D98A0   LR 0x00000000BBBD7024  

  V0 0xAFAFAFAFAFAFAFAF AFAFAFAFAFAFAFAF   V1 0x63732F6666666666 6666666666666666
  V2 0x41480A302C30406B 7369642F32406973   V3 0x0000000000000000 0000000000000000
  V4 0x0000000040000000 0000000000000000   V5 0x4010040140100401 4010040140100401
  V6 0x0040000000000000 0040000000000000   V7 0x0000000000000000 0000000000000000
  V8 0x0000000000000000 0000000000000000   V9 0x0000000000000000 0000000000000000
 V10 0x0000000000000000 0000000000000000  V11 0x0000000000000000 0000000000000000
 V12 0x0000000000000000 0000000000000000  V13 0x0000000000000000 0000000000000000
 V14 0x0000000000000000 0000000000000000  V15 0x0000000000000000 0000000000000000
 V16 0x0000000000000000 0000000000000000  V17 0x0000000000000000 0000000000000000
 V18 0x0000000000000000 0000000000000000  V19 0x0000000000000000 0000000000000000
 V20 0x0000000000000000 0000000000000000  V21 0x0000000000000000 0000000000000000
 V22 0x0000000000000000 0000000000000000  V23 0x0000000000000000 0000000000000000
 V24 0x0000000000000000 0000000000000000  V25 0x0000000000000000 0000000000000000
 V26 0x0000000000000000 0000000000000000  V27 0x0000000000000000 0000000000000000
 V28 0x0000000000000000 0000000000000000  V29 0x0000000000000000 0000000000000000
 V30 0x0000000000000000 0000000000000000  V31 0x0000000000000000 0000000000000000

  SP 0x00000000BF6D98A0  ELR 0x00000000BBC3B398  SPSR 0x60000205  FPSR 0x00000000
 ESR 0x9600004F          FAR 0x00000000BBC03488

 ESR : EC 0x25  IL 0x1  ISS 0x0000004F

Data abort: Permission fault, third level

Related to https://github.com/coreos/fedora-coreos-tracker/issues/946 perhaps?

dustymabe commented 2 years ago

Thanks for digging into this @jlebon. Previously I was assuming it was a flake, it seems like it's a real problem that intermittently happens.

bgilbert commented 2 years ago

It's expected that we'll boot through fallback on the first boot and an explicit boot entry on subsequent boots. The differing boot path may be exposing a bootloader bug. The fix for #946 will (for new installs) switch both the first-boot and non-first-boot code paths to a path which is in between those two, so it could fix this problem or could introduce it to the first boot.

dustymabe commented 2 years ago

Saw this again today in multi-arch-pipeline/291.

multipath-day1.zip

dustymabe commented 2 years ago

We have some kind of UEFI debug logging going on in the tests, so there's lots of output even in normal boots. We should figure out why that is.

@martinezjavier said "your OVMF (Open Virtual Machine Firmware) build has some debug output enabled"

Regarding this issue as a whole:

I'll try to work on this next week unless someone else wants to pick it up first.

dustymabe commented 2 years ago

I can't seem to reproduce this and we haven't seen it for a few weeks. Currently I've tried:

COUNT=0
while cosa kola run multipath.day1; do echo -e "\nXXX $COUNT XXX\n"; COUNT=$((COUNT+1)); done

Run 90 times with no failure. I've also tried with more "reboot" tests:

COUNT=0
while cosa kola run --parallel=4 multipath.day1 ext.config.reboot ext.config.var-mount.luks ext.config.var-mount.simple ext.config.root-reprovision.filesystem-only ext.config.root-reprovision.swap-before-root ext.config.root-reprovision.raid1 ext.config.root-reprovision.luks ostree.hotfix; do echo -e "\nXXX $COUNT XXX\n"; COUNT=$((COUNT+1)); done

Which has run 33 times with no failure.

Maybe this was a weird issue that's fixed now or maybe it's some odd interaction with gangplank and we'll see it again. The next time we see it, if we see it, let's attempt the reproducer(s) above to see if we can more consistently get a failure.