Closed by jlebon 2 years ago
Thanks for digging into this @jlebon. Previously I was assuming it was a flake, but it seems like it's a real problem that happens intermittently.
It's expected that we'll boot through fallback on the first boot and an explicit boot entry on subsequent boots. The differing boot path may be exposing a bootloader bug. The fix for #946 will (for new installs) switch both the first-boot and non-first-boot code paths to a path which is in between those two, so it could fix this problem or could introduce it to the first boot.
Saw this again today in multi-arch-pipeline/291.
We have some kind of UEFI debug logging going on in the tests, so there's lots of output even in normal boots. We should figure out why that is.
@martinezjavier said: "your OVMF (Open Virtual Machine Firmware) build has some debug output enabled."
Regarding this issue as a whole, you can run:
mokutil --set-verbosity true
to enable debug output for shim.
I'll try to work on this next week unless someone else wants to pick it up first.
I can't seem to reproduce this, and we haven't seen it for a few weeks. So far I've tried:
COUNT=0
while cosa kola run multipath.day1; do echo -e "\nXXX $COUNT XXX\n"; COUNT=$((COUNT+1)); done
That ran 90 times with no failure. I've also tried with more "reboot" tests:
COUNT=0
while cosa kola run --parallel=4 multipath.day1 ext.config.reboot ext.config.var-mount.luks ext.config.var-mount.simple ext.config.root-reprovision.filesystem-only ext.config.root-reprovision.swap-before-root ext.config.root-reprovision.raid1 ext.config.root-reprovision.luks ostree.hotfix; do echo -e "\nXXX $COUNT XXX\n"; COUNT=$((COUNT+1)); done
That has run 33 times with no failure.
Maybe this was a weird issue that's fixed now, or maybe it's some odd interaction with gangplank and we'll see it again. The next time we see it, if we do, let's run the reproducer(s) above to see whether we can get a failure more consistently.
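For anyone re-running the reproducers, here is a rough sketch of the same loop wrapped in a function that caps the number of runs and reports which run failed. The name `repeat_until_fail` and the run limit are my own additions for illustration, not something cosa or kola provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: repeat a command until it fails or MAX_RUNS is reached.
# Usage: repeat_until_fail MAX_RUNS COMMAND [ARGS...]
repeat_until_fail() {
    local max=$1; shift
    local count=0
    while [ "$count" -lt "$max" ]; do
        # Run the command exactly as given; stop at the first non-zero exit.
        if ! "$@"; then
            echo "FAILED on run $((count + 1)) after $count successful runs"
            return 1
        fi
        count=$((count + 1))
        echo "XXX $count XXX"
    done
    echo "No failure in $count runs"
    return 0
}

# Example, using the first reproducer from this thread:
# repeat_until_fail 100 cosa kola run multipath.day1
```

Unlike the bare `while` loop, this stops after a fixed number of runs, so it can be left unattended, and the last line of output says whether a failure was hit.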
rpmostree.install-uninstall.zip
We have some kind of UEFI debug logging going on in the tests, so there's lots of output even in normal boots. We should figure out why that is.
Anyway, on the reboot where we failed, I see this bit is different:
Also this bit:
Which on a normal successful boot was:
So there's a new "Fedora" boot entry Boot0004 on the broken boot.
On a normal successful boot, we select Boot0002:
Whereas in the broken reboot case we boot into Boot0004:
Related to https://github.com/coreos/fedora-coreos-tracker/issues/946 perhaps?
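If someone hits this again and can get a shell on the affected machine, something like the following could confirm which entry the firmware actually booted without digging through the UEFI debug output. It is only a sketch: `current_boot_entry` is a hypothetical helper name, and it assumes the usual `efibootmgr` output layout (a `BootCurrent: NNNN` line plus `BootNNNN* Label` lines, optionally followed by a tab and the device path).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `efibootmgr` output on stdin and print the
# entry named by BootCurrent together with its label.
current_boot_entry() {
    awk '
        /^BootCurrent:/ { cur = $2 }
        # Entry lines: "BootNNNN", then "*" (active) or " " (inactive), then a space.
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][* ] / {
            label = substr($0, 11)      # label starts after "BootNNNN* "
            sub(/\t.*/, "", label)      # drop the device path, if printed
            labels[substr($0, 5, 4)] = label
        }
        END {
            if (cur == "") exit 1       # no BootCurrent line found
            print "Boot" cur ": " labels[cur]
        }
    '
}

# Example:
# efibootmgr | current_boot_entry
```

Running this on both a good and a bad boot would make it easy to compare the selected entry (Boot0002 versus Boot0004 in the logs above).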