Open anonymouse64 opened 4 years ago
@anonymouse64 I have taken a look at this, but couldn't reproduce exactly. If I create a kernel image with dd if=/dev/zero of=kernel.efi count=1
(512 null bytes), the error I get is
Failed to boot both default and fallback entries.
Press any key to continue...
and then I get to the grub menu, where I need to force a reboot to move forward and get the kernel reverted. To me, the double free
seems like a grub bug while doing something with kernel.efi
, that would need to be solved first, and then we would probably get the "Failed to boot both..." error. It would be nice if you can share the kernel.efi
that triggered the issue.
Anyway, I tried grub's fallback feature with this grub configuration:
https://gist.github.com/alfonsosanchezbeato/8ad42596cd06a5986a9aea16c815f841
which seems to work. When trying the kernel.efi
created by dd
, it is able to boot the old kernel without interactions with the user (grub menu is not shown), and the new kernel snap is reverted as expected. I think we can have this change in snapd without backwards compatibility problems. More sophisticated things that do not apply only when updating the kernel would need a bit more of thought, as you mentioned.
Probably we should go through some cases with bad kernel images and give that some testing in the spread tests:
and probably some more...
@alfonsosanchezbeato thanks for testing that, the grub.cfg you posted looks good to me.
I don't have the original kernel.efi which triggered that double free anymore, I don't recall what I did to cause that, I may have just copied a junk file to the kernel.efi rather than all zeros like the dd
command.
All of those cases would be good, we do have this existing spread test which could be modified to handle these additional cases I think: https://github.com/snapcore/snapd/blob/master/tests/nested/core/core20-kernel-failover/task.yaml
Though it is worth wondering whether the full test situations should live in snapd repo or somewhere else... as it seems all those other variants are more about grub's behavior than snapd's, but then again we do have grub.cfg inside snapd now so I suppose it makes sense.
If I intentionally break the kernel.efi[1] in ubuntu-boot on a UC20 system, then grub does not reboot the system and instead hangs with the following message:
Additionally, pressing a key here (at least in QEMU) will drop me to the UEFI menu, where I would then have to manually reboot.
It would be great if grub could reboot automatically in this situation, because this may happen in the wild with a bad kernel snap update (where try-kernel.efi is corrupted like this) and then, if the system were rebooted, the grub.cfg on ubuntu-boot would trigger a rollback to the previous kernel.efi as without user intervention.
[1] I ran
if=/dev/zero of=kernel.efi count=1