Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

GRUB: Extent not found after running bees #249

Open hotburger opened 1 year ago

hotburger commented 1 year ago

GRUB gives me this error after running bees for a few hours. It happens consistently every time I run bees for long enough. I fix it temporarily by reinstalling the kernel package from a chroot. I'm assuming bees is deduping the kernel, which GRUB doesn't like? Strangely, this doesn't happen to my Arch install on the same partition.

I assume that Gentoo is storing another copy of the kernel somewhere while Arch doesn't. The only other difference from my Arch install is a separate subvol for /boot.

grub error message:

Loading Linux 6.1.11-gentoo-dist ...
error: extent not found.
Loading initial ramdisk ...
error: you need to load the kernel first.

Press any key to continue...

Zygo commented 1 year ago

Some experiments to try to collect more information:

  1. Run btrfs-search-metadata file /path/to/vmlinuz (from python-btrfs package) before and after the failure (i.e. once after reinstalling, and once again when boot fails).
  2. Does it also fail when making a reflink of the kernel, e.g. cp --reflink=always /path/to/vmlinuz /root/foo and then reboot?

I don't know how grub would distinguish one reflink to a file from another, much less be fatally broken by it, so I expect experiment 2 will not trigger a grub failure, and we'll see some anomalous feature (non-zero extent offsets? unsupported compression type? hole in kernel file?) from experiment 1.

Hopefully we get some information that can be turned into an actionable grub bug report.
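
The two experiments above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function names and the `DRY_RUN` switch are hypothetical and exist purely for illustration; the only commands taken from the comment above are `btrfs-search-metadata file` (from the python-btrfs package) and `cp --reflink=always`. The block defines functions and runs nothing by itself.

```shell
#!/bin/sh
# Sketch of the two suggested experiments.
# Function names and the DRY_RUN switch are hypothetical, not part of bees or grub.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Experiment 1: dump the btrfs metadata for one file.
# Run once right after reinstalling the kernel and once again when boot fails,
# then compare the two dumps.
dump_metadata() {
    run btrfs-search-metadata file "$1"
}

# Experiment 2: make a reflink copy of the kernel, then reboot manually
# and check whether grub can still load the original file.
reflink_copy() {
    run cp --reflink=always "$1" "$2"
}
```

With the functions loaded, something like `dump_metadata /path/to/vmlinuz > after-reinstall.log` captures one side of the comparison (the log file name is a placeholder), and `reflink_copy /path/to/vmlinuz /root/foo` performs the reflink test before a reboot. Setting `DRY_RUN=1` first only prints the commands, which is a cheap way to check paths before touching a boot partition.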

hotburger commented 1 year ago

This issue stopped happening for a while, so I couldn't replicate it to gather info. It is happening again though. Creating a reflink did not cause the boot to fail.

vmlinuz-6.2.7-broken.log vmlinuz-6.2.7-fixed.log

Zygo commented 1 year ago

Looks like this is fixed in grub but not released yet:

https://git.savannah.gnu.org/cgit/grub.git/commit/?id=7f4e017a1416bcbdca16de4f923679ec9f003171

Jorropo commented 10 months ago

I had similar boot issues in versions of grub that supposedly have this fixed (it would panic in various random ways). I switched to a 3 partition layout with:

Which works around the problems.

Seems like grub's btrfs implementation is not very good yet.

Trayshar commented 6 months ago

I have the same problem on Manjaro, using kernel 6.6.8-2-MANJARO and grub 2.12. Before entering the grub menu I get this error:

error: start_image() returned 0x800000000000000001.

Failed to boot both default and fallback entries.

Press any key to continue...

I can get into the grub menu after that, but trying to boot results in error: you need to load the kernel first. and the system freezes...

I am now successfully using @Jorropo's workaround

PfannenHans commented 4 months ago

I can confirm this on two separate machines running Arch. Here it is usually amd-ucode.img that gets broken, and the error is: premature end of file. The systems boot if I remove it from the boot entry in Grub. Chrooting into the installation and reinstalling the ucode also fixes it temporarily.

kakra commented 4 months ago

You can set chattr +C on the boot directory before reinstalling the boot loader and see if that helps. bees won't touch file extents created with this flag on. IOW, setting the flag on already existing files changes nothing; only new files inherit the flag from the directory, which is why the reinstall is needed. But this also removes checksum protection from your boot files, so it can only work as a temporary workaround.
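
A minimal sketch of that workaround, assuming /boot lives on the btrfs filesystem in question. The function name and the `DRY_RUN` switch are hypothetical; `chattr +C` and `lsattr -d` are the standard e2fsprogs tools. Reinstalling the boot loader, kernel, or ucode package afterwards is distro-specific and deliberately not shown.

```shell
#!/bin/sh
# Sketch only: set_nocow_dir and DRY_RUN are illustrative names.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mark a directory No_COW so that files created in it later inherit the flag,
# then list the directory's own attributes to confirm that 'C' is set.
# Files that already exist keep their current extents and are not affected.
set_nocow_dir() {
    run chattr +C "$1"
    run lsattr -d "$1"
}
```

After loading the function, `set_nocow_dir /boot` (as root) sets the flag; the boot files then have to be rewritten, for example by reinstalling the relevant packages, before the flag has any effect on them.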