bedrocklinux / bedrocklinux-userland

This tracks development for the things such as scripts and (defaults for) config files for Bedrock Linux
https://bedrocklinux.org
GNU General Public License v2.0

GRUB fails to boot with BTRFS or ZFS #200

Open PrimaryCanary opened 3 years ago

PrimaryCanary commented 3 years ago

It is a known issue that the trio of Bedrock Linux+GRUB+BTRFS or ZFS is unreliable. I'd like to see if I can close this issue.

I imagine there will be three steps to squashing this bug:

  1. Figuring out what is causing it
  2. Fixing it upstream
  3. Getting the GRUB team to accept the changes

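Steps one and three will probably go hand in hand; being able to point the GRUB team to a thorough explanation/investigation (hopefully this issue) will probably get the changes accepted faster.

I've done a little reading about this. I've found these relevant links:

* https://bedrocklinux.org/0.7/feature-compatibility.html#grub-btrfs-zfs
* [#190 (comment)](https://github.com/bedrocklinux/bedrocklinux-userland/issues/190#issuecomment-664529733)
* #157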

Regarding the first link, I've very briefly looked through GRUB's code and seen some of the things mentioned.

Has anything else been discovered outside of the aforementioned links? Any info about this will surely help get it fixed and accepted.

paradigm commented 3 years ago

> It is a known issue that the trio of Bedrock Linux+GRUB+BTRFS or ZFS is unreliable. I'd like to see if I can close this issue.

That'd be great!

> I imagine there will be three steps to squashing this bug:
>
> 1. Figuring out what is causing it
> 2. Fixing it upstream
> 3. Getting the GRUB team to accept the changes

That's how I'm modelling it as well, with one addition: find a way for Bedrock to detect whether or not the fix is in place. Right now Bedrock refuses to hijack a system if it detects the relevant fields in grub.cfg. We should retain that check for people on GRUB releases which retain the issue, but still allow people on newer ones through quietly. It might be as simple as comparing grub-mkconfig --version against the version that includes your fix.
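A minimal sketch of that version comparison, in Python for illustration. It assumes the output of `grub-mkconfig --version` looks like `grub-mkconfig (GRUB) 2.06`, and `ASSUMED_FIXED_VERSION` is a placeholder, since no release is known to contain a fix. Distros that backport patches could make a pure version check misleading, so treat this as a starting point only.

```python
import re

# Hypothetical placeholder: no GRUB release is known to contain a fix yet.
ASSUMED_FIXED_VERSION = (2, 12)


def parse_grub_version(output):
    """Extract the dotted version from `grub-mkconfig --version` output.

    Returns a tuple of ints such as (2, 6), or None if the text does not
    match the assumed "(GRUB) X.Y" layout.
    """
    match = re.search(r"\(GRUB2?\)\s+(\d+(?:\.\d+)*)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_assumed_fix(output, fixed=ASSUMED_FIXED_VERSION):
    """Return True only when the parsed version is at least `fixed`."""
    version = parse_grub_version(output)
    # Unrecognised output: stay conservative and keep the existing check.
    return version is not None and version >= fixed
```

Returning `False` for unrecognised output keeps the current refusal-to-hijack behaviour as the default, which seems the safer failure mode.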

> Steps one and three will probably go hand in hand; being able to point the GRUB team to a thorough explanation/investigation (hopefully this issue) will probably get the changes accepted faster.
>
> I've done a little reading about this. I've found these relevant links:
>
> * https://bedrocklinux.org/0.7/feature-compatibility.html#grub-btrfs-zfs
> * [#190 (comment)](https://github.com/bedrocklinux/bedrocklinux-userland/issues/190#issuecomment-664529733)
> * #157
>
> Regarding the first link, I've very briefly looked through GRUB's code and seen some of the things mentioned.
>
> Has anything else been discovered outside of the aforementioned links? Any info about this will surely help get it fixed and accepted.

Props on doing your homework here. Nothing else discovered that I know of; you're as caught up here as I am. The only thing to add is emphasis on my use of phrases like "appears" and "I think", which weaken the confidence of what I wrote. I did some early investigation here before concluding that (1) it won't be a quick fix and (2) it doesn't require Bedrock-specific knowledge, meaning it's something that should probably be handed off to someone else while I focus on things that require my specific background. It's quite possible my investigation went astray somewhere and my theory about stat() parsing mountinfo and failing to consider the possibility of multiple mounts is incorrect.

I'd be delighted to have this fixed and cannot thank you enough for looking into it.

PrimaryCanary commented 3 years ago

> Nothing else discovered that I know of; you're as caught up here as I am.

In that case, I'll ask around the bug-grub mailing list (thread) and report back here. I'll make sure to mention that it is only the running theory.

> I'd be delighted to have this fixed and cannot thank you enough for looking into it.

:)

Titaniumtown commented 3 years ago

It's been a while, has anyone made any progress?

PrimaryCanary commented 3 years ago

> It's been a while, has anyone made any progress?

I'm still going to complete this but I've been busy with university and computer issues. I'd be surprised if I don't have a pull request open by the end of 2020 and hope to get to it near the end of November.

crazyaccess commented 3 years ago

Still no progress?

dv-anomaly commented 3 years ago

I'm going to assume @PrimaryCanary never got around to this. I'm going to start looking into getting this working on openSUSE. Can anyone provide some insight into what the extra mounts might look like in /proc/self/mountinfo, and the most reliable way of filtering those entries out?
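To make the question concrete, here is a rough Python sketch that parses mountinfo text and reports mount points that appear more than once. The field layout follows proc(5); the `SAMPLE` text is invented for illustration and is not captured from a real Bedrock system, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample, invented for illustration only.
SAMPLE = """\
25 1 0:23 /@ / rw,relatime shared:1 - btrfs /dev/sda2 rw,subvol=/@
60 25 0:23 /@/boot /boot rw,relatime shared:30 - btrfs /dev/sda2 rw,subvol=/@
61 60 0:23 /@/boot /boot rw,relatime shared:30 - btrfs /dev/sda2 rw,subvol=/@
70 25 0:45 / /tmp rw,nosuid - tmpfs tmpfs rw
"""


def parse_mountinfo(text):
    """Parse /proc/<pid>/mountinfo text into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # A lone "-" separates the variable-length optional fields
        # from the filesystem type, source, and super options.
        left, right = line.split(" - ", 1)
        fields = left.split()
        fstype, source = right.split()[:2]
        entries.append({
            "mount_id": int(fields[0]),
            "parent_id": int(fields[1]),
            "root": fields[3],
            "mount_point": fields[4],
            "fstype": fstype,
            "source": source,
        })
    return entries


def stacked_mount_points(entries):
    """Return {mount_point: [entries]} for points listed more than once."""
    by_point = defaultdict(list)
    for entry in entries:
        by_point[entry["mount_point"]].append(entry)
    return {point: found for point, found in by_point.items() if len(found) > 1}
```

On a live system the input would come from reading `/proc/self/mountinfo`. Which of the duplicate entries should win is exactly the open question here, so this sketch only reports them.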

Sounds like I'll need to make some modifications to bypass the aforementioned check in order to hijack as well. I'll get a VM set up to start debugging over the weekend.

dv-anomaly commented 3 years ago

Ok, so I dove into it this morning. After compiling GRUB from source I found that -r / --relative is not a valid argument when running grub-mkrelpath. After trying to parse through the GRUB documentation for a while, I don't think this exists upstream.

Digging further I found that openSUSE seems to be making a number of patches to GRUB, which can be found here. I'm still trying to parse through all of this, but I'm starting to think this may only be an openSUSE issue because of their implementation of snapper.

dv-anomaly commented 3 years ago

These seem to be the most relevant patches:

https://build.opensuse.org/package/view_file/openSUSE:Factory/grub2/grub2-btrfs-04-grub2-install.patch?expand=1

https://build.opensuse.org/package/view_file/openSUSE:Factory/grub2/0001-Unify-the-check-to-enable-btrfs-relative-path.patch?expand=1

I believe the original analysis of why this is failing to be accurate. However, it appears Bedrock's sanity check will be invalid across other distributions. I also don't have a clear understanding of why SUSE made these changes yet. I think it is likely related to snapper. Is it possible this is a non-issue on other distros with Bedrock + BTRFS? Has that ever been tested?

paradigm commented 3 years ago

@BannedPatriot There are two GRUB issues with Bedrock: one specific to OpenSUSE's patches, and one with vanilla/upstream GRUB. This thread, as I understood it, was about the latter item, while you appear to have tracked down the former. These were two separate items in Bedrock's documentation, but apparently the OpenSUSE-specific bit was merged into the other one, which is certainly confusing. I'll see if I can fix that when I get the chance.

The OpenSUSE item was reported here. A similar path to yours was followed, including the conclusion that this was due to non-standard patches. As discussed in that issue, I added a check for this situation to Bedrock's installer. It takes the steps to actually reproduce the issue and avoids false positives.

The general GRUB issue was reported here. Note this was on Manjaro, not OpenSUSE, and per the discussion there this is distinct from the OpenSUSE issue. Also note this is what PrimaryCanary links to in the original post for this issue. I could not reproduce this issue consistently; it seems finicky, possibly due to inconsistent mount point ordering. Bedrock adds a check for the known scenario here. Because the issue sometimes doesn't occur, I could not rely on the installer reproducing it to confirm it is present, and so this check can produce false positives. That is why I requested that PrimaryCanary, or whoever else looks into this, also find a way to detect if a hypothetical future fix is in place.

I consider the general GRUB issue a higher priority, as it affects more distros, and may affect OpenSUSE even if the OpenSUSE-specific one is fixed. However, I'd be delighted to see progress on either, let alone both.

tiziodcaio commented 3 years ago

Are there any workarounds until the problem is fixed? Would rEFInd, for example, work?

paradigm commented 3 years ago

> Are there any workarounds until the problem is fixed?

The GRUB bug is known to manifest with either of these combinations of things:

* Bedrock Linux + GRUB + BTRFS
* Bedrock Linux + GRUB + ZFS

Change any one of those three (except swapping BTRFS and ZFS for each other) and you're good.

> Would rEFInd, for example, work?

Yes, swapping GRUB for rEFInd is fine. It appears the underlying issue is a bug in GRUB. If you don't use GRUB you will not be hit by the bug.

dv-anomaly commented 3 years ago

I'm still looking into this. The original theory I had for a solution isn't working. I have grub building with openSUSE's patches, but I'm a little out of my depth with what's going on in this part of the code. We have a pretty good idea on why it's failing, but I'm struggling to come up with a good solution. Perhaps looking at how some of these other boot loaders function can help. If anyone has some ideas please let me know.

For what it's worth, the inline comments in this part of the code seem to acknowledge their btrfs implementation is a bit of a hack.

paradigm commented 2 years ago

In general namespacing is actively counter-productive for Bedrock's needs, as Bedrock is about integrating things, not segregating them. I've been avoiding it in favor of other solutions where I could find them. However, it's been slowly dawning on me that a lot of Bedrock's compatibility issues come from software that makes the assumption mount namespaces are in use. This includes, notably, GRUB. The GRUB bug here, as I understand it, revolves around the assumption it will see just one /boot mount.

Per-stratum namespaces may solve both this and other issues. There will be some unpleasant trade-offs if we go this route; it's not a straight win, but it's worth serious consideration.

naY9yjoS6ZqhOd35sIFH commented 1 year ago

I have tested a fix using these steps:

  1. Following #251, share /usr/src so that dkms can compile the ZFS kernel module.
  2. Use dracut or genkernel, both reliable initramfs generation tools, to create an initramfs for the next step.
  3. In /etc/default/grub, set GRUB_CMDLINE_LINUX to the correct ZFS dataset path or btrfs subvolume, such as root=ZFS=tank/ROOT/default or rootflags=subvol=rootfs.
  4. Run grub-mkconfig so that the Linux kernel parameters mount the root on ZFS/btrfs correctly.

I can demonstrate this by importing a zvol disk image and booting from it.
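For anyone wanting to script the /etc/default/grub change described above, here is a small Python sketch that works on the file's text rather than touching the file itself. The helper name `set_grub_cmdline` is made up for this example, and it assumes the value contains no double quotes and that the variable is assigned on a single line.

```python
import re


def set_grub_cmdline(config_text, value):
    """Return config_text with GRUB_CMDLINE_LINUX set to `value`.

    Existing assignments are replaced; if there is none, one is appended.
    GRUB_CMDLINE_LINUX_DEFAULT is left alone because the pattern requires
    "=" directly after the variable name.
    """
    new_line = 'GRUB_CMDLINE_LINUX="{}"'.format(value)
    pattern = re.compile(r"^GRUB_CMDLINE_LINUX=.*$", re.MULTILINE)
    if pattern.search(config_text):
        # A function replacement avoids backslash escapes in `value`
        # being interpreted by re.sub.
        return pattern.sub(lambda _match: new_line, config_text)
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + new_line + "\n"


# Example input, invented for illustration.
sample = (
    'GRUB_DEFAULT=0\n'
    'GRUB_CMDLINE_LINUX_DEFAULT="quiet"\n'
    'GRUB_CMDLINE_LINUX=""\n'
)
updated = set_grub_cmdline(sample, "root=ZFS=tank/ROOT/default")
```

After writing the result back, grub-mkconfig still needs to be re-run for the change to reach grub.cfg.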

paradigm commented 1 year ago

I'm not sure your proof is sufficient; the issue is finicky and doesn't manifest consistently, and so even several successful grub-mkconfig runs without the issue don't necessarily indicate the issue isn't still lingering there.

That said, manually setting these parameters in /etc/default/grub and completely removing whatever hooks grub-mkconfig uses to incorrectly auto-populate them makes intuitive sense to me. It's not clear to me that Bedrock can automate doing this for a broad swath of users and potential setups, but it may indeed be adequate for users with enough background to apply this themselves. More investigation is needed here, I think.