Tomas-M / linux-live

Linux Live Kit
http://www.linux-live.org/

Boot failure on nvme drives #210

Closed (blinkenlight closed this issue 2 years ago)

blinkenlight commented 2 years ago

After moving to a motherboard-mounted NVMe SSD from my previous SATA SSD, my previously functional Slax install immediately started failing to boot, with the following message:

* Probing for hardware
* Looking for slax data in /slax .................
Fatal error occured - Could not locate slax data

As @rostok suggested in issue #98, this appears to be a problem caused by blkid, which completely fails to list any of my NVMe partitions when run without parameters. This is obvious when booting with debug enabled: only my SATA drives are probed, again and again, for a slax folder. However, once landed in the post-failure shell, running blkid explicitly pointed at one of the NVMe partitions does return the relevant information correctly, and subsequent runs of just "blkid" (no parameters) also start "seeing" all the NVMe partitions after that.

Unfortunately, simply doing the above then executing /init again does not solve the problem completely - while the previous error does not show up again, I get another error instead shortly after, when the script is trying to execute chroot and the second init:

couldn't find an alternative telinit implementation to spawn

...which seems to have been encountered before, in issue #141. Unfortunately, I'm not sure what to do about this one, and the first problem would still need a manual workaround on every boot anyway.

Tomas-M commented 2 years ago

What do you mean by "running blkid explicitly pointed at one of the NVMe partitions"? Please explain.

blinkenlight commented 2 years ago

Specifically, at the shell prompt, doing either blkid /dev/nvme0n1p4 or even just blkid /dev/nvme* results in subsequent launches of blkid without any parameters reporting all my NVMe partitions as well, next to the SATA ones. Before doing this, blkid reports only the SATA ones.

Tomas-M commented 2 years ago

This is very interesting. Does cat /proc/partitions before and after blkid change too?

blinkenlight commented 2 years ago

It looks like /proc/partitions does not change. It has ALL my partitions even while blkid is still failing to see them. It looks like this right from the start:

major   minor   #blocks     name
254     0       524288      zram0
259     0       488386584   nvme0n1
259     1       160013312   nvme0n1p1
259     2       160013312   nvme0n1p2
259     3       160013312   nvme0n1p3
259     4       8345600     nvme0n1p4
8       0       234431064   sda
8       1       233308160   sda1
8       16      976762584   sdb
8       17      105381888   sdb1
8       19      870842368   sdb3
11      0       1048575     sr0
11      1       1048575     sr1

The NVMe partitions hold a Win7, a WinX, and a Mint OS, plus a stand-alone burg bootloader that shares the fourth partition with slax and allows selecting what to boot. The two SATA drives are a strictly-data SSD and HDD. I'm guessing sr0 and sr1 are my two DVD drives. I repeat, /proc/partitions does not change at all before and after.

However, I need to make a small correction. Running blkid for a specific NVMe partition, such as blkid /dev/nvme0n1p3, results in only that partition being added to the output of a no-argument blkid command. Running blkid for another NVMe partition makes that partition start appearing in no-argument runs as well. Running blkid /dev/nvme*, on the other hand, gets all NVMe partitions "noticed", and they all start appearing in subsequent no-argument blkid runs. Yeah, it's all a bit weird indeed...
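The mismatch described in this comment (devices present in /proc/partitions but absent from a no-argument blkid run) could be checked with something like the sketch below. The function name missing_from_blkid and the idea of saving blkid's output to a file first are assumptions made for illustration; nothing like this is part of Slax itself.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only (not part of linux-live).
# Prints every device listed in a /proc/partitions-style file that does
# not appear in the saved output of a no-argument blkid run.
#   $1: partitions file (for example /proc/partitions)
#   $2: file holding blkid output, one "/dev/name: KEY=..." line per device
missing_from_blkid()
{
   awk -v seen_file="$2" '
      BEGIN {
         # Remember each device path that blkid reported (text before ":").
         while ((getline line < seen_file) > 0) {
            sub(/:.*/, "", line)
            seen[line] = 1
         }
      }
      # Data rows have four columns and start with a numeric major number,
      # which skips the header and any blank line.
      $1 ~ /^[0-9]+$/ && NF == 4 {
         dev = "/dev/" $4
         if (!(dev in seen)) print dev
      }
   ' "$1"
}
```

At the failure shell this might be used as blkid > /tmp/seen followed by missing_from_blkid /proc/partitions /tmp/seen. Whole-disk entries and optical drives may be reported too, since blkid typically prints only devices on which it recognizes a signature.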

Tomas-M commented 2 years ago

It is possible that blkid uses a cache. Please check whether /etc/blkid.tab or /run/blkid/blkid.tab is created or updated.

It is a mystery to me why blkid in Slax does not report all devices found in /proc/partitions (because it should). Maybe the blkid binary is outdated and should be updated. This may take me some time.

blinkenlight commented 2 years ago

The file /etc/blkid.tab does not exist, and it's not created. The file /run/blkid/blkid.tab does exist right from the start, but initially it only contains entries for the zram, sda and sdb partitions. When I run blkid with a specific NVMe partition, that partition gets added to /run/blkid/blkid.tab, or all of them are added at once when I run blkid for /dev/nvme*. Looks like you're right.
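To see at a glance which devices the cache file currently holds, a small sketch like the one below could help. It assumes the cache uses the usual libblkid layout, with one line per device that ends in the device path wrapped in a device element; that layout and the function name cached_devices are assumptions, so check the actual file contents first.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Assumes each cache line looks roughly like:
#   <device DEVNO="0x0801" TIME="..." UUID="..." TYPE="...">/dev/sda1</device>
# Prints just the device paths found in the cache file.
#   $1: cache file (defaults to /run/blkid/blkid.tab)
cached_devices()
{
   sed -n 's|.*>\(/dev/[^<]*\)</device>.*|\1|p' "${1:-/run/blkid/blkid.tab}"
}
```

Running cached_devices before and after blkid /dev/nvme* should then show the NVMe entries appearing, matching what is described above.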

Tomas-M commented 2 years ago

What happens if you remove /run/blkid/blkid.tab prior to running blkid? Does it get created again, and again without nvme devices?

If you are using 64-bit Slax, here is a new initramfs to try. It updates the blkid binary (statically compiled with uclibc today): https://slax.org/upload/initrfs.img Please replace the initrfs.img in Slax with this file and try to boot. We will see what happens.

Tomas-M commented 2 years ago

Hm, the file cannot be downloaded for some unknown reason. Try https://slax.org/initrfs.img instead.

blinkenlight commented 2 years ago

If I remove /run/blkid/blkid.tab, it gets recreated as soon as I run blkid again, but still without any of the NVMe partitions. I tested the downloaded initrfs.img as well; unfortunately it behaves identically to the old one. It still fails to boot, and blkid still discovers the NVMe partitions only after I "show them" to it.

Tomas-M commented 2 years ago

OK, I've reverted to the previous blkid version, since the update didn't help. Please try this now: https://slax.org/initrfs.img I've added a function to probe all devices found in /proc/partitions, so it should boot automatically now.
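For readers following along, the idea of probing every device listed in /proc/partitions might look roughly like the sketch below. This is not the code that went into the initramfs; the function names list_partition_devices and probe_all_devices are made up for illustration, and it simply automates the manual blkid workaround described earlier in this thread.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual linux-live implementation.

# Print the /dev path for every row in a /proc/partitions-style file.
#   $1: partitions file (defaults to /proc/partitions)
list_partition_devices()
{
   # Data rows have four columns and a numeric major number; this skips
   # the header line and the blank line that follows it.
   awk '$1 ~ /^[0-9]+$/ && NF == 4 { print "/dev/" $4 }' "${1:-/proc/partitions}"
}

# Call blkid explicitly on each listed device, so that devices which a
# no-argument blkid run skips get probed (and cached) anyway.
#   $1: partitions file (defaults to /proc/partitions)
probe_all_devices()
{
   list_partition_devices "$1" | while read -r dev; do
      # Only touch nodes that really exist as block devices.
      if [ -b "$dev" ]; then
         blkid "$dev" >/dev/null 2>&1
      fi
   done
   return 0
}
```

Calling probe_all_devices once before the search for the slax folder would then have the same effect as running blkid by hand on each NVMe partition.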

Tomas-M commented 2 years ago

(Make sure to actually re-download the file.)

blinkenlight commented 2 years ago

Yup, this new initrfs.img did the trick, Slax now boots without any problems... nicely done!

Tomas-M commented 2 years ago

Thank you for testing. I will add this to the next Slax release.