fedora-silverblue / issue-tracker

Fedora Silverblue issue tracker
https://fedoraproject.org/atomic-desktops/silverblue/
122 stars 3 forks source link

F37 Silverblue all new updates cause boot failure, dropped to dracut emergency shell #426

Closed Fooze666 closed 1 year ago

Fooze666 commented 1 year ago

Describe the bug Any update applied to my Fedora 37 Silverblue system since Feb 21 2023 fails to boot and drops to the dracut emergency shell. I have tried rebasing to F38 beta to see if it might help but experience the same behavior.

To Reproduce Please describe the steps needed to reproduce the bug:

  1. run command 'rpm-ostree update'
  2. run command 'systemctl reboot'

Expected behavior Successful system boot after updating to new deployment

OS version:

Fedora 37 Silverblue
State: idle
BootedDeployment:
● fedora:fedora/37/x86_64/silverblue
                  Version: 37.20230221.0 (2023-02-21T00:44:37Z)
               BaseCommit: 521691fcb10306daf330ad64fa04ad23097a05b5adaebe2f1ef4c170c679b9e8
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: borgbackup gnome-tweaks jetbrains-mono-fonts langpacks-en openssl protonvpn-cli
                   Pinned: yes

Additional context The version listed above in the rpm-ostree output is the only deployment that will boot for me. Every update I've tried after that point fails to boot. I am attaching a copy of the rdsosreport.txt file generated by dracut. rdsosreport.txt

miabbott commented 1 year ago

From the rdsosreport file, this looks concerning:

[  130.564030] fedora dracut-initqueue[478]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[  130.566165] fedora dracut-initqueue[478]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f39b869c0-a50f-422b-a804-3ba892c413db.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[  130.566165] fedora dracut-initqueue[478]:     [ -e "/dev/disk/by-uuid/39b869c0-a50f-422b-a804-3ba892c413db" ]
[  130.566165] fedora dracut-initqueue[478]: fi"

Looks like the disk at /dev/disk/by-uuid/39b869c0-a50f-422b-a804-3ba892c413db is not starting up successfully

Fooze666 commented 1 year ago

That's what's strange. The disk obviously works. I'm using the system now to wrtie this. This behavior only happens if I try to use a newer deployment than the one I am currently running (listed above)

miabbott commented 1 year ago

You might have to step through the commits in the ostree repo to find out where in the history things started breaking.

For example:

$ ostree refs | grep fedora
fedora:fedora/37/x86_64/silverblue
$ sudo ostree pull --commit-metadata-only --depth=10 fedora:fedora/37/x86_64/silverblue

GPG: Verification enabled, found 1 signature:

  Signature made Wed 01 Mar 2023 09:22:05 PM EST using RSA key ID F55AD3FB5323552A
  Good signature from "Fedora <fedora-37-primary@fedoraproject.org>"
Receiving metadata objects: 1/(estimating) 592 bytes/s 592 bytes                                                                                                                                                                                                

GPG: Verification enabled, found 1 signature:

  Signature made Tue 28 Feb 2023 07:45:18 PM EST using RSA key ID F55AD3FB5323552A
  Good signature from "Fedora <fedora-37-primary@fedoraproject.org>"

...

$ ostree log fedora:fedora/37/x86_64/silverblue                                                                                            
commit 1e8a1ad2c612d98ee1180cb34c58b7f916e5cd99fe2cec453b55828764ea56bd
Parent:  5edb0e8ec183ac2b2e3d7463a8383f38b62992ebdb7cbb4cdf15621ce36eec60
ContentChecksum:  c488c4fe3501a124c6e415d5a3e048aeaeec78aca89890eb152c54b8624cdef5
Date:  2023-03-02 02:21:58 +0000
Version: 37.20230302.0
(no subject)

commit 5edb0e8ec183ac2b2e3d7463a8383f38b62992ebdb7cbb4cdf15621ce36eec60
Parent:  458a0650076eb79d6b34a34587638eec2437f1349bcfd8c59a1575c2c5bad87a
ContentChecksum:  7a6ebdcce0642e5aa2a6d5f10985f2732d7420df823a6dfc51d15b1b770ef3bf
Date:  2023-03-01 00:45:13 +0000
Version: 37.20230301.0
(no subject)

..

From there you can inspect the differences between the commits. Comparing your working commit with the one right after it:

$  rpm-ostree db diff 521691fcb10306daf330ad64fa04ad23097a05b5adaebe2f1ef4c170c679b9e8 1b159658aefe936a0b2f4415810fdf0e7c5cf79815d458e680fe1959379e995c
ostree diff commit from: 521691fcb10306daf330ad64fa04ad23097a05b5adaebe2f1ef4c170c679b9e8
ostree diff commit to:   1b159658aefe936a0b2f4415810fdf0e7c5cf79815d458e680fe1959379e995c
Upgraded:
  conmon 2:2.1.5-1.fc37 -> 2:2.1.6-3.fc37
  distribution-gpg-keys 1.82-1.fc37 -> 1.84-1.fc37
  ibus-m17n 1.4.18-1.fc37 -> 1.4.19-1.fc37
  podman 5:4.4.1-1.fc37 -> 5:4.4.1-3.fc37
  podman-gvproxy 5:4.4.1-1.fc37 -> 5:4.4.1-3.fc37
  rav1e-libs 0.5.1-6.fc37 -> 0.5.1-9.fc37
  sane-backends 1.1.1-10.fc37 -> 1.2.1-1.fc37
  sane-backends-drivers-cameras 1.1.1-10.fc37 -> 1.2.1-1.fc37
  sane-backends-drivers-scanners 1.1.1-10.fc37 -> 1.2.1-1.fc37
  sane-backends-libs 1.1.1-10.fc37 -> 1.2.1-1.fc37

Nothing stands out as a package that would cause a problem.

You could keep diffing the commits until you find a likely culprit. Or start trying to deploy each one to find the breakage:

$ sudo rpm-ostree deploy 1b159658aefe936a0b2f4415810fdf0e7c5cf79815d458e680fe1959379e995c
Validating checksum '1b159658aefe936a0b2f4415810fdf0e7c5cf79815d458e680fe1959379e995c'
1 metadata, 0 content objects fetched; 592 B transferred in 1 seconds; 0 bytes content written
⠁ Receiving objects; 98% (5417/5514) 11.4 MB/s 261.3 MB                                                                                                                                                                                                                                                                       1251 metadata, 4263 content objects fetched; 266145 KiB transferred in 24 seconds; 436.8 MB content written
Receiving objects; 98% (5417/5514) 11.4 MB/s 261.3 MB... done
Checking out tree 1b15965... done
...
miabbott commented 1 year ago

Could you also share journalctl -b from your system when it successfully boots?

Fooze666 commented 1 year ago

I'm attaching the output of journalctl -b on a successful boot. I will try to step through the commits during the day today and see if I can find exactly where the problem begins. successfulboot.txt

miabbott commented 1 year ago

Thanks for the successful boot logs. Stepping through the boot process from the failed + successful scenarios, I found this difference:

Successful:

Mar 03 05:31:35 fedora kernel: ata2: SATA link down (SStatus 0 SControl 300)
Mar 03 05:31:35 fedora kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 03 05:31:35 fedora kernel: ata1.00: ATA-10: WDC WD10SPZX-22Z10T1, 04.01A04, max UDMA/133
Mar 03 05:31:35 fedora kernel: ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
Mar 03 05:31:35 fedora kernel: ata1.00: Features: NCQ-prio
Mar 03 05:31:35 fedora kernel: ata1.00: configured for UDMA/133
Mar 03 05:31:35 fedora kernel: scsi 0:0:0:0: Direct-Access     ATA      WDC WD10SPZX-22Z 1A04 PQ: 0 ANSI: 5
Mar 03 05:31:35 fedora kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
Mar 03 05:31:35 fedora kernel: sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
Mar 03 05:31:35 fedora kernel: sd 0:0:0:0: [sda] 4096-byte physical blocks
Mar 03 05:31:35 fedora kernel: sd 0:0:0:0: [sda] Write Protect is off
Mar 03 05:31:35 fedora kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Mar 03 05:31:35 fedora kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 03 05:31:35 fedora kernel: sd 0:0:0:0: [sda] Preferred minimum I/O size 4096 bytes
Mar 03 05:31:36 fedora kernel:  sda: sda1 sda2 sda3 sda4 sda5 sda6 sda7 sda8
Mar 03 05:31:36 fedora kernel: sd 0:0:0:0: [sda] Attached SCSI disk

Failed:

[    2.119564] fedora kernel: ata2: SATA link down (SStatus 4 SControl 300)
[    2.120201] fedora kernel: ata1: SATA link down (SStatus 4 SControl 300)

Further confirming that a change in the ostree content caused your disk to stop being discovered; though this appears to show that link to the disk is not being activated properly in the kernel. So my gut says there is a kernel that landed sometime after 2023-02-21 which may have changes that broke your environment.

Fooze666 commented 1 year ago

After moving through the commits one by one, I get the following results:

The following 3 commits will boot successfully:

commit 3b45c7b29b62b40e2f4dfca4a4fabaab17edb321f569f7d40e060d72653dc91d Parent: 1b159658aefe936a0b2f4415810fdf0e7c5cf79815d458e680fe1959379e995c ContentChecksum: d4b9bbc40203bb0b6c5875f3a1e3051bee047a7114c4cad3cd78a7968c993687 Date: 2023-02-23 01:27:17 +0000 Version: 37.20230223.0 (no subject)

commit 1b159658aefe936a0b2f4415810fdf0e7c5cf79815d458e680fe1959379e995c Parent: 521691fcb10306daf330ad64fa04ad23097a05b5adaebe2f1ef4c170c679b9e8 ContentChecksum: 051ddc39e6fffebc2c3cb86ed4b295d6b76ab9ad7f1e030e502ff29472c4f2c1 Date: 2023-02-22 09:29:08 +0000 Version: 37.20230222.0 (no subject)

commit 521691fcb10306daf330ad64fa04ad23097a05b5adaebe2f1ef4c170c679b9e8 Parent: 43b8be80d5b3a367d2d4ea8ee2f3e4eff0bbf76c616c3af5c52246f104cb3562 ContentChecksum: 9007b1d64745219321d365f50408bc9cdb8af7cc183e732e689385a0242488b9 Date: 2023-02-21 00:44:37 +0000 Version: 37.20230221.0 (no subject)

When I try anything after the Feb 23 commit, I start getting the boot failures. I tried the following two commits and both fail to boot in the way I described in the original issue.

commit 025246791ade1e8e8b2975a48646295bbcb2e3d42744500de0d53f38fb213570 Parent: 06f3b434db34845102453d4b556f84ba664e982d5c933f164f2ca31985f352a9 ContentChecksum: cc08b4551540c3bf5a16386884ed02f95cc964e5f0fd1dc3334a187ae78e42cb Date: 2023-02-25 02:57:48 +0000 Version: 37.20230225.0 (no subject) Failure

commit 06f3b434db34845102453d4b556f84ba664e982d5c933f164f2ca31985f352a9 Parent: 3b45c7b29b62b40e2f4dfca4a4fabaab17edb321f569f7d40e060d72653dc91d ContentChecksum: 5089939d782d0681eab29e894975d3a9a4c871f2cb4c5fa11dc93417bf584337 Date: 2023-02-24 03:01:58 +0000 Version: 37.20230224.0 (no subject) Failure

I'm not sure what other information I can offer, but I'll gladly try to collect anything that might help with the troubleshooting.

miabbott commented 1 year ago
$ rpm-ostree db diff 3b45c7b29b62b40e2f4dfca4a4fabaab17edb321f569f7d40e060d72653dc91d 06f3b434db34845102453d4b556f84ba664e982d5c933f164f2ca31985f352a9
ostree diff commit from: 3b45c7b29b62b40e2f4dfca4a4fabaab17edb321f569f7d40e060d72653dc91d
ostree diff commit to:   06f3b434db34845102453d4b556f84ba664e982d5c933f164f2ca31985f352a9
Upgraded:
  kernel 6.1.12-200.fc37 -> 6.1.13-200.fc37
  kernel-core 6.1.12-200.fc37 -> 6.1.13-200.fc37
  kernel-modules 6.1.12-200.fc37 -> 6.1.13-200.fc37
  kernel-modules-extra 6.1.12-200.fc37 -> 6.1.13-200.fc37
  python-unversioned-command 3.11.1-3.fc37 -> 3.11.2-1.fc37
  python3 3.11.1-3.fc37 -> 3.11.2-1.fc37
  python3-libs 3.11.1-3.fc37 -> 3.11.2-1.fc37

That's the diff between the Feb 23 commit and the Feb 24 commit. The kernel is in there and seems to give more support for my theory that the kernel was involved.

You could look at the commits for the 6.1.13 kernel that fedora shipped here - https://lore.kernel.org/all/20230220133600.368809650@linuxfoundation.org/

...cross reference them here - https://gitlab.com/cki-project/kernel-ark/-/commits/linux-6.1.y

...which might help identify where it broke.

But I think the best thing to do now is file a BZ against the Fedora kernel with your evidence - https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=kernel

Fooze666 commented 1 year ago

Thank you so much for your help @miabbott

I will try to organize everything and take this over to Bugzilla.

Fooze666 commented 1 year ago

For the sake of anyone who might be having similar trouble and is following this thread, the issue has been filed through Bugzilla at this link: https://bugzilla.redhat.com/show_bug.cgi?id=2175529

travier commented 1 year ago

Closing this one as things will happen in Bugzilla and there is not much for us to do here anymore.