QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

First boot after installing never finishes ("A start job is running for Monitoring of LVM2 mirrors, ...") #7335

Open sjvudp opened 2 years ago

sjvudp commented 2 years ago

How to file a helpful issue

Qubes OS release

4.1.0

Brief summary

Booted the installer, prepared the disks, ran the installation.

All fine, without any error. Then, on boot from the hard disk, nothing happens, i.e. the initrd never "switches root".

Steps to reproduce

Expected behavior

Installation continues

Actual behavior

Boot never finishes (see attached screen photo).

Examining the journal of the failed boots, I found this:

Feb 24 23:52:19 dom0 lvm[1326]:   Device open /dev/sdd1 8:49 failed errno 2
Feb 24 23:52:20 dom0 kernel:  md124: p1
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:1 with no paths.
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:17 with no paths.
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:33 with no paths.
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:49 with no paths.
Feb 24 23:52:20 dom0 dmeventd[3705]: dmeventd ready for processing.
Feb 24 23:52:20 dom0 kernel: lvm[1326]: segfault at 801 ip 0000777003fcfdde sp 00007ffd4db1c028 error 4 in libc-2.31.so[777003e91000+150000]
Feb 24 23:52:20 dom0 kernel: Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21>
Feb 24 23:52:20 dom0 kernel: audit: type=1701 audit(1645743140.034:101): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1326 comm="lvm" exe="/usr/sbin/lvm" sig=11 res=1
Feb 24 23:52:20 dom0 audit[1326]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1326 comm="lvm" exe="/usr/sbin/lvm" sig=11 res=1
Feb 24 23:52:20 dom0 lvm[3705]: Monitoring thin pool qubes_dom0-pool00-tpool.
Feb 24 23:52:20 dom0 lvm[2561]:   3 logical volume(s) in volume group "qubes_dom0" now active
Feb 24 23:52:20 dom0 systemd[1]: Finished LVM event activation on device 253:0.

That segfault doesn't look good!

The last things that seem to happen on boot are:

Feb 24 23:52:22 dom0 systemd[1]: Finished udev Wait for Complete Device Initialization.
Feb 24 23:52:22 dom0 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-udev-settle comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 24 23:52:22 dom0 kernel: audit: type=1130 audit(1645743142.001:103): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-udev-settle comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=>
Feb 24 23:52:22 dom0 systemd[1]: Starting Activation of DM RAID sets...
Feb 24 23:52:22 dom0 systemd[1]: dmraid-activation.service: Succeeded.
Feb 24 23:52:22 dom0 systemd[1]: Finished Activation of DM RAID sets.
Feb 24 23:52:22 dom0 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 24 23:52:22 dom0 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 24 23:52:22 dom0 kernel: audit: type=1130 audit(1645743142.797:104): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=su>
Feb 24 23:52:22 dom0 kernel: audit: type=1131 audit(1645743142.797:105): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=su>

A key indicator might be "kernel: md124: p1", which could mean mdadm was built without IRST support: usually there are four md devices (two per IRST RAID: one real RAID device and one IRST container pseudo-device). Maybe LVM chokes on the pseudo-RAID.

Additional info (that might be important)

The system has two IRST software RAID1 arrays (one for Windows, one for Linux), but none is used for Qubes OS, which is on a separate non-RAID disk.
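
If LVM really is choking on the IMSM container members, one possible workaround (just a sketch, not verified against this crash; the device names are the ones from this machine) would be to exclude the raw IRST member disks from LVM scanning and rebuild the initramfs so early boot picks up the same filter:

# In /etc/lvm/lvm.conf, inside the devices { } section, reject the raw
# IRST member disks (sda-sdd here) and accept everything else:
#     global_filter = [ "r|^/dev/sd[a-d]$|", "a|.*|" ]
# Then regenerate the initramfs so the filter also applies in early boot:
dracut -f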

sjvudp commented 2 years ago

I wonder: could I easily enable core dumps on the system, and would it be helpful to have the actual core dump? Or should I replace the failing lvm binary with a different one?

ghost commented 2 years ago

Can you describe how you did the custom partitioning?

sjvudp commented 2 years ago

Can you describe how you did the custom partitioning?

Basically I was following https://www.qubes-os.org/doc/custom-install/, with the exception that I skipped partitioning because the correct partitions already existed. So I started by creating a new LUKS device on an existing partition.

ghost commented 2 years ago

Most likely it is because you have a wrong partition layout or a bad LVM config; maybe try zeroing the disk first, then follow the guide again.

sjvudp commented 2 years ago

Most likely it is because you have a wrong partition layout or a bad LVM config; maybe try zeroing the disk first, then follow the guide again.

I'm unsure whether speculation is the best way to solve the issue: I can boot Windows 10 without a problem, I can boot Linux (openSUSE Leap 15.3) without a problem, I can boot Tails without a problem, and I can boot Qubes OS 4.0 without a problem. So I might conclude that there is no problem with the four disks in the PC (sda-sdd). Zeroing the disks without a really good reason is like those installation instructions for MS-DOS software back in the 1990s that started with "format c:"...

sjvudp commented 2 years ago

Here are some details: The disk's partitions are:

# fdisk -l /dev/sdg
Disk /dev/sdg: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Disk model: Generic         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x36a23d8a

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sdg1  *         2048   1435647   1433600   700M 83 Linux
/dev/sdg2         1435648 211150847 209715200   100G 83 Linux
/dev/sdg3       211150848 630581247 419430400   200G 83 Linux
/dev/sdg4       630581248 976773167 346191920 165.1G da Non-FS data

Partitions 1 and 2 are the relevant ones:

# blkid /dev/sdg*
/dev/sdg: PTUUID="36a23d8a" PTTYPE="dos"
/dev/sdg1: LABEL="Boot" UUID="f739b964-640b-4381-b47b-0c8b74bb69ee" TYPE="ext4" PARTUUID="36a23d8a-01"
/dev/sdg2: UUID="a10e21f9-2581-47f7-819a-ec06fde599a1" TYPE="crypto_LUKS" PARTUUID="36a23d8a-02"
/dev/sdg3: UUID="ce6f1f45-e9a8-4609-9b55-c4ee7eeb2938" TYPE="crypto_LUKS" PARTUUID="36a23d8a-03"
/dev/sdg4: PARTUUID="36a23d8a-04"

The corresponding command line from GRUB is:

module2 /vmlinuz-5.10.90-1.fc32.qubes.x86_64 placeholder root=/dev/mapper/qubes_dom0-root ro rd.luks.uuid=luks-a10e21f9-2581-47f7-819a-ec06fde599a1 rd.lvm.lv=qubes_dom0/root rd.lvm.lv=qubes_dom0/swap plymouth.ignore-serial-consoles i915.alpha_support=1 rd.driver.pre=btrfs rhgb quiet

I can decrypt and mount the root volume without a problem (from Tails):

# cryptsetup luksOpen /dev/sdg2 crypt
Enter passphrase for /dev/sdg2: 
# vgs
  VG         #PV #LV #SN Attr   VSize  VFree
  qubes_dom0   1   3   0 wz--n- 99.98g    0 
# pvs
  PV                VG         Fmt  Attr PSize  PFree
  /dev/mapper/crypt qubes_dom0 lvm2 a--  99.98g    0 

# lvs
  LV     VG         Attr       LSize  Pool   Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pool00 qubes_dom0 twi-aotz-- 89.80g               10.53  15.48                           
  root   qubes_dom0 Vwi-a-tz-- 98.80g pool00        9.57                                   
  swap   qubes_dom0 -wi-a----- 10.00g                                           
# umount /mnt
# mount /dev/qubes_dom0/root /mnt
# ls -l /mnt
total 76
lrwxrwxrwx    1 root root     7 Jan 28  2020 bin -> usr/bin
drwxr-xr-x    2 root root  4096 Feb 24 23:42 boot
drwxr-xr-x    2 root root  4096 Feb 24 23:42 dev
drwxr-xr-x  102 root root  4096 Mar  6 01:00 etc
drwxr-xr-x    3 root root  4096 Feb 24 23:47 home
lrwxrwxrwx    1 root root     7 Jan 28  2020 lib -> usr/lib
lrwxrwxrwx    1 root root     9 Jan 28  2020 lib64 -> usr/lib64
drwx------.   2 root root 16384 Feb 24 23:42 lost+found
drwxr-xr-x    2 root root  4096 Jan 28  2020 media
drwxr-xr-x    2 root root  4096 Jan 28  2020 mnt
drwxr-xr-x    2 root root  4096 Jan 28  2020 opt
drwxr-xr-x    2 root root  4096 Feb 24 23:42 proc
dr-xr-x---    2 root root  4096 Feb 24 23:49 root
drwxr-xr-x    2 root root  4096 Feb 24 23:42 run
lrwxrwxrwx    1 root root     8 Jan 28  2020 sbin -> usr/sbin
drwxr-xr-x    6 root root  4096 Feb 24 23:44 srv
drwxr-xr-x    2 root root  4096 Feb 24 23:42 sys
drwxrwxrwt    2 root root  4096 Feb 24 23:48 tmp
drwxr-xr-x   12 root root  4096 Feb 24 23:43 usr
drwxr-xr-x   18 root root  4096 Feb 24 23:43 var

sjvudp commented 2 years ago

Running Qubes OS 4.0, my four IRST disks look like this:

[root@dom0 master]# cat /proc/mdstat 
Personalities : [raid1] 
md124 : active (auto-read-only) raid1 sdc[1] sdd[0]
      976748544 blocks super external:/md125/0 [2/2] [UU]

md125 : inactive sdc[1](S) sdd[0](S)
      6306 blocks super external:imsm

md126 : active (auto-read-only) raid1 sdb[1] sda[0]
      1953497088 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive sdb[1](S) sda[0](S)
      6306 blocks super external:imsm

unused devices: <none>

And here are the details:

[root@dom0 master]# mdadm -E /dev/md124
/dev/md124:
   MBR Magic : aa55
Partition[0] :    973078528 sectors at         2048 (type 07)
[root@dom0 master]# mdadm -E /dev/md125
/dev/md125:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 3c238f66
         Family : 3c238f66
     Generation : 0022ca9e
     Attributes : All supported
           UUID : ac1aeea1:1f4da1eb:dd658400:4fb4228b
       Checksum : 34e02dc2 correct
    MPB Sectors : 1
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : JA10001F1V5JGM
          State : active
             Id : 00000004
    Usable Size : 1953518862 (931.51 GiB 1000.20 GB)

[RAID1_1T]:
           UUID : 7722150f:300ac03c:a80fc2e1:9e86eeaf
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 1
     Array Size : 1953497088 (931.50 GiB 1000.19 GB)
   Per Dev Size : 1953497352 (931.50 GiB 1000.19 GB)
  Sector Offset : 0
    Num Stripes : 7630848
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean

  Disk00 Serial : JA10001F1V3JDM
          State : active
             Id : 00000003
    Usable Size : 1953518862 (931.51 GiB 1000.20 GB)
[root@dom0 master]# mdadm -E /dev/md126
/dev/md126:
   MBR Magic : aa55
Partition[0] :    524288000 sectors at         2048 (type 07)
Partition[1] :   1073741824 sectors at    524290048 (type 07)
Partition[2] :   1073741824 sectors at   1598031872 (type 07)
Partition[3] :   1235220480 sectors at   2671773696 (type 05)
[root@dom0 master]# mdadm -E /dev/md127
/dev/md127:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 2b19b6dc
         Family : 2b19b6dc
     Generation : 00158e2d
     Attributes : All supported
           UUID : 87deeede:0bdbdaed:d994ce4d:e2ad4cf8
       Checksum : 56c1ecdc correct
    MPB Sectors : 1
          Disks : 2
   RAID Devices : 1

  Disk00 Serial : WD-WCC4N7KUR3XK
          State : active
             Id : 00000001
    Usable Size : 3907022862 (1863.01 GiB 2000.40 GB)

[RAID1_2T]:
           UUID : fe1588c4:a2140388:5b4117ba:4e4339b9
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 0
     Array Size : 3906994176 (1863.00 GiB 2000.38 GB)
   Per Dev Size : 3906994440 (1863.00 GiB 2000.38 GB)
  Sector Offset : 0
    Num Stripes : 15261696
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean

  Disk01 Serial : WD-WCC4M3PEVR3R
          State : active
             Id : 00000000
    Usable Size : 3907022862 (1863.01 GiB 2000.40 GB)
[root@dom0 master]# 

sjvudp commented 2 years ago

While no fixed installation medium is available, I'd suggest adding something helpful to https://www.qubes-os.org/doc/installation-troubleshooting/ (the existing section https://www.qubes-os.org/doc/installation-troubleshooting/#warning-dracut-initqueue-timeout---starting-timeout-scripts-during-installation isn't helpful, as it refers to the initial installer boot, not the first boot after installation, where the VMs are configured).

ghost commented 2 years ago

Try with a single disk first:

cryptsetup -c aes-xts-plain64 -h sha512 -s 512 luksFormat /dev/sda2
cryptsetup luksOpen /dev/sda2 luks
pvcreate /dev/mapper/luks
vgcreate qubes_dom0 /dev/mapper/luks
lvcreate -n swap -L 4G qubes_dom0
lvcreate -T -L 30G qubes_dom0/root-pool
lvcreate -T -l +95%FREE qubes_dom0/vm-pool
lvs
lvcreate -V30G -T qubes_dom0/root-pool -n root
lvcreate -V<size of your vm pool>G -T qubes_dom0/vm-pool -n vm
mkfs.ext4 /dev/qubes_dom0/vm

https://forum.qubes-os.org/t/4-1-installer-lvm-partitioning-hard-to-customize-missing-space/6155/5

sjvudp commented 2 years ago

Is https://github.com/QubesOS/qubes-issues/issues/7335#issuecomment-1079844550 suggesting that I unplug all the other disks? I don't get it: the installer completed the installation, but the first boot fails to continue. I don't think my initial Qubes OS disk setup is the problem.

DemiMarie commented 2 years ago

Would it be possible to get a backtrace from the core dump?

sjvudp commented 2 years ago

Would it be possible to get a backtrace from the core dump?

By default the segfault does not create a core dump, and I'm not deep enough into this: is it possible to enable core dumps in the initrd via some kernel command-line boot parameter?

DemiMarie commented 2 years ago

Would it be possible to get a backtrace from the core dump?

By default the segfault does not create a core dump, and I'm not deep enough into this: is it possible to enable core dumps in the initrd via some kernel command-line boot parameter?

I’m not sure, but it is probably in some documentation.
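
One possible way to capture a core (a sketch only, untested on the Qubes initrd): boot with the debug shell enabled, raise the core limit, point core_pattern at a writable location, and then try to trigger the crash again:

# In the emergency/debug shell inside the initrd:
ulimit -c unlimited                                      # allow core dumps in this shell
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern   # write cores to /tmp
/usr/sbin/lvm vgscan                                     # a guess at a trigger: re-run a full LVM device scan
# If lvm segfaults again, copy the core file from /tmp somewhere persistent
# (e.g. the mounted /boot partition) so it can be inspected with gdb later.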

DemiMarie commented 2 years ago

@sjvudp can you try systemd.mask=lvm2-monitor.service? That prevents the service from being loaded.
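
For a one-off test, the parameter can be added from the GRUB menu at boot (a sketch using GRUB's default key bindings):

# At the GRUB menu: highlight the Qubes entry and press 'e', then append
#     systemd.mask=lvm2-monitor.service
# to the line that loads the dom0 kernel (the "module2 /vmlinuz-..." line on
# this Xen/multiboot setup), and press Ctrl-x to boot the modified entry.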

ghost commented 2 years ago

No, just create the partitions like that and follow the commands above. A step-by-step guide with pictures is here: https://forum.qubes-os.org/t/qubes-os-installation-detached-encrypted-boot-and-header/6205

Additional storage can be added after the install, or you could reinstall and add more devices to the volume group.

sjvudp commented 2 years ago

@sjvudp can you try systemd.mask=lvm2-monitor.service? That prevents the service from being loaded.

That did it! However, even after having installed all updates, I need that parameter on every boot!
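
If editing the boot entry by hand every time gets old, here is a sketch for making the parameter persistent on a Fedora-based dom0 (assuming the usual GRUB2 workflow; the grub.cfg path differs between legacy-BIOS and EFI installs):

# Edit /etc/default/grub and append the parameter to GRUB_CMDLINE_LINUX, e.g.:
#     GRUB_CMDLINE_LINUX="... systemd.mask=lvm2-monitor.service"
# then regenerate the GRUB configuration as root (legacy-BIOS path shown;
# EFI installs typically use /boot/efi/EFI/qubes/grub.cfg instead):
grub2-mkconfig -o /boot/grub2/grub.cfg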

sjvudp commented 2 years ago

There is also a strange effect when shutting down: the system waits for some LVM deactivation for a rather long time, delaying the shutdown or reboot.

sjvudp commented 1 year ago

When running a debug shell on tty9 during boot (via "systemd.debug_shell=1"), I realized that a "vgchange --monitor y" seems to hang. However, when I enter that command manually, I get a syntax error (it wants a VG name). When I use "vgchange --monitor y qubes_dom0", the command exits very quickly. So it looks to me as if systemd runs an incorrect command.

(screenshot attached)
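
To pin down exactly what systemd invokes there, a quick check from the debug shell (assuming it is the stock lvm2-monitor unit that hangs):

# Show the unit file, including its ExecStart line:
systemctl cat lvm2-monitor.service
# List the queued start jobs to see which one is blocking the boot:
systemctl list-jobs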

Eric678 commented 1 year ago

Also having this problem, after an in-place upgrade from 4.0 to 4.1. Everything seemed to go fine (except for the Debian qubes); the first boot after the upgrade goes into an infinite wait for lvm2. Unfortunately, GRUB appears not to be working correctly, so I cannot use the workaround mentioned above. I tried disconnecting a couple of disk drives that form a BIOS RAID1 (mirrored) and are not part of Qubes, and the 4.1 system comes up fine; my guess is that lvm2-monitor is looking in the wrong place? Any suggestions on how I might work around this and get that RAID array back up? Thanks.

sjvudp commented 1 year ago

I cannot use the workaround mentioned above.

Well, if you cannot do it interactively, you could do it the hard way: boot some rescue system, mount the /boot filesystem, and then edit the GRUB menu file (something like /boot/grub2/grub.cfg), looking for the "linux" lines.
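
A sketch of that hard way, using the partition layout posted earlier (the "Boot" partition was /dev/sdg1 there; adjust the device name and grub.cfg path to your system):

# From a rescue/live system:
mount /dev/sdg1 /mnt                  # the ext4 "Boot" partition
vi /mnt/grub2/grub.cfg                # append systemd.mask=lvm2-monitor.service
                                      #   to the kernel ("linux"/"module2") lines
umount /mnt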