QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

First boot after installing never finishes ("A start job is running for Monitoring of LVM2 mirrors, ...") #7335

Open sjvudp opened 2 years ago

sjvudp commented 2 years ago

Qubes OS release

4.1.0

Brief summary

Booted the installer, prepared the disks, ran the installation.

All fine without any error. Then on boot from the hard disk nothing happens, i.e. the initrd never "switches root".

Steps to reproduce

Expected behavior

Installation continues

Actual behavior

Boot never finishes (see attached screen photo).

Examining the journal of the failed boots, I found this:

Feb 24 23:52:19 dom0 lvm[1326]:   Device open /dev/sdd1 8:49 failed errno 2
Feb 24 23:52:20 dom0 kernel:  md124: p1
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:1 with no paths.
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:17 with no paths.
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:33 with no paths.
Feb 24 23:52:20 dom0 lvm[1326]:   WARNING: Scan ignoring device 8:49 with no paths.
Feb 24 23:52:20 dom0 dmeventd[3705]: dmeventd ready for processing.
Feb 24 23:52:20 dom0 kernel: lvm[1326]: segfault at 801 ip 0000777003fcfdde sp 00007ffd4db1c028 error 4 in libc-2.31.so[777003e91000+150000]
Feb 24 23:52:20 dom0 kernel: Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21>
Feb 24 23:52:20 dom0 kernel: audit: type=1701 audit(1645743140.034:101): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1326 comm="lvm" exe="/usr/sbin/lvm" sig=11 res=1
Feb 24 23:52:20 dom0 audit[1326]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 pid=1326 comm="lvm" exe="/usr/sbin/lvm" sig=11 res=1
Feb 24 23:52:20 dom0 lvm[3705]: Monitoring thin pool qubes_dom0-pool00-tpool.
Feb 24 23:52:20 dom0 lvm[2561]:   3 logical volume(s) in volume group "qubes_dom0" now active
Feb 24 23:52:20 dom0 systemd[1]: Finished LVM event activation on device 253:0.

That segfault doesn't look good!

The last things that seem to happen on boot are:

Feb 24 23:52:22 dom0 systemd[1]: Finished udev Wait for Complete Device Initialization.
Feb 24 23:52:22 dom0 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-udev-settle comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 24 23:52:22 dom0 kernel: audit: type=1130 audit(1645743142.001:103): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-udev-settle comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=>
Feb 24 23:52:22 dom0 systemd[1]: Starting Activation of DM RAID sets...
Feb 24 23:52:22 dom0 systemd[1]: dmraid-activation.service: Succeeded.
Feb 24 23:52:22 dom0 systemd[1]: Finished Activation of DM RAID sets.
Feb 24 23:52:22 dom0 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 24 23:52:22 dom0 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 24 23:52:22 dom0 kernel: audit: type=1130 audit(1645743142.797:104): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=su>
Feb 24 23:52:22 dom0 kernel: audit: type=1131 audit(1645743142.797:105): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dmraid-activation comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=su>

A key indicator might be "kernel: md124: p1", which could mean that mdadm was built without IRST support: usually there are four md devices (two per IRST RAID: one real RAID and one IRST container pseudo-RAID). Maybe LVM chokes on the pseudo-RAID.
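One way to check (e.g. from a rescue system or a debug shell) whether mdadm recognizes the IRST/IMSM setup might be something like the following; this is only a sketch, and /dev/sdc is just assumed to be one of the IRST member disks:

cat /proc/mdstat          # should list both the IMSM containers and the RAID1 volumes built on them
mdadm --detail-platform   # reports whether this mdadm/firmware combination supports IMSM (IRST)
mdadm -E /dev/sdc         # metadata of one member disk; for IRST it should show "Intel Raid ISM Cfg Sig."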

Additional info (that might be important)

The system has two IRST software RAID1 arrays (one for Windows, one for Linux), but none for Qubes OS (which is on a different, non-RAID disk).

sjvudp commented 2 years ago

I wonder: could I easily enable core dumps in the system, and would it be helpful to have the actual core dump? Or should I get a different lvm binary to replace the one that is failing?
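For reference, a minimal sketch of what I have in mind, assuming systemd-coredump is available in dom0 (I am not sure whether this also covers the early-boot stage where the crash happens):

coredumpctl list lvm      # any captured crashes of /usr/sbin/lvm
coredumpctl info lvm      # details of the most recent lvm crash
coredumpctl gdb lvm       # open the most recent dump in gdb (needs gdb and lvm2 debuginfo)

Without systemd-coredump, classic core files could be enabled instead, e.g.:

ulimit -c unlimited
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p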

ghost commented 2 years ago

Can you describe how you did the custom partitioning?

sjvudp commented 2 years ago

Can you describe how you did the custom partitioning?

Basically I was following https://www.qubes-os.org/doc/custom-install/ with the exception that I skipped partitioning, as the correct partitions already existed. So I started by creating a new LUKS device on an existing partition.

ghost commented 2 years ago

Most likely it is because you have a wrong partition layout or a bad LVM config; maybe zero the disk first, then follow the guide again.

sjvudp commented 2 years ago

Most likely it is because you have a wrong partition layout or a bad LVM config; maybe zero the disk first, then follow the guide again.

I'm unsure whether speculation is the best way to solve the issue: I can boot Windows 10 without a problem, I can boot Linux (openSUSE Leap 15.3) without a problem, I can boot Tails without a problem, and I can boot Qubes OS 4.0 without a problem. So I conclude that there is no problem with the four disks in the PC (sda-sdd). Zeroing the disks without a really good reason is like those installation instructions for MS-DOS software back in the 1990s that started with "format c:"...

sjvudp commented 2 years ago

Here are some details: The disk's partitions are:

# fdisk -l /dev/sdg
Disk /dev/sdg: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Disk model: Generic         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x36a23d8a

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sdg1  *         2048   1435647   1433600   700M 83 Linux
/dev/sdg2         1435648 211150847 209715200   100G 83 Linux
/dev/sdg3       211150848 630581247 419430400   200G 83 Linux
/dev/sdg4       630581248 976773167 346191920 165.1G da Non-FS data

Partitions 1 and 2 are the relevant ones:

# blkid /dev/sdg*
/dev/sdg: PTUUID="36a23d8a" PTTYPE="dos"
/dev/sdg1: LABEL="Boot" UUID="f739b964-640b-4381-b47b-0c8b74bb69ee" TYPE="ext4" PARTUUID="36a23d8a-01"
/dev/sdg2: UUID="a10e21f9-2581-47f7-819a-ec06fde599a1" TYPE="crypto_LUKS" PARTUUID="36a23d8a-02"
/dev/sdg3: UUID="ce6f1f45-e9a8-4609-9b55-c4ee7eeb2938" TYPE="crypto_LUKS" PARTUUID="36a23d8a-03"
/dev/sdg4: PARTUUID="36a23d8a-04"

The corresponding command line from GRUB is:

module2 /vmlinuz-5.10.90-1.fc32.qubes.x86_64 placeholder root=/dev/mapper/qubes_dom0-root ro rd.luks.uuid=luks-a10e21f9-2581-47f7-819a-ec06fde599a1 rd.lvm.lv=qubes_dom0/root rd.lvm.lv=qubes_dom0/swap plymouth.ignore-serial-consoles i915.alpha_support=1 rd.driver.pre=btrfs rhgb quiet

I can decrypt and mount the root volume without a problem (from Tails):

# cryptsetup luksOpen /dev/sdg2 crypt
Enter passphrase for /dev/sdg2: 
# vgs
  VG         #PV #LV #SN Attr   VSize  VFree
  qubes_dom0   1   3   0 wz--n- 99.98g    0 
# pvs
  PV                VG         Fmt  Attr PSize  PFree
  /dev/mapper/crypt qubes_dom0 lvm2 a--  99.98g    0 

# lvs
  LV     VG         Attr       LSize  Pool   Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pool00 qubes_dom0 twi-aotz-- 89.80g               10.53  15.48                           
  root   qubes_dom0 Vwi-a-tz-- 98.80g pool00        9.57                                   
  swap   qubes_dom0 -wi-a----- 10.00g                                           
# umount /mnt
# mount /dev/qubes_dom0/root /mnt
# ls -l /mnt
total 76
lrwxrwxrwx    1 root root     7 Jan 28  2020 bin -> usr/bin
drwxr-xr-x    2 root root  4096 Feb 24 23:42 boot
drwxr-xr-x    2 root root  4096 Feb 24 23:42 dev
drwxr-xr-x  102 root root  4096 Mar  6 01:00 etc
drwxr-xr-x    3 root root  4096 Feb 24 23:47 home
lrwxrwxrwx    1 root root     7 Jan 28  2020 lib -> usr/lib
lrwxrwxrwx    1 root root     9 Jan 28  2020 lib64 -> usr/lib64
drwx------.   2 root root 16384 Feb 24 23:42 lost+found
drwxr-xr-x    2 root root  4096 Jan 28  2020 media
drwxr-xr-x    2 root root  4096 Jan 28  2020 mnt
drwxr-xr-x    2 root root  4096 Jan 28  2020 opt
drwxr-xr-x    2 root root  4096 Feb 24 23:42 proc
dr-xr-x---    2 root root  4096 Feb 24 23:49 root
drwxr-xr-x    2 root root  4096 Feb 24 23:42 run
lrwxrwxrwx    1 root root     8 Jan 28  2020 sbin -> usr/sbin
drwxr-xr-x    6 root root  4096 Feb 24 23:44 srv
drwxr-xr-x    2 root root  4096 Feb 24 23:42 sys
drwxrwxrwt    2 root root  4096 Feb 24 23:48 tmp
drwxr-xr-x   12 root root  4096 Feb 24 23:43 usr
drwxr-xr-x   18 root root  4096 Feb 24 23:43 var

sjvudp commented 2 years ago

Running Qubes OS 4.0, my four IRST disks look like this:

[root@dom0 master]# cat /proc/mdstat 
Personalities : [raid1] 
md124 : active (auto-read-only) raid1 sdc[1] sdd[0]
      976748544 blocks super external:/md125/0 [2/2] [UU]

md125 : inactive sdc[1](S) sdd[0](S)
      6306 blocks super external:imsm

md126 : active (auto-read-only) raid1 sdb[1] sda[0]
      1953497088 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive sdb[1](S) sda[0](S)
      6306 blocks super external:imsm

unused devices: <none>

And here are the details:

[root@dom0 master]# mdadm -E /dev/md124
/dev/md124:
   MBR Magic : aa55
Partition[0] :    973078528 sectors at         2048 (type 07)
[root@dom0 master]# mdadm -E /dev/md125
/dev/md125:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 3c238f66
         Family : 3c238f66
     Generation : 0022ca9e
     Attributes : All supported
           UUID : ac1aeea1:1f4da1eb:dd658400:4fb4228b
       Checksum : 34e02dc2 correct
    MPB Sectors : 1
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : JA10001F1V5JGM
          State : active
             Id : 00000004
    Usable Size : 1953518862 (931.51 GiB 1000.20 GB)

[RAID1_1T]:
           UUID : 7722150f:300ac03c:a80fc2e1:9e86eeaf
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 1
     Array Size : 1953497088 (931.50 GiB 1000.19 GB)
   Per Dev Size : 1953497352 (931.50 GiB 1000.19 GB)
  Sector Offset : 0
    Num Stripes : 7630848
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean

  Disk00 Serial : JA10001F1V3JDM
          State : active
             Id : 00000003
    Usable Size : 1953518862 (931.51 GiB 1000.20 GB)
[root@dom0 master]# mdadm -E /dev/md126
/dev/md126:
   MBR Magic : aa55
Partition[0] :    524288000 sectors at         2048 (type 07)
Partition[1] :   1073741824 sectors at    524290048 (type 07)
Partition[2] :   1073741824 sectors at   1598031872 (type 07)
Partition[3] :   1235220480 sectors at   2671773696 (type 05)
[root@dom0 master]# mdadm -E /dev/md127
/dev/md127:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 2b19b6dc
         Family : 2b19b6dc
     Generation : 00158e2d
     Attributes : All supported
           UUID : 87deeede:0bdbdaed:d994ce4d:e2ad4cf8
       Checksum : 56c1ecdc correct
    MPB Sectors : 1
          Disks : 2
   RAID Devices : 1

  Disk00 Serial : WD-WCC4N7KUR3XK
          State : active
             Id : 00000001
    Usable Size : 3907022862 (1863.01 GiB 2000.40 GB)

[RAID1_2T]:
           UUID : fe1588c4:a2140388:5b4117ba:4e4339b9
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 0
     Array Size : 3906994176 (1863.00 GiB 2000.38 GB)
   Per Dev Size : 3906994440 (1863.00 GiB 2000.38 GB)
  Sector Offset : 0
    Num Stripes : 15261696
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean

  Disk01 Serial : WD-WCC4M3PEVR3R
          State : active
             Id : 00000000
    Usable Size : 3907022862 (1863.01 GiB 2000.40 GB)
[root@dom0 master]# 

sjvudp commented 2 years ago

While no fixed installation medium is available, I'd suggest adding something helpful to https://www.qubes-os.org/doc/installation-troubleshooting/ (the section https://www.qubes-os.org/doc/installation-troubleshooting/#warning-dracut-initqueue-timeout---starting-timeout-scripts-during-installation isn't helpful, as it refers to the initial installation boot, not the first boot after installation, where the VMs are configured).

ghost commented 2 years ago

Try with a single disk first:

cryptsetup -c aes-xts-plain64 -h sha512 -s 512 luksFormat /dev/sda2
cryptsetup luksOpen /dev/sda2 luks
pvcreate /dev/mapper/luks
vgcreate qubes_dom0 /dev/mapper/luks
lvcreate -n swap -L 4G qubes_dom0
lvcreate -T -L 30G qubes_dom0/root-pool
lvcreate -T -l +95%FREE qubes_dom0/vm-pool
lvs
lvcreate -V30G -T qubes_dom0/root-pool -n root
lvcreate -V<size of your vm pool>G -T qubes_dom0/vm-pool -n vm
mkfs.ext4 /dev/qubes_dom0/vm

https://forum.qubes-os.org/t/4-1-installer-lvm-partitioning-hard-to-customize-missing-space/6155/5

sjvudp commented 2 years ago

Is https://github.com/QubesOS/qubes-issues/issues/7335#issuecomment-1079844550 suggesting to unplug all the other disks? I don't get it: the installer completed the installation up to the first boot, but the first boot fails to continue. I don't think my initial Qubes OS disk setup is the problem.

DemiMarie commented 2 years ago

Would it be possible to get a backtrace from the core dump?

sjvudp commented 2 years ago

Would it be possible to get a backtrace from the core dump?

By default the segfault does not create a core dump. I'm not deep enough into this: is it possible to enable core dumps in the initrd via some kernel command-line boot parameter?

DemiMarie commented 2 years ago

Would it be possible to get a backtrace from the core dump?

By default the segfault does not create a core dump. I'm not deep enough into this: is it possible to enable core dumps in the initrd via some kernel command-line boot parameter?

I’m not sure, but it is probably in some documentation.

DemiMarie commented 2 years ago

@sjvudp can you try systemd.mask=lvm2-monitor.service? That prevents the service from being loaded.

ghost commented 2 years ago

No, just create the partitions like that and follow the commands above. A step-by-step guide with pictures is here: https://forum.qubes-os.org/t/qubes-os-installation-detached-encrypted-boot-and-header/6205

Additional storage can be added after the install, or you could reinstall and add another device to the volume group.

sjvudp commented 2 years ago

@sjvudp can you try systemd.mask=lvm2-monitor.service? That prevents the service from being loaded.

That did it! However, even after installing all updates, I still need that parameter on every boot!

sjvudp commented 2 years ago

There is also some strange effect when shutting down: it waits for some LVM deactivation for a rather long time, delaying the shutdown or reboot.
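One way to see what exactly the shutdown is waiting for would be to check the previous boot's shutdown messages after the next startup; just a sketch:

journalctl -b -1 -r | grep -i -e "stop job" -e lvm   # previous boot's shutdown messages mentioning stop jobs or LVM
systemctl show -p DefaultTimeoutStopUSec             # the timeout systemd applies to hanging stop jobs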

sjvudp commented 2 years ago

When running a debug shell on tty9 during boot (via "systemd.debug_shell=1"), I realized that a "vgchange --monitor y" seems to hang. However, when I enter that command manually, I get a syntax error (it wants a VG name). When I use "vgchange --monitor y qubes_dom0", the command exits very quickly. So it looks to me as if systemd runs some incorrect command.
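To see what systemd actually runs there, something like this should work from the debug shell (a sketch, assuming standard systemd tooling in dom0):

systemctl cat lvm2-monitor.service                # unit file(s) including the ExecStart line
systemctl show -p ExecStart lvm2-monitor.service  # the exact command systemd executes
journalctl -b -u lvm2-monitor.service             # what the unit logged during this boot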

(screen photo attached)

Eric678 commented 1 year ago

Also having this problem after an in-place upgrade from 4.0 to 4.1. Everything seemed to go fine (except for the Debian qubes); the first boot after the upgrade goes into an infinite wait for lvm2. Unfortunately GRUB appears not to be working correctly and I cannot use the workaround mentioned above. I tried disconnecting a couple of disk drives that are a BIOS RAID1 (mirrored) and not part of Qubes, and the 4.1 system comes up fine; I'm guessing lvm2-monitor is looking in the wrong place? Any suggestions how I might work around this to get that RAID array up again? Thanks.

sjvudp commented 1 year ago

I cannot use the workaround mentioned above.

Well, if you cannot do it interactively, you could do it the hard way: Boot some rescue system, mount the /boot filesystem, then edit the GRUB menu file (something like /boot/grub2/grub.cfg, looking for "linux" lines).
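Roughly something like this (only a sketch; /dev/sdg1 is the /boot partition from my output above, and on my system the dom0 kernel parameters are on the "module2 ... vmlinuz" line of the Xen boot entry):

mkdir -p /mnt/qubes-boot
mount /dev/sdg1 /mnt/qubes-boot
vi /mnt/qubes-boot/grub2/grub.cfg   # append systemd.mask=lvm2-monitor.service to the dom0 kernel line
umount /mnt/qubes-boot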

sjvudp commented 1 month ago

Well, let me update this: the issue still exists on a new installation (with new partitioning) of Qubes OS 4.2.2 (the same fix still works, however). The problem occurs after switching root and luksOpen of the PV. There seems to be a related issue on shutdown, but that has a timeout of 1m30s; not nice, but I can wait for that (the boot hang has no timeout, so it waits forever). Another note for the impatient: don't turn off power or press reset while shutdown isn't finished; otherwise the RAID might be "verified" on the next boot, and that takes many hours while read performance drops drastically...

rastababy commented 4 weeks ago

Hello, I have the same issue after running the distro upgrade from Qubes 4.1 to 4.2. Every time I boot I get the error and the infinite "a start job is running for lvm2-monitor.service". I also tried to install several Qubes 4.2 ISOs (weekly editions as well), and unfortunately that way I always end up with the "Pane is dead" error. I already tried the complete troubleshooting guide, but had no luck, because I have an Nvidia 980M SLI, an old MBR/legacy BIOS, and a second SSD with Win11 installed. Sorry, that part is off-topic and I need to create a post in the Qubes community forum. But I am not very strong in IT, so my only hope to get Qubes 4.2 running again is the in-place upgrade together with the hint to mask the service with systemd.mask=lvm2-monitor.service. However, I don't know how to mask it. I cannot enter tty2, because the boot hangs once that job is running; maybe in the few seconds before that progress starts I could enter it. But I would need a dom0 terminal, and tty isn't right, is it? Can you please explain where exactly (for example in the GRUB edit section, or on tty2) I can mask this service before booting? If it works I will update here, of course. Thank you very much

sjvudp commented 4 weeks ago

Hi!

You must interrupt the GRUB boot manager (e.g. by moving the cursor), then press "e" over the boot entry to edit it. You can then edit the boot entry, but that change is not permanent. After editing, boot that entry (e.g. press F10). Once booted, you can edit the GRUB boot menu on disk to make the change persistent (see the sketch below). You just have to add the additional parameter at the correct place.
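For the persistent change, a minimal sketch, assuming a default Fedora-based dom0 with GRUB2 and a legacy-BIOS layout (adapt paths if your setup differs):

sudo vi /etc/default/grub                     # append systemd.mask=lvm2-monitor.service to GRUB_CMDLINE_LINUX
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # regenerate the boot menu from the new defaults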

I'm just wondering whether Qubes OS or Fedora introduced that bug.

See also https://forum.qubes-os.org/t/how-to-edit-kernel-boot-parameters-when-starting-qubes/4878 and https://xcp-ng.org/forum/topic/8092/add-kernel-boot-params-for-dom0

Regards, Ulrich

rastababy commented 3 weeks ago

Thank you so much. My problem was that I did not know where to enter systemd.mask=lvm2-monitor.service, or whether I also had to press Ctrl+C to get to the GRUB command line. In the end I pressed "e" on the first boot entry under the advanced boot options, entered the parameter at the beginning of the GRUB entry, pressed F10, and Qubes finally boots up :) That was the second thing I did not know, whether to enter the parameter at the end or at the beginning; for everybody with the same problem in the future: I wrote it at the beginning, followed sjvudp's hints, and everything is working. Now I need to check why Qubes 4.2 is the first ISO that I cannot install directly from USB because of the "Pane is dead" error, but I hope I don't need to set up my laptop again, so everything should work. By the way, I don't know which Fedora version Qubes 4.3 alpha uses, but to be sure I downloaded that ISO too, tried it, and got the same error. So maybe it is not a problem of Fedora; sorry, just a guess, I don't know. Thank you again. Stay healthy and many blessings, rasta

Now I will try to make it permanent. My idea is to edit grub.cfg (or the GRUB defaults) and add the above parameter, but I need to google a bit more where the correct place is.