dm-vdo / vdo-devel

Primary VDO mainline development repository
GNU General Public License v2.0

Timed out waiting for VDO device #3

Open tigerblue77 opened 1 year ago

tigerblue77 commented 1 year ago

Hello, I run Debian 11(.6) bare-metal with 2 LVM VDO volumes. Everything was working fine, but now my operating system fails to boot because it reaches a timeout waiting for one of the VDO volumes (not the biggest one, by the way). It drops into emergency mode; I type the root password, run `mount -a && exit`, the volume mounts like a charm, and the system starts without any problem. I didn't make any big change on this side, so I don't know what could be wrong nor how to investigate.

Here are my boot logs from journalctl; I only cut the start and the end, nothing in the middle. I separated the error lines to make them clearer:

[...]
mars 18 19:12:54 Ultron systemd[1]: Found device /dev/mapper/Ultron--vg-swap_1.
mars 18 19:12:54 Ultron systemd-udevd[492]: Using default interface naming scheme 'v247'.
mars 18 19:12:54 Ultron systemd-udevd[492]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
mars 18 19:12:54 Ultron systemd[1]: Activating swap /dev/mapper/Ultron--vg-swap_1...
mars 18 19:12:54 Ultron kernel: Adding 999420k swap on /dev/mapper/Ultron--vg-swap_1.  Priority:-2 extents:1 across:999420k FS
mars 18 19:12:54 Ultron systemd[1]: Activated swap /dev/mapper/Ultron--vg-swap_1.
mars 18 19:12:54 Ultron systemd[1]: Reached target Swap.
mars 18 19:12:57 Ultron systemd[1]: Finished Wait for udev To Complete Device Initialization.
mars 18 19:12:57 Ultron systemd[1]: Starting LVM direct activation of logical volumes...
mars 18 19:12:57 Ultron kernel: kvdo: loading out-of-tree module taints kernel.
mars 18 19:12:57 Ultron kernel: kvdo: module verification failed: signature and/or required key missing - tainting kernel
mars 18 19:12:57 Ultron kernel: kvdo: modprobe: loaded version 8.2.1.3
mars 18 19:12:57 Ultron kernel: kvdo0:lvm: Detected version mismatch between kernel module and tools kernel: 4, tool: 2
mars 18 19:12:57 Ultron kernel: kvdo0:lvm: Please consider upgrading management tools to match kernel.
mars 18 19:12:57 Ultron kernel: kvdo0:lvm: loading device '254:3'
mars 18 19:12:57 Ultron kernel: kvdo0:lvm: zones: 6 logical, 1 physical, 6 hash; total threads: 42
mars 18 19:13:00 Ultron kernel: kvdo0:lvm: starting device '254:3'
mars 18 19:14:17 Ultron kernel: kvdo0:journalQ: Setting UDS index target state to online
mars 18 19:14:17 Ultron kernel: kvdo0:dedupeQ: loading or rebuilding index: /dev/dm-2
mars 18 19:14:17 Ultron kernel: kvdo0:dedupeQ: Using 12 indexing zones for concurrency.
mars 18 19:14:17 Ultron kernel: kvdo0:lvm: device '254:3' started
mars 18 19:14:17 Ultron kernel: kvdo0:lvm: resuming device '254:3'
mars 18 19:14:17 Ultron kernel: kvdo0:lvm: device '254:3' resumed
mars 18 19:14:17 Ultron kernel: kvdo0:lvm: Setting UDS index target state to offline
mars 18 19:14:17 Ultron systemd[1]: Started Device-mapper event daemon.
mars 18 19:14:17 Ultron dmeventd[764]: dmeventd ready for processing.
mars 18 19:14:17 Ultron lvm[764]: Monitoring VDO pool Ultron--vg-ZEROED--VDO--POOL--1-vpool.
mars 18 19:14:17 Ultron kernel: kvdo1:lvm: Detected version mismatch between kernel module and tools kernel: 4, tool: 2
mars 18 19:14:17 Ultron kernel: kvdo1:lvm: Please consider upgrading management tools to match kernel.
mars 18 19:14:17 Ultron kernel: kvdo1:lvm: loading device '254:6'
mars 18 19:14:17 Ultron systemd[1]: Found device /dev/Ultron-vg/ZEROED-VDO-LV-1.
mars 18 19:14:17 Ultron kernel: kvdo1:lvm: zones: 6 logical, 1 physical, 6 hash; total threads: 42
mars 18 19:14:18 Ultron kernel: kvdo0:physQ0: VDO commencing normal operation
mars 18 19:14:18 Ultron kernel: kvdo0:dedupeQ: loaded index from chapter 0 through chapter 15
mars 18 19:14:21 Ultron kernel: kvdo1:lvm: starting device '254:6'
mars 18 19:14:22 Ultron systemd[1]: dev-Ultron\x2dvg-COMPRESSED\x2dDEDUPLICATED\x2dVDO\x2dLV\x2d1.device: Job dev-Ultron\x2dvg-COMPRESSED\x2dDEDUPLICATED\x2dVDO\x2dLV\x2d1.device/start timed out.
mars 18 19:14:22 Ultron systemd[1]: Timed out waiting for device /dev/Ultron-vg/COMPRESSED-DEDUPLICATED-VDO-LV-1.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for /mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Local File Systems.
mars 18 19:14:22 Ultron systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for /tmp.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Network Time Synchronization.
mars 18 19:14:22 Ultron systemd[1]: systemd-timesyncd.service: Job systemd-timesyncd.service/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: tmp.mount: Job tmp.mount/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for /var.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Platform Persistent Storage Archival.
mars 18 19:14:22 Ultron systemd[1]: systemd-pstore.service: Job systemd-pstore.service/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Load AppArmor profiles.
mars 18 19:14:22 Ultron systemd[1]: apparmor.service: Job apparmor.service/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Flush Journal to Persistent Storage.
mars 18 19:14:22 Ultron systemd[1]: systemd-journal-flush.service: Job systemd-journal-flush.service/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Load/Save Random Seed.
mars 18 19:14:22 Ultron systemd[1]: systemd-random-seed.service: Job systemd-random-seed.service/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Update UTMP about System Boot/Shutdown.
mars 18 19:14:22 Ultron systemd[1]: Dependency failed for Update UTMP about System Runlevel Changes.
mars 18 19:14:22 Ultron systemd[1]: systemd-update-utmp-runlevel.service: Job systemd-update-utmp-runlevel.service/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: systemd-update-utmp.service: Job systemd-update-utmp.service/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: var.mount: Job var.mount/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: mnt-COMPRESSED\x2dDEDUPLICATED\x2dVDO\x2dLV\x2d1.mount: Job mnt-COMPRESSED\x2dDEDUPLICATED\x2dVDO\x2dLV\x2d1.mount/start failed with result 'dependency'.
mars 18 19:14:22 Ultron systemd[1]: dev-Ultron\x2dvg-COMPRESSED\x2dDEDUPLICATED\x2dVDO\x2dLV\x2d1.device: Job dev-Ultron\x2dvg-COMPRESSED\x2dDEDUPLICATED\x2dVDO\x2dLV\x2d1.device/start failed with result 'timeout'.
[...]

And here is my fstab file :

# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# systemd generates mount units based on this file, see systemd.mount(5).
# Please run 'systemctl daemon-reload' after making changes here.
#
# <file system>                                 <mount point>                           <type>  <options>                                 <dump>   <pass>
/dev/mapper/Ultron--vg-root                     /                                       ext4    errors=remount-ro                         0        1
# /boot was on /dev/sdb2 during installation
UUID=aafc24cb-e044-48e1-b399-be893c411fc4       /boot                                   ext2    defaults                                  0        2
# /boot/efi was on /dev/sdb1 during installation
UUID=CE05-25BD                                  /boot/efi                               vfat    umask=0077                                0        1
/dev/mapper/Ultron--vg-swap_1                   none                                    swap    sw                                        0        0
#/dev/Ultron-vg/ZEROED-VDO-LV-1                 /mnt/ZEROED-VDO-LV-1                    ext4    rw,suid,dev,exec,auto,nouser,sync,discard 00
/dev/Ultron-vg/ZEROED-VDO-LV-1 /mnt/ZEROED-VDO-LV-1 ext4 rw,suid,dev,exec,auto,nouser,sync 0 0
#/dev/Ultron-vg/COMPRESSED-DEDUPLICATED-VDO-LV-1        /mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1   ext4    rw,suid,dev,exec,auto,nouser,sync,discard                                          0       0
/dev/Ultron-vg/COMPRESSED-DEDUPLICATED-VDO-LV-1 /mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1 ext4 rw,suid,dev,exec,auto,nouser,sync 0 0
/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1/home      /home                                   none    x-systemd.requires=/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1,x-systemd.automount,bind  0       0
/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1/tmp       /tmp                                    none    x-systemd.requires=/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1,x-systemd.automount,bind  0       0
/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1/var       /var                                    none    x-systemd.requires=/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1,x-systemd.automount,bind  0       0
sweettea commented 1 year ago

Oooh, that does look awful, I'm sorry; I didn't realize before replying that this was a GitHub issue and not an email thread. The important parts are below:

It looks to me like LVM might be starting the second VDO only after the first, not in parallel, and the first takes long enough to time the second one out. If you look at the log messages, the first VDO took 1:20 to start, and systemd decided the device had failed at 1:25.

You might consider raising the timeout systemd uses before marking the unit as failed: create /etc/systemd/system/lvm2-activation.service.d/10-timeout.conf and make it contain

```
[Service]
TimeoutSec=4m
```

in order to bump the timeout to 4 minutes. (I think this is the right unit, but you might also need to do the same for lvm2-activation-early.service and lvm2-activation-net.service.)
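
A minimal sketch of installing such a drop-in, assuming the unit really is named lvm2-activation.service on this Debian install (the file is staged in a temp directory first; the commented commands then install it for real and need root):

```shell
# Stage the drop-in in a temp dir, then install it into place as root.
stage="$(mktemp -d)"
printf '[Service]\nTimeoutSec=4m\n' > "$stage/10-timeout.conf"
cat "$stage/10-timeout.conf"
# As root, on the real system:
#   install -D -m 0644 "$stage/10-timeout.conf" \
#       /etc/systemd/system/lvm2-activation.service.d/10-timeout.conf
#   systemctl daemon-reload
```

Drop-in files under `<unit>.service.d/` override only the keys they set, so the rest of the packaged unit stays untouched.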

Hopefully this works for you!

tigerblue77 commented 1 year ago

Hello @sweettea, Thanks for reformatting your message ! 😁

tigerblue77 commented 1 year ago

Hello @sweettea, I understood your suggestion to increase the LVM timeout, but I'm not a big fan of it, as I want my boot to be as fast as possible and therefore to parallelize everything possible. Following your answer, here are the questions that came to my mind:

sweettea commented 1 year ago

I understood your suggestion to increase the LVM timeout, but I'm not a big fan of it, as I want my boot to be as fast as possible and therefore to parallelize everything possible. Following your answer, here are the questions that came to my mind:

* Why is my LVM not working in parallel? Is it possible to change this setting?

I believe there's an lvm.conf setting, global/event_activation, which, if set to 1, will use event-based startup and might activate volumes in parallel. I'm not sure how this interacts with VDO, not having worked on VDO for a year, but it's possibly something to try. Also, I don't use Debian, so I don't know for sure whether its version of lvm supports event_activation.
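
For reference, that option lives in the `global` section of `/etc/lvm/lvm.conf`; whether Debian 11's lvm2 build honors it is something to confirm against `lvm.conf(5)`:

```
# /etc/lvm/lvm.conf -- sketch; verify against lvm.conf(5) on Debian 11
global {
    # 1 = activate LVs from udev events (potentially in parallel),
    # 0 = direct activation at boot via the lvm2-activation services
    event_activation = 1
}
```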

* Why did it stop working from one day to the next? (I know you have no answer, of course)

Different recovery time -- it could have been caused in any number of ways; perhaps there was more disk traffic or memory pressure this time, or maybe this boot had to do more recovery, or ...

* Can the size of the volumes be the problem? My `ZEROED-VDO-LV` is 15.3 TB and my `COMPRESSED-DEDUPLICATED-VDO-LV` is 1.2 TB. Maybe I should reorder the fstab entries so that `/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1`, which has bind-mount dependencies, is mounted first? (if LVM parallelization is not possible)

LVM startup order is independent of fstab order.

* Maybe I have some mount dependency errors/incoherences?

  * Does my `fstab` look **OK** to you?
    You shouldn't see anything exotic in my configuration:

    * 3 mounts (`/`, `/mnt/ZEROED-VDO-LV-1` and `/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1`)
    * 3 bind mounts on the `/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1` volume (as creating multiple VDO data LVs on the same VDO pool is not possible at this time)
  * Does my `systemd` dependency tree look **OK** to you? (I didn't create any entries manually)
    Here is partial output of the `systemctl list-dependencies` command (I added `<===` to the lines that should matter, IMO):
    ```
    [...]
    ●   ├─smbd.service
    ●   ├─ssh.service
    ●   ├─systemd-ask-password-wall.path
    ●   ├─systemd-logind.service
    ●   ├─systemd-update-utmp-runlevel.service
    ●   ├─systemd-user-sessions.service
    ●   ├─basic.target
    ●   │ ├─-.mount                                                    <===
    ●   │ ├─tmp.mount                                                  <===
    ●   │ ├─var.mount                                                  <===
    ●   │ ├─paths.target
    ●   │ ├─slices.target
    ●   │ │ ├─-.slice
    ●   │ │ └─system.slice
    ●   │ ├─sockets.target
    ●   │ │ ├─dbus.socket
    ●   │ │ ├─dm-event.socket
    ●   │ │ ├─docker.socket
    ●   │ │ ├─systemd-initctl.socket
    ●   │ │ ├─systemd-journald-audit.socket
    ●   │ │ ├─systemd-journald-dev-log.socket
    ●   │ │ ├─systemd-journald.socket
    ●   │ │ ├─systemd-udevd-control.socket
    ●   │ │ └─systemd-udevd-kernel.socket
    ●   │ ├─sysinit.target
    ●   │ │ ├─apparmor.service
    ●   │ │ ├─blk-availability.service
    ●   │ │ ├─dev-hugepages.mount
    ●   │ │ ├─dev-mqueue.mount
    ●   │ │ ├─keyboard-setup.service
    ●   │ │ ├─kmod-static-nodes.service
    ●   │ │ ├─lvm2-lvmpolld.socket
    ●   │ │ ├─lvm2-monitor.service
    ●   │ │ ├─proc-sys-fs-binfmt_misc.automount
    ●   │ │ ├─sys-fs-fuse-connections.mount
    ●   │ │ ├─sys-kernel-config.mount
    ●   │ │ ├─sys-kernel-debug.mount
    ●   │ │ ├─sys-kernel-tracing.mount
    ●   │ │ ├─systemd-ask-password-console.path
    ●   │ │ ├─systemd-binfmt.service
    ●   │ │ ├─systemd-boot-system-token.service
    ●   │ │ ├─systemd-hwdb-update.service
    ●   │ │ ├─systemd-journal-flush.service
    ●   │ │ ├─systemd-journald.service
    ●   │ │ ├─systemd-machine-id-commit.service
    ●   │ │ ├─systemd-modules-load.service
    ●   │ │ ├─systemd-pstore.service
    ●   │ │ ├─systemd-random-seed.service
    ●   │ │ ├─systemd-sysctl.service
    ●   │ │ ├─systemd-sysusers.service
    ●   │ │ ├─systemd-timesyncd.service
    ●   │ │ ├─systemd-tmpfiles-setup-dev.service
    ●   │ │ ├─systemd-tmpfiles-setup.service
    ●   │ │ ├─systemd-udev-trigger.service
    ●   │ │ ├─systemd-udevd.service
    ●   │ │ ├─systemd-update-utmp.service
    ●   │ │ ├─cryptsetup.target
    ●   │ │ ├─local-fs.target                                          <===
    ●   │ │ │ ├─-.mount                                                <===
    ●   │ │ │ ├─boot-efi.mount
    ●   │ │ │ ├─boot.mount
    ●   │ │ │ ├─home.automount                                         <===
    ●   │ │ │ ├─mnt-COMPRESSED\x2dDEDUPLICATED\x2dVDO\x2dLV\x2d1.mount <===
    ●   │ │ │ ├─mnt-ZEROED\x2dVDO\x2dLV\x2d1.mount                     <===
    ●   │ │ │ ├─systemd-fsck-root.service
    ●   │ │ │ ├─systemd-remount-fs.service
    ●   │ │ │ ├─tmp.automount                                          <===
    ●   │ │ │ └─var.automount                                          <===
    ●   │ │ └─swap.target
    ●   │ │   └─dev-mapper-Ultron\x2d\x2dvg\x2dswap_1.swap
    [...]
    ```

    I suppose the 3 `.automount` entries correspond to the 3 `fstab` bind mounts. But why do I have `tmp.mount` and `var.mount` entries, and no `home.mount` entry? That seems strange to me. Maybe it's left over from the initial installation? (before the creation/addition of the VDO volumes)

Note that for tmp/var/root there's both a mount unit and an automount unit -- systemd automatically activates the mount unit the first time something accesses the automount's path (and can unmount it again if it's inactive for a while). (Or maybe it's somehow a tmpfs from the initrd? I'm not sure exactly how systemd handles that.)
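
For illustration, this is roughly what `systemd-fstab-generator` derives from, e.g., the `/tmp` bind entry in the fstab above -- an abridged sketch only; the real generated units live under `/run/systemd/generator/` and carry more dependencies:

```
# tmp.automount (sketch)
[Automount]
Where=/tmp

# tmp.mount (sketch) -- activated on first access to /tmp
[Mount]
What=/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1/tmp
Where=/tmp
Type=none
Options=bind
```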

And for /home, you might be seeing that auto-unmounting, since there's an automount unit but no mount unit... you could run `systemctl status home.mount` and see how it looks.

Hope this helps a little -- this is definitely bumping at the edges of my systemd knowledge, sorry :)

tigerblue77 commented 1 year ago

I believe there's an lvm.conf setting, global/event_activation, which if set to 1 will use event-based startup and might do so in parallel. I'm not sure how this works with VDO, having not worked on VDO for a year, but it's possibly something to try. Also, I don't use Debian and don't know for sure if their version of lvm supports event_activation.

Setting event_activation = 1 doesn't work for me, as it doesn't activate the VDO volumes, so I set it to 0. Maybe that's not the right way to do it, but I opened this issue about it.

```
cat /etc/lvm/lvm.conf | grep event_activation
        # Configuration option global/event_activation.
        # When event_activation is disabled, the system will generally run
        # event_activation = 1
        event_activation = 0
```

Different recovery time -- could have been caused any number of ways, perhaps there was more disk traffic or memory pressure this time, or maybe this one had to do more recovery, or...

Very strange, because now, after upgrading to Debian 11.7, it's fixed without any change on my side... :( Sorry to anyone who runs into the same problem and doesn't find a clear solution here.

LVM startup order is independent of fstab order.

Okay thanks

Note that for tmp/var/root there's both a mount unit and an automount unit -- systemd automatically activates the mount unit the first time something accesses the automount's path (and can unmount it again if it's inactive for a while). (Or maybe it's somehow a tmpfs from the initrd? I'm not sure exactly how systemd handles that.) And for /home, you might be seeing that auto-unmounting, since there's an automount unit but no mount unit... you could run `systemctl status home.mount` and see how it looks. Hope this helps a little -- this is definitely bumping at the edges of my systemd knowledge, sorry :)

Thanks for your help! I think it's not a VDO problem then.

tigerblue77 commented 1 year ago

Hello @sweettea, I'm reopening this issue because the problem has "recurred" (I think it never actually went away). I applied your suggestion to increase the LVM timeout, although I'm not a big fan of it. I think the problem lies elsewhere and that the LVM VDO volumes should mount in parallel (if possible). Also, a mount time that reaches the LVM timeout on professional hardware indicates, I think, an underlying problem with VDO/KVDO.

On my Debian setup there was no /etc/systemd/system/lvm2-activation.service.d/10-timeout.conf file, so I did some searching, followed this SUSE topic, and added `x-systemd.device-timeout=600` to each of my VDO mounts in /etc/fstab. It seems to be a working fix (though not a "good" fix, IMHO).
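
For anyone hitting the same timeout: with that option added, the two VDO entries in /etc/fstab from earlier in this thread would look roughly like this (600 seconds is the value used above; adjust to taste):

```
/dev/Ultron-vg/ZEROED-VDO-LV-1                  /mnt/ZEROED-VDO-LV-1                  ext4 rw,suid,dev,exec,auto,nouser,sync,x-systemd.device-timeout=600 0 0
/dev/Ultron-vg/COMPRESSED-DEDUPLICATED-VDO-LV-1 /mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1 ext4 rw,suid,dev,exec,auto,nouser,sync,x-systemd.device-timeout=600 0 0
```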