dm-vdo / vdo

Userspace tools for managing VDO volumes.
GNU General Public License v2.0

systemd startup issue (Rocky Linux 8) #50

Open · hostalp opened this issue 2 years ago

hostalp commented 2 years ago

I'm encountering random startup issues that appear to be related to VDO, because they only affect hosts that use it.

The issue especially affects systemd-tmpfiles-setup.service, which sometimes doesn't start properly and ends up in the state inactive (dead). This causes problems for other services that depend on files (directories, etc.) created by systemd-tmpfiles.

I've created a small test VM where I can reproduce the issue.

It's running Rocky Linux 8.5 (a RHEL 8.5 clone) on Proxmox VE with 2 disks attached as Virtio SCSI. The 1st disk is used by the OS (boot + root). The 2nd disk is VDO on top of LVM, mounted as follows (based on the RHEL 8 docs):

/dev/mapper/test_vdo     /test     xfs     defaults,noatime,logbsize=128k,x-systemd.device-timeout=0,x-systemd.requires=vdo.service     0 0

I simply reboot that VM multiple times and see that the issue occurs approx. 80% of the time, while in the remaining approx. 20% of cases the system starts just fine.

I've found out that adding the mount option _netdev to the VDO-backed mount appears to result in a successful start every time. However, I consider this only a workaround, because it shouldn't be necessary to use this option with local disks.

I collected some initial information and attached it as vdotest-logs1.zip. It covers 3 scenarios:

- OK - the system starts properly (no configuration changes compared to the KO scenario)
- KO - the system doesn't start properly; the units systemd-tmpfiles-setup.service, systemd-update-utmp.service and auditd.service appear to be affected (no configuration changes compared to the OK scenario)
- OK-netdev - the only configuration change is the addition of the mount option _netdev, with which the system appears to start properly every time

rhawalsh commented 2 years ago

Hi @hostalp,

> I've found out that adding the mount option _netdev to the VDO-backed mount appears to result in a successful start every time. However, I consider this only a workaround, because it shouldn't be necessary to use this option with local disks.

Yes, I agree that _netdev is not the right tool for the job. I've typically been able to get by with specifying only x-systemd.requires=vdo.service, without any other options being necessary... unless we're talking about stacked devices, in which case it can get a little messy to get the right dependencies in the right place for the right units.
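For reference, a minimal fstab entry along those lines, reusing the device name and mount point from this report, would be:

/dev/mapper/test_vdo     /test     xfs     defaults,x-systemd.requires=vdo.service     0 0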

> I collected some initial information and attached it as vdotest-logs1.zip. It covers 3 scenarios:
>
> - OK - the system starts properly (no configuration changes compared to the KO scenario)
> - KO - the system doesn't start properly; the units systemd-tmpfiles-setup.service, systemd-update-utmp.service and auditd.service appear to be affected (no configuration changes compared to the OK scenario)
> - OK-netdev - the only configuration change is the addition of the mount option _netdev, with which the system appears to start properly every time

Please allow me some time to review your logs to see if anything stands out to me.

As an aside, starting with version vdo-6.2.3.91-14.el8, we introduced systemd instantiated services with udev triggers that enable you to specify a VDO volume in /etc/fstab with only default mount options. There may be a bug with it when multiple VDO volumes are present (I have not personally been able to reproduce it). It may be worth looking at.
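To see exactly what a given vdo package version ships for this, you can list its systemd units and udev rules; the sketch below uses only standard rpm and grep calls, and the specific unit and rule names it turns up will vary by version:

rpm -ql vdo | grep -E 'systemd|udev'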

I see, just quickly browsing the "OK" logfile messages_OK, that you seem to be running 6.2.5.72. I might suggest dropping all of the various mount options in your /etc/fstab and using only defaults:

/dev/mapper/test_vdo /test xfs defaults 0 0

I will post here if I notice anything out of place in the tests that don't work as you expect.

hostalp commented 2 years ago

> As an aside, starting with version vdo-6.2.3.91-14.el8, we introduced systemd instantiated services with udev triggers that enable you to specify a VDO volume in /etc/fstab with only default mount options. There may be a bug with it when multiple VDO volumes are present (I have not personally been able to reproduce it). It may be worth looking at.

> I see, just quickly browsing the "OK" logfile messages_OK, that you seem to be running 6.2.5.72. I might suggest dropping all of the various mount options in your /etc/fstab and using only defaults:

> /dev/mapper/test_vdo /test xfs defaults 0 0

I tried this approach and it looks good so far - the VDO volume mounts and there are no startup issues even after 20 restarts.

What I noticed, however, are much longer (by about 90 seconds) shutdown times. This doesn't happen if I unmount the VDO volume manually before a restart/shutdown.
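For reference, the manual workaround amounts to something like the following before a reboot (mount point /test as in the fstab entry above):

umount /test
systemctl reboot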

rhawalsh commented 2 years ago

Try to start and enable the blk-availability.service (systemctl enable --now blk-availability.service). That should address the shutdown issues after the first reboot.

We try to enable it via the systemd presets when installing the vdo package, but that doesn't always seem to take effect.
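A quick way to check whether the preset took effect, and to fix it up if it didn't (the second command is the same one suggested above; is-enabled is a standard systemctl query):

systemctl is-enabled blk-availability.service
systemctl enable --now blk-availability.service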

hostalp commented 2 years ago

Yes, that helped and the shutdown time is now back to normal, so I'm going to use this new approach.

If this is going to become the preferred method of mounting VDO volumes, then it should probably be added to the RHEL docs.

The "older" approach is still valid, correct? Then I guess it would still deserve to be checked for the issue reported here.

rhawalsh commented 2 years ago

Hi @hostalp

> Yes, that helped and the shutdown time is now back to normal, so I'm going to use this new approach.

Great, I'm glad that was helpful.

> If this is going to become the preferred method of mounting VDO volumes, then it should probably be added to the RHEL docs.

Thank you for mentioning this. I have raised an issue with our docs team to get this addressed.

The "older" approach is still valid, correct? Then I guess it would still deserve to be checked for the issue reported here.

Yes. The vdo.service still gets installed and enabled, and is intended as a fail-safe: it simply runs vdo start --all, which becomes a no-op if all VDO volumes are already started. Ultimately, in my experience, the only mount option you should really need when you don't have the updated behavior in place is x-systemd.requires=vdo.service, plus anything else required to ensure the volume starts after whatever devices it depends on are present. The rest of the options tend to make failures easier to recover from (and things like _netdev just ensure that you've reached a point in the boot order where you can be reasonably confident that vdo start --all has already executed).
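As a sketch of that fail-safe path, the following inspects the unit and runs the same command the service would (vdo start --all is the command named above; it is a no-op when every VDO volume is already started):

systemctl status vdo.service
vdo start --all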