dm-vdo / kvdo

A kernel module which provides a pool of deduplicated and/or compressed block storage.
GNU General Public License v2.0

Migrating to the in-kernel KVDO, or "Recovery journal is in the old format" #85

Open thememika opened 7 months ago

thememika commented 7 months ago

After reading the 6.9-rc1 update notice from Linus, I was excited to learn that KVDO has finally been merged into the Linux kernel. Unfortunately, I'm not able to use any of my production VDO devices with it. The recovery journal is in the old format, and as far as I can tell, there is no way to bring the devices up for r/w without re-creating them and copying terabytes of data.

[ 5813.749275] device-mapper: vdo: dm-vdo3:journal: Recovery journal is in the old format, a read-only rebuild is required.: VDO Status: Unsupported component version (1471)

I don't understand why you changed the journal format without providing conversion code for it. What is the reason behind that? I can't use the in-kernel KVDO because of it. Is it true that everyone will now have to re-create their VDO devices from scratch and copy all the data over?

thememika commented 7 months ago

Sorry, the fix was to first replay the journal on a machine with the old version of KVDO (and shut the VDO down cleanly). Thanks for the update and the integration with the Linux kernel!

thememika commented 7 months ago

Unfortunately, I have to reopen the issue because there is likely a bug related to the new (in-kernel) version of KVDO. More specifically, once you attempt to bring up a dirty VDO device that uses the old journal format and get the "Unsupported component" error, it is no longer possible to correctly replay the journal, even with the old (out-of-tree) KVDO. The device will always be treated as clean, regardless of whether it is actually dirty. That leads to read-write access to a dirty device without any replay, which in turn results in errors very soon. Example with one of my devices:

[  946.149687] xfs filesystem being mounted at /******* supports timestamps until 2038-01-19 (0x7fffffff)
...
[ 1038.482749] device-mapper: vdo: dm-vdo1:cpuQ0: Completing read vio for LBN 65663121 with error after read_data_vio: VDO Status: Compressed block fragment is invalid (1483)
[ 1038.482773] device-mapper: vdo: dm-vdo1:cpuQ0: vdo_status_to_errno: mapping internal status code 1483 (VDO_INVALID_FRAGMENT: VDO Status: Compressed block fragment is invalid) to EIO
...
[ 1038.482930] XFS (dm-9): metadata I/O error in "xfs_btree_read_buf_block+0xb7/0x160" at daddr 0x1f31b328 len 8 error 5

I have just lost two of my devices that way, and as I see it, the only option now is a read-only rebuild, copying to another (R/W) block device, and then a hard fsck, most likely with moderate data loss.

Other devices, which were not dirty back when I was on the old KVDO, are operating correctly and stably.

thememika commented 7 months ago

One more thing to note: the VDO stats are broken (in-kernel KVDO; the names of my devices are replaced with "****"):

$ su -c vdostats
Device                   1k-blocks      Used Available Use% Space saving%
****         0         0         0 -2147483648%            0%
****         0         0         0 -2147483648%            0%
****         0         0         0 -2147483648%            0%
****         0         0         0 -2147483648%            0%
****         0         0         0 -2147483648%            0%
****         0         0         0 -2147483648%            0%

While we can live without the userspace VDO tools for some time, I believe the issue described in the post above, affecting dirty devices with the old journal format, is severe. It poses an unpredictable risk to the correctness of devices, especially for users who were not warned about it in any way.

thememika commented 7 months ago

Fix for the broken devices

Sorry again, it turned out that the broken dirty devices were easily fixable by a forced "rebuild". They are now R/W, give no errors, and most (if not all) of the data appears safe and untouched.

To rebuild a VDO device, you first need to stop it (remove the devmapper entry). Then I used the tools from vdo-devel, which contain the ./src/c++/vdo/bin/vdoforcerebuild binary once built. I executed that binary with the path to my VDOs' physical devices. As far as I understand, it just sets a flag requesting a rebuild; it doesn't rebuild anything by itself. Then, when you bring up the devices again, it may take a significant amount of time, but afterwards the device is absolutely OK and up for R/W. Thanks!

Update: I still believe the problem described above is a bug that needs to be fixed.
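For reference, here is a minimal sketch of that recovery sequence, assuming an LVM-managed VDO pool; the vg_name/vdo_name and vpool0 names are placeholders (they match the layout used in later comments), and a plain device-mapper setup would remove the dm table entry and point vdoforcerebuild at the raw backing device instead:

$ lvchange -an vg_name/vdo_name                       # stop the VDO volume so nothing holds it open
$ vdoforcerebuild /dev/mapper/vg_name-vpool0_vdata    # mark the pool data device for a forced rebuild
$ lvchange -ay vg_name/vdo_name                       # reactivate; the read-only rebuild runs now

The rebuild only happens during activation, which is why bringing the device back up can take a long time on a large volume.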

rhawalsh commented 7 months ago

Hi @thememika, can you share which version you were using prior to upgrading to the in-kernel version? I'm glad you were able to discover the forced rebuild and get back up and running.

I'd like for us to reproduce this situation so we can better understand it and figure out what needs to be done.

Thanks, -Andy

thememika commented 7 months ago

Hi @rhawalsh, thanks for your reply! I was using this version before:

kvdo: modprobe: loaded version 8.2.3.3

After that, I moved (from linux-6.8) to linux 6.9-rc2 with its built-in KVDO.

To reproduce the issue, you can simply:

  1. Use version 8.2.3.3 (out-of-tree) with a previous kernel. Create a VDO device and keep doing active I/O to it. During that, forcibly halt the machine, so the device is left dirty.
  2. Then boot into Linux 6.9 and try to bring up the (dirty) VDO device. The device will be corrupted and won't work correctly even back on the previous kernel/setup with the old KVDO, until you do a forced rebuild. (A rough shell sketch of these steps follows below.)
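A rough shell sketch of those steps, assuming an LVM-managed test volume; the vg_test/vdo_test names and sizes are hypothetical, and the forced reboot via sysrq is the same trick used in the reproduction later in this thread:

# Step 1: on the old kernel with the out-of-tree kvdo 8.2.3.3
$ lvcreate --type vdo --size 20G --virtualsize 60G --name vdo_test vg_test
$ mkfs.xfs /dev/vg_test/vdo_test
$ mount /dev/vg_test/vdo_test /mnt/test
$ dd if=/dev/urandom of=/mnt/test/data bs=1M count=8192 oflag=direct &
$ echo b > /proc/sysrq-trigger    # crash while the I/O is in flight, leaving the journal dirty

# Step 2: reboot into Linux 6.9 and try to activate the dirty volume
$ lvchange -ay vg_test/vdo_test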

Thanks!

rhawalsh commented 7 months ago

Hi @thememika. We'll take a look at it. Thank you for the report!

Just to clarify, did you do the forced rebuild on the old (8.2.3.3) version or on the in-kernel 6.9-rc2 version?

thememika commented 7 months ago

Thanks for the attention, @rhawalsh. The forced rebuild was done on the 6.9-rc2 kernel. After that, my devices started working, so I'm staying on the new version.

rhawalsh commented 7 months ago

Hi @thememika, just wanted to give a quick update.

I ran through a few scenarios and was able to reproduce what you're seeing. I'm still playing around with things to figure out how best to document this. The bottom line, however, is to always make sure you're cleanly shutting down the VDO volume(s) before making any changes. That said, in my testing between 8.2.3.3 and the 6.9-rc2 kernel on Fedora Rawhide, I was able to repair the volume going in either direction after a dirty shutdown by using the vdoforcerebuild utility as mentioned.

My reproduction environment was done using an iSCSI target and two initiators.

I went through a few scenarios.

  1. Graceful shutdown and transfer from RHEL9->Rawhide.
  2. Graceful shutdown and transfer from Rawhide->RHEL9.
  3. Unclean shutdown (using echo b > /proc/sysrq-trigger) on RHEL9 followed by transfer to Rawhide.
  4. Unclean shutdown (using echo b > /proc/sysrq-trigger) on Rawhide followed by transfer to RHEL9.

Scenario 1:

Scenario 2 is the same as Scenario 1, but with the initiators swapped.

Scenario 3 replaced the graceful VDO stop with a forced reboot via echo b > /proc/sysrq-trigger, followed by removing the initiator's access to the LUN on the target to prevent a possible multiple-activation situation. Then Initiator 2 logged into the target, where we could see the same errors you reported previously. After running vdoforcerebuild /dev/mapper/vg_name-vpool0_vdata followed by a deactivate/activate cycle (lvchange -an /dev/vg_name/vdo_name; lvchange -ay /dev/vg_name/vdo_name), the volume resumed normal operation once the read-only rebuild completed. The remaining steps from Scenarios 1 and 2 continued from here.

Scenario 4 is the same as Scenario 3, but with Initiators swapped, once again.

rhawalsh commented 7 months ago

I did a little bit more digging into the behavior experienced here, since a basic unclean shutdown shouldn't really leave a VDO volume requiring a forced rebuild.

If I take the same setup as mentioned above, and set up a volume on a RHEL9 host, forcibly reset the host, and then let the system come back up, the VDO volume starts with an automatic recovery and no manual intervention required. It is specifically the act of trying to move the dirty volume from RHEL9 to the upstream version that causes this particular behavior.

I don't believe this is particularly a new thing with VDO volumes. We've always had an "upgrade" when moving from one version to the next. Though, I can't say that I've seen the issue where simply attempting to start a dirty volume on the new version would cause it to require a force-rebuild. That's something I need to talk with the team about a bit more.

beertje44 commented 6 months ago

First of all thank you for your efforts and work, this is just great!

On the bright side: performance seems improved :) I went from 2000 MB/s max throughput to 3500 MB/s (a non-VDO partition is still about 5000 MB/s, so I guess there is some room left for improvement). Max IOPS is about 400000, which is in line with the non-VDO partition.

On the not-so-bright side: at first everything just worked with 6.9, but after some unrelated reboots back to the old DKMS VDO and some fiddling with my system, I too faced the read-only problem mentioned above. What finally helped was, first, that I was lucky enough to have included the VDO userspace tools in my initramfs, and second, that I managed to figure out the right vdoforcerebuild command:

vdoforcerebuild /dev/mapper/neo-vpool0_vdata

You can forget about the neo part :) But the rest seems important: you don't point it at your combined virtual LV device but rather at the underlying data LV ... right?

In the end nothing was lost and I'm happy now :+1: Perhaps this is of use to others.
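On dracut-based systems, a hedged example of pulling the VDO userspace tools into the initramfs so they are available early at boot (the /usr/bin paths are an assumption; check where your distro installs them):

# /etc/dracut.conf.d/vdo-tools.conf -- ship the rebuild/stats tools in the initramfs
install_items+=" /usr/bin/vdoforcerebuild /usr/bin/vdostats "

$ dracut -f    # regenerate the initramfs for the running kernel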

lorelei-sakai commented 6 months ago

That is correct. vdoforcerebuild, and most of the other userspace tools, must operate directly on the pool_data device.
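As an illustration, one way to find that device on an LVM setup; the neo/vpool0 names follow the example above and are placeholders:

$ lvs -a neo                      # the hidden data sub-LV is listed as [vpool0_vdata]
$ ls /dev/mapper/ | grep vdata    # it is exposed as /dev/mapper/neo-vpool0_vdata
$ vdoforcerebuild /dev/mapper/neo-vpool0_vdata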

Thanks for your kind words, and I'm glad it's working out for you.