dm-vdo / vdo

Userspace tools for managing VDO volumes.
GNU General Public License v2.0

Why is VDO so Slow? #38

Closed geremi1 closed 2 years ago

geremi1 commented 3 years ago

When I am writing files to a filesystem (btrfs) on VDO, it is VERY slow, writing at about 3 MiB/s. Yet, the NVMe SSD is able to write at a much faster speed. Here is a summary of the configuration I'm using:

With that in mind, what may I modify to speed things up (possibly to 2.5 GiB/s, as without VDO in the chain)?

rhawalsh commented 3 years ago

Hi @geremi1, thanks for reaching out.

It is very curious that it is so extremely slow. I don't think there is quite enough information to make an educated guess yet, however. I have a few questions that I was hoping you might be able to answer. And if there are other thoughts you have around the configuration/test, please feel free to share.

  1. What does your test actually do? Is it something like fio writing to a file inside btrfs? If you're using fio, would you mind sharing the configuration? Or are you just copying data from one place to another? If you're just copying data, what does that data look like?
  2. What did your test without VDO look like? Storage stack configuration, and the actual test performed as well.
  3. I think it would be useful to get outputs of things like vdostats --verbose and any other specifics for configuration that you used.
  4. I think it would also be useful to see what iostat -dmx 1 reports while you're writing the data. This would help us understand whether the underlying storage bandwidth is fully saturated or if there is something else that stands out.
geremi1 commented 3 years ago

Hi @rhawalsh. I decided to test with my external HDD and the exact same slowness is present. Here are my answers:

  1. I simply copy the directory with cp -r linux-5.11.8/ /mnt/dst/, the destination being on the btrfs-on-VDO filesystem. The data is an untarred Linux kernel source tree.

  2. It is the same test I use on btrfs with this stack configuration: Partition / LVM / crypt luks2 / LVM / btrfs. The actual performance of this default btrfs mount is 134 MiB/s (on the external HDD). Here is the output of iostat -dmx 1 when I'm copying the entire directory to the btrfs filesystem. (/dev/sda / LVM / luks2: dm-10 / LVM: dm-12 / btrfs: dm-13).

  3. Here is the output of vdostats --verbose once the directory is copied. There isn't any other configuration apart from the defaults for everything (in my tests).

  4. Here is the output for iostat -dmx 1 when copying the directory to the btrfs on VDO filesystem. It is much more loaded than the one without VDO. (/dev/sda / LVM / luks2: dm-10 / LVM: dm-12 / vpool0_vdata: dm-11 / vpool0-vpool: dm-13 / btrfs: dm-14)

We can see that /dev/sda is at 100% utilization at around 6 MiB/s, which is very strange. I hope you'll find out why it's slow.

raeburn commented 3 years ago

Hi, @geremi1... I’ve been trying to reproduce the sort of problem you describe, but without luck so far.

I’ve been experimenting with both fio and tar for writing to the file system, and in my setup (the physical storage is hardware raid0 of a few spinning disks), I’m seeing 150-300 MB/s. That’s btrfs on top of VDO on top of crypt... more specifically:


+ dmsetup ls
vdovg-xvdopool-vpool    (253:4)
vdovg-xvdopool_vdata    (253:3)
vdovg-xvdo  (253:5)
foovg-foovol    (253:1)
vdo0-lvm0   (253:0)
crypt0  (253:2)
+ dmsetup status
vdovg-xvdopool-vpool: 0 871245824 vdo /dev/dm-3 normal - online online 902345 217810944
vdovg-xvdopool_vdata: 0 1742487552 linear 
vdovg-xvdo: 0 871243776 linear 
foovg-foovol: 0 1742782464 linear 
vdo0-lvm0: 0 1742790656 linear 
crypt0: 0 1742749696 crypt 
+ dmsetup table
vdovg-xvdopool-vpool: 0 871245824 vdo V2 /dev/dm-3 217810944 4096 32768 16380 on auto vdovg-xvdopool-vpool maxDiscard 1 ack 1 bio 4 bioRotationInterval 64 cpu 2 hash 1 logical 1 physical 1
vdovg-xvdopool_vdata: 0 1742487552 linear 253:2 2048
vdovg-xvdo: 0 871243776 linear 253:4 1024
foovg-foovol: 0 1742782464 linear 253:0 2048
vdo0-lvm0: 0 1742790656 linear 8:17 2048
crypt0: 0 1742749696 crypt aes-xts-plain64 :64:logon:cryptsetup:830aea5d-c4b1-4c08-9f85-5f40e351f44a-d0 0 253:1 32768
+ fio --name=jobname --bs=4096 --rw=write --filename=/mnt/bigfile --numjobs=1 --size=10737418240 --direct=1 --unlink=0 --iodepth=128 --ioengine=libaio --scramble_buffers=1 --end_fsync=1
...

So far the performance looks okay. (We’d always like it to be higher, of course, but I’m not seeing anything like you describe.) I’ll continue to run some tests and see if I can come up with anything.

A couple possibilities do come to mind. First, is there any chance your machine is heavily loaded, or under-powered? VDO is fairly hungry for CPU and memory resources. Second, if you’re running VDO in a virtual machine, there are some recent driver changes to fix a high context switch rate (lots of timer interrupts) which might impose an extra load in a hypervisor environment, though I’m not aware of cases of it having a drastic throughput performance impact.

geremi1 commented 3 years ago

Hi @raeburn and @rhawalsh,

There is no chance the machine is heavily loaded, under-powered (no under-voltage tricks), or running inside a virtual machine. All of VDO's IO threads together don't even use 40% of one CPU while writing in any of my tests.

I tested btrfs on VDO again on my external HDD, directly on the partition /dev/sda3 (/dev/sda / VDO: dm-12), without better results (still 5-6 MiB/s). This was on the latest Arch Linux ISO (SHA1) with the necessary tools installed to perform the test (same untarred Linux source tree).

Moreover, I tested it on my NVMe SSD (with the Arch Linux ISO), which also gave slow results (about 54 MiB/s) compared to the native 500+ MiB/s btrfs performance (also on luks2). (with VDO: /dev/nvme0n1p3 / LVM / luks2: dm-0 / VDO: dm-4 / btrfs: dm-13) (without: /dev/nvme0n1p3 / luks2 / btrfs: dm-4)

The problem, then, seems to be that when it writes, it completely overloads the external HDD's or SSD's write requests, as seen in the "wrqm/s" column of iostat -dmx 1: "%util" is pegged at 100% at 54 MiB/s for the SSD with VDO, yet doesn't even reach 52% without it.

raeburn commented 3 years ago

I didn’t mean under-voltage tricks so much as not having enough CPU power or memory available. Given that you’re spending money on NVMe SSDs, I didn’t really expect CPU power to be in short supply, but it’s not a bad idea to check the obvious things first anyway just to be sure.

I’ve been doing more performance tests in one of our lab configurations (SSDs on PCI-attached RAID controller, all several years old). I do see a slowdown when VDO is in the stack (not surprising as it does a lot of work), but still not as large as you describe, in any of the configurations I’ve put together so far; maybe a factor of 2-3, not 9-10.

One other thing I note: At 54 MiB/s, an untar’ed linux source tree is probably only going to take some 20-30 seconds to write. In my experience, VDO performance tends to vary over the course of a test, and that’s a bit short to get a clear estimate of the average. We haven’t dug deeply into the reasons yet, though I’ve got some guesses. In any case, I try to run tests for at least a couple minutes or so, preferably longer.

Perhaps you could gather some more info for me while you’ve got a big write job running, so I can try to get a clearer picture of what’s happening inside the driver at the time? If you could, first, run “iostat -cdmtxy 1” during the test; that adds CPU load and timestamps to the report. Then, try running:

date ; dmsetup message vdo-pool-name 0 dump queues pools ; vdostats /dev/mapper/vdo-pool-name --all

where vdo-pool-name is the /dev/mapper/... name for the pool device. Do this a couple of times, about 10 seconds apart, while VDO is running slow. The “dmsetup message” command will cause the driver to dump a lot of info into the kernel log, so you’ll need to fetch the kernel log (with full timestamps, ideally) to send as well.
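
For example, with the vpool0-vpool name from your earlier stack (substitute whatever your pool is actually called), one sampling round might look like the sketch below; the journalctl step is just my suggestion for grabbing the kernel log with precise timestamps:

date ; dmsetup message vpool0-vpool 0 dump queues pools ; vdostats /dev/mapper/vpool0-vpool --all
# vpool0-vpool is the pool name from your earlier listing; substitute yours
# wait roughly 10 seconds, then run the same line again
journalctl -k -o short-precise > vdo-kernel-log.txt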

If the test write runs long enough, give the command line above twice more, a couple minutes or more after the first time. Maybe the evidence will suggest the same bottleneck both times, maybe it’ll look different…

Between the iostat info, vdostats report, and info dumped to the kernel log, and the timestamps to correlate them, I think that’s just about all the information we can easily pull out at the moment. If you can get those for me, I’ll try to figure out what kind of processing is being done in those time intervals, and what various parts of the driver might be waiting on, and see if anything looks unusual.

raeburn commented 3 years ago

Oh, and the "dmsetup ls" and "dmsetup table" output from the device stack setup you use for the test, please, to get the exact configuration and layout of VDO and its storage. Just the devices involved in the test.

geremi1 commented 3 years ago

Thank you @raeburn for your feedback. I fetched all the data you asked for and bundled it inside this file. It contains:

The test on the backup transfer lasted about 15 minutes on my external HDD, transferring only 8.21G of 71G, and afterwards I waited 11 minutes for sync to finish flushing what rsync had transferred (10.5G total). I'm eager to learn what is happening.

raeburn commented 3 years ago

Thanks! Do you have the kernel log messages from the run too?

raeburn commented 3 years ago

Oh, sorry, I see now you included it in the vdostats text file...

raeburn commented 3 years ago

Were there any workQ reports in the kernel log after the first one? There should've been one such section each time the dmsetup message command was run.

geremi1 commented 3 years ago

Hi @raeburn, yes, there are others every 10 seconds. Here is the complete kernel log of the same test I performed, without parsing (I don't know why it didn't match them).

raeburn commented 3 years ago

Ah, thank you! I suspect we overflowed the in-memory ring buffer, which is all that dmesg would show you. I was also trying to figure out why there were so few of the "kvio" entries in each block... I know it seemed like a lot, but I was expecting 2000! This will give a better picture of the driver state...

raeburn commented 3 years ago

I apologize for the time it’s taken me to get back to you, but I’ve been digging into a few areas here, and finding several things that could be contributing to the performance problem you’re seeing. Some are known issues in VDO, but it’s also brought to light some we weren’t aware of.

First off, I don’t think you’re going to get quite the same performance you’re seeing without VDO in the stack for SSDs. Between compression and deduplication, VDO does a lot of work, involving a lot of CPU cycles and sometimes some amount of additional I/O (data reads) if deduplication is successful, and this incurs a cost. All that said, I don’t think it should be as bad as you’re seeing.

A few things might help:

1) Zero out the VDO (“dd if=/dev/zero …”) before creating the file system

If not the whole device, then a size at least as big as you’re planning to use for the file system.

The test you’re doing runs afoul of some startup costs associated with a newly created and/or newly started-up VDO device, instead of testing the steady-state behavior.

VDO uses dynamically allocated metadata structures on disk for the logical-to-physical address mapping, and writing the zeros will force the allocations to get done. It also causes the metadata to get loaded into VDO’s cache. If you’re dealing with an existing VDO device and file system with content, reading the device with dd will also force the full address map metadata to get loaded into cache. (Actually, writing or reading one block about every 3 MB is all it takes to allocate the structures or load the cache, respectively. But unless you want to use a tool like “fio”, I think “dd” of every block may be the easiest and fastest way.)
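
As a rough sketch of that step (assuming the VDO logical volume shows up as /dev/mapper/vg-vdolv, which is a placeholder name; use whatever device your file system will actually sit on):

# vg-vdolv is a placeholder; newly created VDO: write zeros to force block map allocation
dd if=/dev/zero of=/dev/mapper/vg-vdolv bs=4M oflag=direct status=progress
# existing VDO with data on it: read instead, to pull the address map into cache
dd if=/dev/mapper/vg-vdolv of=/dev/null bs=4M iflag=direct status=progress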

It also turns out there’s some inefficient behavior we hadn’t noticed before in VDO where, when we allocate a new disk block for certain parts of the metadata, we issue a read for that location, even though we know there’s no data stored there yet. (I’ve filed a ticket on that; see below.) After the read completes, we initialize the data structure and go about our business.

Unfortunately, in the access pattern of your test (writing to lots of pages never written to before), the vdostats report (under "block map incoming pages") shows that most of the time, some of the I/Os going through VDO were waiting for this extra read to be done. Worse, the "dump" listings show that it was often 200+ I/Os that were likely waiting for it ("kvio…findBlockMapSlot -/attemptLogicalBlockLock"), which exaggerates that latency cost.

2) Use btrfs compression instead of VDO compression.

I found a significant speed increase using “mount -o compress” after creating the volume with “--config allocation/vdo_use_compression=0”. How much of this comes from the reduced work in VDO per block written (to VDO) and how much is from reducing the amount of data actually written to VDO per megabyte written to the file system is unclear, but between the two, you should still get the storage benefits of compression with improved throughput.
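
As a sketch of that combination (the VG/LV names and sizes below are placeholders, not your actual setup):

# placeholder names and sizes throughout
lvcreate --type vdo -n vdolv -L 800G -V 2T --config 'allocation/vdo_use_compression=0' vg/vpool0
mkfs.btrfs /dev/vg/vdolv
mount -o compress /dev/vg/vdolv /mnt/dst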

3) I/O scheduler experiments

We’ve found that sometimes changing the I/O scheduler used for a storage device (see /sys/block/sda/queue/scheduler for example) can improve VDO performance. Unfortunately it can be hit or miss, and we haven’t dug into the low level aspects of the issue yet.
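
For example (sda here stands in for whichever disk sits at the bottom of the VDO stack):

cat /sys/block/sda/queue/scheduler          # shows e.g. [mq-deadline] kyber bfq none
echo none > /sys/block/sda/queue/scheduler  # switch schedulers, then re-run the test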

4) Tell btrfs to avoid making redundant copies: mkfs.btrfs -d single -m single

VDO’s deduplication mechanism will remove the redundant copies and map both logical addresses to the same physical copy of the data. Having btrfs send two copies of its metadata just makes extra work for VDO to do.

I get a slight improvement in throughput (a few percent) when I change from the default of “-m dup” in a one-HDD setup.

5) Use smaller btrfs metadata blocks: mkfs.btrfs -n 4096

The default “nodesize” (metadata block size) for btrfs is 16 kB on x86. But VDO tracks data in units of 4 kB, and manages each separately. So they’ll get broken into four sub-blocks, each getting a new location assigned (hopefully consecutively), the contents hashed and checked to see if they match anything known, like the previous versions of the same 16 kB metadata chunk. If such matches are found, the 16 kB metadata block may not wind up in a contiguous block on disk, as the already-existing, matching 4 kB data blocks would be used when possible, and newly allocated locations otherwise.

If we tell btrfs to break metadata into 4 kB blocks, hopefully in many cases reads and writes can be cut down to just the data that’s needed, fewer blocks need to be transferred, and we’ll have fewer cases of trying to read chunks of metadata that btrfs thinks are sequential locations but are in fact scattered around.

Once again, this did seem to make a small difference in throughput in some of my testing.
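
Putting items 4 and 5 together, a combined invocation would look roughly like this (the device name is a placeholder):

mkfs.btrfs -d single -m single -n 4096 /dev/mapper/vg-vdolv   # vg-vdolv is a placeholder device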

6) Stop btrfs from trying to physically separate metadata: mkfs.btrfs --mixed

The “--mixed” option is documented as bad for performance on larger btrfs file systems, and in a quick HDD test I did without VDO in the stack, throughput dropped by almost half.

However, VDO’s data storage approach allocates a new block on each write (reclaiming it if the data block turns out to be a duplicate) without regard for the logical address, so btrfs’s attempt to allocate metadata and data grouped together in different physical regions is doomed anyway when running atop VDO.

Having btrfs allocate both from the same storage doesn’t seem to hurt performance when running on top of VDO, at least in my tests. In fact it could be a tiny bit better, though that could also be “--mixed” apparently implying “-m single”, as in item 4.
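
If you want to try item 6 as well, the sketch becomes (again a placeholder device; --mixed requires the nodesize to match the 4 kB sector size anyway):

mkfs.btrfs --mixed -n 4096 /dev/mapper/vg-vdolv   # vg-vdolv is a placeholder device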

7) Use SSD storage as a write-back cache above VDO

The idea here would be to act as a large buffer for data going to VDO, to absorb bursts of high write activity and then let VDO catch up. Continuous high write activity (as in a long-duration throughput test) may still overwhelm it and negate the benefit eventually.

8) Use an SSD as write-back cache between VDO and HDD storage

We’ve suggested this to a few people but haven’t done much testing to see how it performs. The idea is to take VDO’s writes, which tend to be smaller and harder to group, and use another layer under VDO as a write-back cache (e.g., using dm-cache or dm-writecache through LVM, or bcache) to accumulate lots of updates on an SSD before writing back data in big chunks to the HDD.

An SSD can more efficiently absorb the more-random-access pattern of VDO’s writes and deal with cases where we go back to fill in gaps later. (This happens sometimes due to the interaction of our mostly linear initial allocation scheme, followed by not writing to the block if deduplication or compression lead us to optimize the block’s storage, and then we eventually reassign the location for another use.) Also, VDO issues a very high rate of flush operations to its underlying storage, which an SSD would be much more capable of handling efficiently than an HDD. (I’ve also opened a ticket to get this high flush rate investigated, and reduced if possible.)

Obviously this doesn’t help your SSD-only case, or an HDD-only case, but if you run VDO on a system with both types of devices available, a hybrid solution may be able to take advantage of the VDO space optimization without paying the full performance cost all the time.
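
One heavily hedged sketch of the bcache variant (device names are placeholders, and I haven't benchmarked this exact combination): make the HDD the backing device and an SSD partition the cache, switch to write-back mode, and then build VDO on the resulting bcache device instead of directly on the HDD.

# /dev/sdb (HDD) and /dev/nvme0n1p4 (SSD) are placeholder names
make-bcache -B /dev/sdb -C /dev/nvme0n1p4             # HDD as backing device, SSD as cache
echo writeback > /sys/block/bcache0/bcache/cache_mode  # default is writethrough
# then create the VDO volume on /dev/bcache0 rather than on /dev/sdb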

9) Put recovery journal on SSD, everything else on HDD

This is another option to consider trying in a mixed SSD/HDD setup. VDO's recovery journal metadata is a frequently-updated data structure stored near the end of the backing storage, to which all of the address-mapping updates are logged as we go along. Instead of caching writes to it on an SSD, if LVM can be persuaded to build a logical volume mostly from the HDD but with the tail end stored in a segment on an SSD, the journal ends up permanently located on the faster storage, and the remaining HDD accesses may have improved locality.
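
One way to coax LVM into roughly that layout (a sketch only; the names, sizes, and the lvconvert step are assumptions on my part) is to build the backing LV from the HDD's physical volume and then extend it by a small amount allocated from the SSD, so the last extents, where the recovery journal lives, land on the fast device:

lvcreate -n vdobacking -l 100%PVS vg /dev/sdb1          # placeholder names; bulk of the LV on the HDD
lvextend -l +256 vg/vdobacking /dev/nvme0n1p4           # tail extents allocated from the SSD
lvconvert --type vdo-pool -n vdolv -V 2T vg/vdobacking  # then put the VDO pool on top of that LV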


I’ve opened a couple of tickets on things I found VDO doing that looked rather inefficient during my testing of this case, but these are things the VDO team will have to look into at some point:

https://bugzilla.redhat.com/show_bug.cgi?id=1953781 - flushing data to stable storage very often, tens of times per second, even while getting no flush requests with the incoming data

https://bugzilla.redhat.com/show_bug.cgi?id=1953792 - issuing reads of metadata structures on disk when we’ve just allocated the space and it’s therefore known to be uninitialized (but we wait to read it anyway)

Githopp192 commented 3 years ago

Hey Andy, thanks again for working so hard on VDO. Some time ago, I ran cloud services in a production environment (CentOS-based) with VDO. Now I have migrated the cloud stuff (VMs, datastores for OS and data) to a system with BTRFS. (My primary goal was ZFS, but those systems don't have enough storage capacity.)

My CentOS system boots from NVMe-based storage with VDO on LVM (Andy already has all my mails about it :-) ), and the data layer is on normal SATA SSDs. But on both storage layers with VDO I regularly saw average disk wait times of up to 5000 ms!

After more than a year of testing (or we could say it was production!), I got the feeling that:

I see big progress in the last couple of months, but I think it's not yet enough for VDO to reach enterprise stability like ZFS or BTRFS. I know it's quite tricky to bring it all together, but the kernel team, file-system team, volume-management team, and compression-layer team must become ONE TEAM so that they deliver a solution which is fully compatible within each layer of functionality.

Maybe I'm only seeing a small portion of the big picture, and for that I apologize.