akiradeveloper / dm-writeboost

Log-structured Caching for Linux
GNU General Public License v2.0

Ordered writeback #210

Closed schefi closed 4 years ago

schefi commented 4 years ago

Dear Akira,

Thank you for this amazing project. I would like to assure myself about the nature of dm-writeboost before I deploy it in the cluster of the university I work for. My main question is about the write-order and consistency dilemma. So far I have read the documentation and some open issues and concluded the following. Please correct me if at any point I'm wrong!

The RAM buffer accepts the write requests and collects them in order of arrival into segments of 512 KiB to reduce the wear of the flash cells. This buffer is volatile and may be lost in case of a power outage, but as far as I can tell from the code, dm-writeboost honors write barriers, so it flushes the RAM buffer onto the SSD whenever the upper layer requests it, and FS consistency is preserved so far. In normal operation the SSD log is written in 512 KiB chunks (called segments), where each segment may contain multiple write requests of different sizes to different block addresses, while preserving the order of the requests within the segment. The pitfall I suspect in my use case is that even if I set nr_max_batched_writeback to 1, the write requests are still going to be rearranged into ascending block-number order within a single segment and then written back to the backing storage. Am I getting this right? When interrupted suddenly, is this going to cause inconsistency in the backing store? Would there be a way to support an additional parameter which, when set to true, would skip the sorting and write everything back in the original order?
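
For reference, the way I plan to set this up is roughly the sketch below. The device paths are placeholders, and the optional-argument syntax (a count of optional arguments followed by key/value pairs) is my reading of the project README, so please correct me if the table format is off.

```sh
# Sketch only: paths are placeholders; check the README for the exact table syntax.
BACKING=/dev/cluster/vm01          # slow backing device (clustered storage)
CACHE=/dev/ssd/vm01-cache          # fast local SSD used as the log device
SZ=$(blockdev --getsz "$BACKING")

# Create the writeboost device with writeback batching reduced to one segment.
dmsetup create vm01-wb --table \
  "0 $SZ writeboost $BACKING $CACHE 2 nr_max_batched_writeback 1"
```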

I know that in the case of local storage you can replay the log and that would remove any inconsistencies, but my use case is a bit different. The reason I'm asking is the following: we use clustered storage for VM block devices, which are passed to one specific hypervisor node in the cluster to run the VM. Once the VM is powered off, it may be scheduled to a different physical node the next time it runs, so each node has to have access to the most recent version of a specific VM block device when it needs to start it. So far we have used simple clustered storage for this, which, as we all know, is not good when it comes to write latency. My idea is to put dm-writeboost between the VM block device and the clustered block device. That way, with a tolerable amount of delay, all the writes will eventually be propagated back to the clustered storage as writeback happens, while VMs can still run quite speedily thanks to read and write caching in dm-writeboost.

In case of a sudden interruption of the hypervisor itself or the storage network, the backing device will be behind the most recent state. Some data will clearly be lost, but that is not important, as only the data written in the past few seconds or tens of seconds will be lost to the cluster. It is acceptable that users redo whatever they were doing in the VM in the last few seconds or minutes before a hypervisor node failure (we expect this not to happen every day). What is more important in my intended use case is that the filesystem should be in an error-free state (an unfinished journal entry is acceptable), so that the same VM can resume normal operation on another node without manual filesystem repair. Hence the question: is it possible to perfectly honor the original write order when doing the writeback? If so, I think a journaling filesystem would easily recover from such a scenario, as what happens to the VM would be no more serious than a simple power loss in a regular PC. A sudden power loss doesn't break modern filesystems that easily, so the VMs would be bootable and usable on other nodes without any further manual intervention. Or am I overly paranoid about preserving the write order within segments?

A related question is whether adding discard support to dm-writeboost would be possible. Of course, in the event of a discard bio all pending writes would have to be flushed to the backing store first, and only then would the discard be executed; after that, everything could resume as normal. If there were an option for strictly ordered writeback, then putting a discard bio in this strict queue would not threaten any data still pending to be written back.

Sorry for the long post, and thank you for your time! schefi

akiradeveloper commented 4 years ago

@schefi Quickly answered the first part:

> The RAM buffer accepts the write requests and collects them in order of arrival into segments of 512 KiB to reduce the wear of the flash cells.

This understanding is wrong. Overwrites can happen in the RAM buffer (both partial writes and full 4 KiB overwrites), so the arrival order is forgotten there; that is OK, as I explain below.

> Am I getting this right?

Yes. The written data are reordered within a segment.

> When interrupted suddenly, is this going to cause inconsistency in the backing store?

No. A write barrier is the only way to make sure that previously completed write requests are persistent in the virtual device: once a write barrier completes, that data is safe; otherwise, nothing about consistency is guaranteed. If an interruption such as a battery or power failure means that some write data in a segment has reached the HDD and some has not, the SSD still has the data and never discards it before the segment is fully written back. The only case where you may end up inconsistent is when the SSD itself breaks, but that is the same for other caching drivers, and writeboost is still more likely to be consistent in this disastrous case because it writes the data back in almost the same order as it was written, so it may be recoverable in the FS layer. But again, that is not guaranteed and is a matter of luck.

> Would there be a way to support an additional parameter which, when set to true, would skip the sorting and write everything back in the original order?

No. Skipping the sorting buys nothing for consistency. If your intention is to reduce CPU usage and let the backing store do the sorting, that may be worth arguing about.
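
To put the barrier contract in concrete terms, here is a rough user-space illustration (a sketch only; the device path is just an example):

```sh
# Without a flush, completed writes may still sit in the volatile RAM buffer
# and can be lost on a power failure.
dd if=data.img of=/dev/mapper/wbdev bs=4K oflag=direct

# conv=fsync makes dd call fsync at the end; that flush reaches writeboost as a
# write barrier, and only after it returns is the preceding data guaranteed to
# be persistent on the SSD log.
dd if=data.img of=/dev/mapper/wbdev bs=4K oflag=direct conv=fsync
```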

akiradeveloper commented 4 years ago

Regarding discard, there was a discussion in the past: https://github.com/akiradeveloper/dm-writeboost/issues/110

schefi commented 4 years ago

Thank you for the quick answers. I see how discard is difficult, and I understand now why you have disabled the feature.

Regarding the consistency question, I have to make a small adjustment to my original question. I understand that the current consistency model is completely safe across sudden reboots as long as you have the SSD log available when you reboot and reconstruct the writeboost block device. But I'm most curious about the consistency of the backing device itself; this is why I ask about the ordering. Let me explain: let's assume we have a catastrophic hypervisor node failure, smoke comes out of it, it totally breaks, in a way that it takes several days to get spare parts and bring it back online. This also means I will not have access to the SSD log for days at least. So my question is in fact about the scenario, as you yourself pointed out, where we suddenly lose the SSD log. What I would like in case of such a disaster is to start the same VM almost immediately on a different hypervisor node with the last state that was written back to the backing device (clustered storage in my case), but without the contents of the SSD log. I understand that there will be lost data, but what I care about much more is consistency: specifically, that the VM would be able to boot and work from the last written-back state without an fsck (an implicit journal replay is OK, it happens automatically on mounting anyway).

Strictly ordered writeback (which, as I understand it, is not what currently happens) would guarantee that the filesystem on the backing store never gets more inconsistent or more damaged than a regular single-disk PC does after a power loss. In light of such a specific use case, would it be worth contemplating a configurable option that turns off the reordering/sorting, or am I on the wrong track with this idea?

akiradeveloper commented 4 years ago

This reminds me of a use case which might help you: http://thedocs.isardvdi.com/

I will answer your specific scenario later.

schefi commented 4 years ago

Thank you for pointing me to IsardVDI. I have experience myself with DRBD and the other solutions mentioned on that site. One of the reasons I gave up on DRBD is exactly the thing we are discussing right now. DRBD has a volatile buffer for write spikes; I think it can be configured up to 80M, and any larger buffer requires you to buy the proprietary DRBD Proxy. Now, this write buffer is completely volatile in the first place, so it poses an even greater risk in case of node failure. But let's put that aside for a minute.

If we don't count the write buffer, DRBD either uses synchronous replication, which guarantees consistency but suffers from the big write latencies mentioned before, like any other clustered or distributed storage, or it can switch to an asynchronous resync mode, but in that mode writes are completely reordered across the whole block device: it uses a bitmap to track changed blocks (differences between the primary and secondary node) and then updates them in ascending block order. There is a reason the IsardVDI engineers suggested using either EnhanceIO or dm-writeboost as an accelerator layer on top of the otherwise not-so-performant DRBD block device. But I have read somewhere in this very issue tracker that EnhanceIO uses optimizations that are unsafe, and in case of failure it generally causes more fsck inconsistencies than your respected dm-writeboost solution.

So either the IsardVDI solution is exposed to the same risk I brought up today, or the risk of getting filesystem corruption in practice is so negligible even with the current writeboost sorting implementation that I should not care and go ahead with it. Maybe I'm just overly cautious about this. As a third option, we could at least argue in theory whether having an in-order writeback mode would make this risk go away or not. Many thanks for taking my thoughts into consideration.

akiradeveloper commented 4 years ago

> We use clustered storage for VM block devices, which are passed to one specific hypervisor node in the cluster to run the VM. Once the VM is powered off, it may be scheduled to a different physical node the next time it runs, so each node has to have access to the most recent version of a specific VM block device when it needs to start it. So far we have used simple clustered storage for this, which, as we all know, is not good when it comes to write latency. My idea is to put dm-writeboost between the VM block device and the clustered block device.

I couldn't quite see what is going on here. You have shared clustered storage, and you want to add independent SSD caching on each node?

Then my question is: why don't you use SSD caching in the clustered storage itself?

If you want to go that way, you should set up clean SSD caching by using write-around mode.
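
As a rough sketch (the device paths are examples, and the write_around_mode key is the optional table argument as I recall it being documented; please check the README of the version you deploy):

```sh
# Sketch: clean (write-around) caching, so the SSD only serves reads and never
# holds dirty data that the backing store does not already have.
BACKING=/dev/cluster/shared
CACHE=/dev/nvme0n1p1
SZ=$(blockdev --getsz "$BACKING")

dmsetup create shared-wb --table \
  "0 $SZ writeboost $BACKING $CACHE 2 write_around_mode 1"
```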

akiradeveloper commented 4 years ago

Of course, if the writeback were ordered there would be many advantages, but it is just a fantasy. Let's suppose we have a sequence of side effects consisting of writes (W) and flushes (F), ordered so that replaying it in order reproduces the state of the storage. Let's denote it like WWWWFWWWF...

What we would have to do here is apply the first four Ws atomically, which is not possible in general. You may think ordered writeback is possible by writing back the Ws one by one, but that turns everything into synchronous FUA writes and performs very poorly.

Considering the trade-off, I designed writeboost to allow overwrites in the RAM buffer, sort within segments, and then send them all back to the backing store asynchronously. That way, if there is no power failure in between the asynchronous writes, all the writeback data hits the backing store and is eventually applied there. (We may assume the backing store itself is perfectly durable, because that is out of our scope.)

schefi commented 4 years ago

> I couldn't quite see what is going on here. You have shared clustered storage, and you want to add independent SSD caching on each node?
>
> Then my question is: why don't you use SSD caching in the clustered storage itself?
>
> If you want to go that way, you should set up clean SSD caching by using write-around mode.

I'm sorry, I see my error now. Looking at http://thedocs.isardvdi.com/setups/ha/active_passive/ again makes me realize that they did what you just suggested, not what I wrote earlier; there is an ASCII diagram behind the link that explains it. The DRBD device is on top of the caching layer, and they also replicate the contents of the cache over the cluster. But I don't see the use of it (at least for write caching) this way, because networked storage is always going to be a serious bottleneck for writes, as it needs confirmation from the peer node or a quorum of other nodes that they have accepted the write. This setup is therefore very sensitive to network turnaround time, and even with a single network switch it is going to be slower than an ordinary local mechanical hard drive for a set of small writes. So the question is why accelerate storage-node writes with an SSD on the storage node itself if the write requests are going to be bottlenecked by the network anyway. The solution outlined there is safe after all (I was mistaken at first, sorry), as the SSD cache is also replicated to the cluster in IsardVDI, but then again this will not really help performance.

To answer your other question: yes, I want to add an individual SSD cache to the compute nodes, which sounds insane at first, but it isn't, because in this specific scenario there is always only one node reading and writing a specific image file or VM block device at a time. So you won't have a stale cache; in fact the cache device is not even present on inactive nodes. What I intend is to set up the writeboost device on a specific compute node just before it launches a VM, and tear it down with proper flushing once the VM shuts down (roughly the flow sketched below). I don't want the VM to feel the high write latencies of the clustered storage; instead I want the compute node to collect writes in a fast log and then write the changes back, in order, to the clustered storage at whatever speed it can in the background. It doesn't matter if the writeback is slow; it only matters that the changes propagate from the active node to the distributed storage in the same order the write requests hit the SSD log. And there is nothing wrong with this scenario as long as I don't have a failure, e.g. the compute node being cut off from the storage network, or failing completely before writing out the full contents of the SSD log. This will not happen very often, so I think I might even give it a go the way it is now. I just wanted to ask your opinion about making it safer with an option to select in-order writeback, just for the extreme case of log loss.
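
For completeness, the per-VM attach/detach flow I have in mind is roughly the sketch below. The names are placeholders, and the full-writeback message before teardown (drop_caches here) is my reading of the docs, so the exact message may differ.

```sh
VM=vm01
BACKING=/dev/cluster/$VM            # VM block device on the clustered storage
CACHE=/dev/ssd/${VM}-cache          # local SSD partition on the active compute node
SZ=$(blockdev --getsz "$BACKING")

# Attach the cache just before the VM starts on this node.
dmsetup create ${VM}-wb --table "0 $SZ writeboost $BACKING $CACHE"

# ... run the VM against /dev/mapper/${VM}-wb ...

# After VM shutdown: force all dirty data back to the clustered storage, then
# tear the cache down so the backing device alone is current for the next node.
dmsetup message ${VM}-wb 0 drop_caches   # message name assumed; check the README
dmsetup remove ${VM}-wb
```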

akiradeveloper commented 4 years ago

What you want here is not caching but an additional journaling layer (or something like an IO replay mechanism). Writeboost is just a highly performant caching layer that happens to use log-structuring for performance and durability, which is far from what you want, so there is no chance of implementing this feature in this module. If you want to achieve your goal, I recommend implementing it yourself.