akiradeveloper / dm-writeboost

Log-structured Caching for Linux
GNU General Public License v2.0

Too long cache bring-up time makes it unusable on a pacemaker cluster #193

Closed isardvdi closed 6 years ago

isardvdi commented 6 years ago

Hi,

We are using a writeboost cache with pacemaker on a cluster. We wrote an OCF writeboost resource that mainly runs writeboost on start and writeboost -u on stop. The cache is created on two raids (one of HDDs and one of NVMes) replicated over drbd on two storage nodes.
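Roughly, the agent boils down to something like the sketch below. This is illustrative only, not our exact agent (which has more error handling); "storage" is the name of the cache device we create further down, and the monitor check is just one way to test for the device:

```sh
#!/bin/sh
# Minimal OCF-style agent sketch wrapping the writeboost CLI (illustrative).

: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

case "$1" in
  start)
    # Apply /etc/writeboosttab and create the cached device.
    writeboost && exit "$OCF_SUCCESS"
    exit "$OCF_ERR_GENERIC"
    ;;
  stop)
    # Write dirty data back to the backing device and remove the cache device.
    writeboost -u && exit "$OCF_SUCCESS"
    exit "$OCF_ERR_GENERIC"
    ;;
  monitor)
    # Treat the resource as running if the device-mapper device exists.
    dmsetup info storage >/dev/null 2>&1 && exit "$OCF_SUCCESS"
    exit "$OCF_NOT_RUNNING"
    ;;
  *)
    exit "$OCF_ERR_GENERIC"
    ;;
esac
```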

The cache works well, but when the node that has writeboost active is fenced, the cluster writeboost resource will automatically start writeboost on the other node. The data on the cache storage (NVMes) and on the backing HDDs is fine, as it is always replicated through drbd, but, as expected, reconstructing the writeboost cache can take too long.

Is there any way to get writeboost activated quickly, assuming (or forcing?) that the data on the cache disks and the backing disks is consistent?

akiradeveloper commented 6 years ago

@isardvdi

The cache is created on two raids (one of hdds and one of nvmes) replicated over drbd on two storage nodes.

I can't quite see how you designed the system. If we denote a writeboost device as (writeboost hdd ssd), then you have two writeboost devices, (writeboost hdd0 ssd0) and (writeboost hdd1 ssd1), one on each node (forget about the raid under hddN and ssdN), and you are replicating hdd0 with hdd1 and ssd0 with ssd1 over drbd, right?

isardvdi commented 6 years ago

Hi,

We have created a setup like this one:

+---------------------+                      +-------------------+
|   CLUSTER: NODE A   |                      |  CLUSTER: NODE B  |
|                     |                      |                   |
|                     |                      |                   |
|   nvme1n1+          |                      |  nvme1n1+         |
|          |          |                      |         |         |
|          +-->md1 <--------+ drbd109  +------->       +-->md1   |
|          |          |                      |         |         |
|   nvme2n2+          |                      |  nvme2n2+         |
|                     |                      |                   |
|                     |                      |                   |
|   sdb+---+          |                      |  sdb+---+         |
|          |          |                      |         |         |
|          +-->md0  <-------+ drbd100  +------>        +-->md0   |
|          |          |                      |         |         |
|   sdc+---+          |                      |  sdc+---+         |
+---------------------+                      +-------------------+

And we created the writeboost cache as follows on NODE A: wbcreate --reformat --read_cache_threshold=127 --writeback_threshold=80 storage /dev/drbd100 /dev/drbd109

As we are using an ext4 filesystem, our cluster runs writeboost on only one node; it creates the storage cache device on that node, which is then exported as an NFS resource.

This is /etc/writeboosttab: storage /dev/drbd100 /dev/drbd109 writeback_threshold=80,read_cache_threshold=127
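For reference, the Pacemaker side is wired roughly as follows (hypothetical resource names, mount point and OCF provider for our custom agent; our real configuration also includes the NFS export resources):

```sh
# Hypothetical pcs commands; names and the provider of our custom
# writeboost agent are illustrative, not copied from our cluster.
pcs resource create wb_cache ocf:isard:writeboost
pcs resource create storage_fs ocf:heartbeat:Filesystem \
    device=/dev/mapper/storage directory=/srv/storage fstype=ext4
pcs constraint colocation add storage_fs with wb_cache INFINITY
pcs constraint order wb_cache then storage_fs
```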

We are now running tests. If we shut down node A while it is handling writeboost, the 'writeboost -u' command will write the data back to the backing device correctly, our writeboost OCF cluster resource will wait for that to finish, and writeboost is then started again (quite quickly) on node B.

The problem is when we force a fence of node A while it is handling writeboost (as if a power failure had happened): the writeboost command is then run by our writeboost cluster resource on node B, but bringing it up can take too long because the cache has to be reconstructed. The test we are doing is more or less the same as having dm-writeboost on a single machine, shutting it down and starting it again; the only difference is that it now starts on another computer, but the data is the same as on the node that was shut down.

We have other storage servers with EnhanceIO + Pacemaker, and that cache can start again in no time by assuming the data on the NVMe and the HDD are in sync. Is it possible to bring the dm-writeboost cache up in less time?

akiradeveloper commented 6 years ago

@isardvdi

Syncing md0 and md1 independently looks like nonsense to me because drbd is intrinsically async. In this system you may end up with an old md0 and a newer md1 after failover, which is inconsistent, and the fs is typically broken.

Consistency should be maintained by the writeboost driver. My recommendation is to make one writeboost device on each node. Let's call them wb0 = (writeboost ND0/md0 ND0/md1) and wb1 = (writeboost ND1/md0 ND1/md1). Then what you sync over drbd are wb0 and wb1. Next, mkfs on ND0.

This way, wb1 just holds an older state of wb0, and both states are consistent as a filesystem.
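A sketch of that stacking: /etc/writeboosttab on each node points at the local raids (e.g. `storage /dev/md0 /dev/md1 ...`, not at drbd devices), and drbd then replicates the resulting /dev/mapper/storage device. A hypothetical drbd 8.4-style resource file (hostnames, device minor and addresses are examples; with drbd9/drbdmanage you would create the resource through its own tooling):

```
# /etc/drbd.d/storage.res (illustrative)
resource storage {
  device    /dev/drbd200;
  disk      /dev/mapper/storage;   # the local writeboost device (wbN)
  meta-disk internal;
  on nodeA {
    address 10.0.0.1:7790;
  }
  on nodeB {
    address 10.0.0.2:7790;
  }
}
# Then mkfs is run on /dev/drbd200 on the primary node only.
```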

isardvdi commented 6 years ago

That makes sense! We are trying to create that scenario and do some failover tests. We will report if it works as expected.

isardvdi commented 6 years ago

Thanks for the advice. We set it up as you said and it worked like a charm. We set up a writeboost cache device on each drbd9 node and a pacemaker cluster over them. When we fence a drbd9 node and reboot it, it takes a long time to bring writeboost up, so we had to modify writeboost.service a little to make drbdmanaged.service wait for writeboost to finish.

Here is the modified writeboost.service we are using: http://isardvdi.com/thedocs/storage/cache/writeboost/
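The ordering change is essentially the following (shown as a drop-in sketch; the path is illustrative, and the actual unit we use is the one linked above):

```
# /etc/systemd/system/writeboost.service.d/ordering.conf (illustrative path)
[Unit]
# Make sure the writeboost devices are assembled before drbdmanaged
# brings up the DRBD resources stacked on top of them.
Before=drbdmanaged.service
```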

Thanks a lot!