archiecobbs / s3backer

FUSE/NBD single file backing store via Amazon S3

High transaction sync latency leading to ZFS "deadman" warnings #172

Closed. alaric728 closed this issue 4 months ago.

alaric728 commented 2 years ago

Howdy!

I've got a bit of an odd one for you!

I'm doing something like:

zfs send -L -c -i local_pool/fs@snap1 local_pool/fs@snap2 | zfs receive -s -F -d s3pool

Where I've got the s3pool mounted locally with these parameters (access key, etc. redacted): --size=10T,--blockSize=1M,--listBlocks,--ssl,--debug,--debug-http,--directIO
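
For reference, the full setup looks roughly like this; the bucket name, mount point, and pool layout are placeholders, and the debug flags are omitted:

# Mount the bucket as a single large backing file via s3backer (FUSE mode)
s3backer --size=10T --blockSize=1M --listBlocks --ssl --directIO my-bucket /mnt/s3backer

# Build the pool on the file exposed inside the mount point
zpool create s3pool /mnt/s3backer/file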

I'm experiencing pretty massive iowait within ZFS when sending bulk data through the bucket. The actual write speed appears to be about 100 MB/s, but according to tools like pv the stream between the two programs is running at about 1 GB/s. When I write more than 30 GB of data (the amount I can write in under 6 minutes, which will become apparent shortly), zed starts blowing up the logs with deadman timeouts:

Feb 16 17:47:06 host zed[26466]: eid=185323 class=deadman pool='s3pool' vdev=file size=1048576 offset=929096007680 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86276
Feb 16 17:47:06 host zed[26477]: eid=185324 class=deadman pool='s3pool' vdev=file size=1048576 offset=929094959104 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86275
Feb 16 17:47:06 host zed[26489]: eid=185325 class=deadman pool='s3pool' vdev=file size=1048576 offset=929093910528 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86274
Feb 16 17:47:06 host zed[26497]: eid=185326 class=deadman pool='s3pool' vdev=file size=1048576 offset=929092861952 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86273
Feb 16 17:47:06 host zed[26510]: eid=185327 class=deadman pool='s3pool' vdev=file size=1048576 offset=929091813376 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86272
Feb 16 17:47:06 host zed[26521]: eid=185328 class=deadman pool='s3pool' vdev=file size=1048576 offset=929090764800 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86271
Feb 16 17:47:06 host zed[26531]: eid=185329 class=deadman pool='s3pool' vdev=file size=1048576 offset=929089716224 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86270
Feb 16 17:47:06 host zed[26541]: eid=185330 class=deadman pool='s3pool' vdev=file size=1048576 offset=929088667648 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86269
Feb 16 17:47:06 host zed[26556]: eid=185331 class=deadman pool='s3pool' vdev=file size=1048576 offset=929087619072 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86268
Feb 16 17:47:06 host zed[26566]: eid=185332 class=deadman pool='s3pool' vdev=file size=1048576 offset=929086570496 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86267
Feb 16 17:47:06 host zed[26576]: eid=185333 class=deadman pool='s3pool' vdev=file size=1048576 offset=929085521920 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86266
Feb 16 17:47:06 host zed[26591]: eid=185334 class=deadman pool='s3pool' vdev=file size=1048576 offset=929084473344 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86265
Feb 16 17:47:06 host zed[26604]: eid=185335 class=deadman pool='s3pool' vdev=file size=1048576 offset=929083424768 priority=3 err=0 flags=0x184880 bookmark=75:11:0:86264

I'm probably barking up the wrong tree here, but is there a config setting I can use to limit the number of in-flight writes s3backer allows, or do I need to look at getting ZFS to back off here?

HaleTom commented 2 years ago

I'm also interested in running ZFS on top of s3backer.

I didn't see you listing --blockCacheFile, but if you did happen to redact it, did you look at --blockCacheThreads?
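
For example (cache path and size here are just illustrative values):

# Persistent on-disk block cache with a few writer threads; add your existing flags as well
s3backer --blockCacheFile=/var/cache/s3backer.cache --blockCacheSize=10000 --blockCacheThreads=4 my-bucket /mnt/s3backer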

HaleTom commented 2 years ago

Also, take a look at the deadman tunables:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html?highlight=deadman#zfs-deadman-checktime-ms
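
On Linux they can be inspected and adjusted through the module parameters (values here are illustrative, and they apply to every pool on the host):

# Current deadman thresholds, in milliseconds
cat /sys/module/zfs/parameters/zfs_deadman_checktime_ms
cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms

# Raise the sync threshold, e.g. to 10 minutes
echo 600000 > /sys/module/zfs/parameters/zfs_deadman_synctime_ms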

archiecobbs commented 2 years ago

Sorry, I'm not familiar with "zed" and don't know the meaning of "deadman timeout".

But using a block cache could definitely help (or hurt).

alaric728 commented 2 years ago

Thanks for the quick responses! I'll have a look at the block cache settings today. I'm hesitant to mess around too much with the deadman tunables, as I've got other local storage that I don't want to disturb.

alaric728 commented 2 years ago

After a good amount of testing today, it appears that the deadman warnings may be difficult to avoid, but adding the directIO + cache settings helps to prevent catastrophic data loss!

Before adding new arguments, I triggered a lot of IO, waited until the deadman warnings started to appear, then hard-powered off the machine. Invariably, this led to the zpool corrupting and more or less hosing all of zfs/zpool commands.

I added --directIO --blockCacheSize=1 --blockCacheThreads=4, and it caused transfers to decay faster from 1 GB/s down to their actual speed of around 40 MB/s. Interestingly, it has also caused my repeated hard power resets to stop immediately corrupting the pool! I'm not 100% sure the issue is resolved, but I think I'm on the right track.
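
For reference, the mount invocation now looks roughly like this (same placeholders as before):

# Minimal block cache (a single block), 4 cache threads, direct I/O
s3backer --size=10T --blockSize=1M --listBlocks --ssl --directIO --blockCacheSize=1 --blockCacheThreads=4 my-bucket /mnt/s3backer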

alaric728 commented 2 years ago

Are there any other settings I should look to for making s3backer extremely consistent? Performance isn't really a concern in my use-case.

archiecobbs commented 2 years ago

Wanting consistency but not performance makes you an outlier :)

If you add --blockCacheSync then the kernel does not receive "success" until the block has actually been written to S3 and a successful HTTP reply received back. You probably also want --blockCacheWriteDelay=0 if you do this. Then you can make your block cache as big as you want, because it's only going to contain clean data.

You could also just get rid of the block cache altogether. That would be done via --blockCacheSize=0. But this would mean every read requires network I/O.
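
In other words, something along these lines (the cache size is illustrative; keep your other flags as before):

# Synchronous writes: the kernel sees "success" only after S3 confirms the block
s3backer --blockCacheSync --blockCacheWriteDelay=0 --blockCacheSize=20000 my-bucket /mnt/s3backer

# Or no block cache at all; every read then hits the network
s3backer --blockCacheSize=0 my-bucket /mnt/s3backer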

HaleTom commented 2 years ago

Ah, I assumed that the tunables would be per ZFS pool, but sadly they're for the whole kernel module :(

If you made sure that the total size of blocks inside the block cache was less than the size of a ZFS transaction group (TXG), you'd likely be able to enable multiple workers with non-abysmal write performance and quite good on-storage consistency. ZFS expects TXGs to be atomic. You'd likely lose at most one TXG.

Or at worst, you'd have maybe one or two TXGs that needed fscking if a particular block of TXG (x - 2) hung and TXG (x-1) was fully written meanwhile.

Just a thought. I'm not sure how to determine the size of a TXG; I think the only guarantee is the suggested record size, but even that may be ignored.

Did you get a decent recovery when you fscked after pulling the power plug?

If you're on Linux, you could use bcache as a large local write-back cache for excellent performance, and a small s3backer cache for some parallelism, or set --blockCacheSync to be safest.
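
A rough sketch of that layering, assuming the s3backer file is exposed as a loop device (device paths here are hypothetical):

# Expose the s3backer backing file as a block device
losetup /dev/loop0 /mnt/s3backer/file

# Create a bcache device with a local SSD partition as the cache, then enable write-back
make-bcache -B /dev/loop0 -C /dev/ssd_cache_partition
echo writeback > /sys/block/bcache0/bcache/cache_mode

# Put the pool on the cached device
zpool create s3pool /dev/bcache0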

HaleTom commented 2 years ago

You probably also want --blockCacheWriteDelay=0 if you do this.

@archiecobbs the man page says regarding --blockCacheSync that it requires --blockCacheWriteDelay=0. Are there some exceptions to that?

alaric728 commented 2 years ago

Sadly, no fsck exists on ZFS afaik, and the one installed by ZFS is just /bin/true ;)

After some testing today with --blockCacheSync --blockCacheWriteDelay=0, it appears these settings don't play nicely with ZFS's multihost protections. Invariably, my pools would suspend due to MMP writes hitting their timeout without completing. Adding --directIO allowed me to get the pool importable, but once I registered a modicum of I/O, around 45 MB, the pool would immediately suspend. MMP writes originally reported delays of around 30 seconds, but no matter what reasonable value I set the timeout to (up to a minute), I couldn't get it to be happy and it would suspend. Unfortunately I don't have the liberty of turning multihost off in my deployment :(

With regards to matching TXG size to cache size, I think you'd have to look at ZFS's settings around dirty data maximums, as the only TXG settings I'm familiar with all center around time rather than space, which probably wouldn't be of much help here.
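
For reference, the module-wide knobs in question on Linux:

# Upper bound on dirty data per TXG (bytes) and the TXG timeout (seconds)
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_txg_timeout

# MMP settings that determine how quickly a pool suspends
cat /sys/module/zfs/parameters/zfs_multihost_interval
cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals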

archiecobbs commented 2 years ago

@archiecobbs the man page says regarding --blockCacheSync that it requires --blockCacheWriteDelay=0. Are there some exceptions to that?

No, you're correct; I forgot about that check.

archiecobbs commented 2 years ago

s3backer now supports NBD mode, and others have been doing some testing with ZFS.

Please try out the current master branch with the --nbd flag and see if things are working better.
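
A minimal sketch, assuming the NBD device path takes the place of the FUSE mount point (check the man page for the exact invocation on your build):

# NBD mode: the pool sits on a real block device instead of a FUSE-backed file
s3backer --nbd --size=10T --blockSize=1M --ssl my-bucket /dev/nbd0
zpool import s3pool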

archiecobbs commented 4 months ago

Closing old issue. Feel free to add more comments if new information becomes available.