Closed — alaric728 closed this issue 4 months ago.
I'm also interested in running ZFS on top of s3backer.
I didn't see you listing --blockCacheFile, but if you did happen to redact it, did you look at --blockCacheThreads?
Also, take a look at the deadman tunables:
Sorry, I'm not familiar with "zed" and don't know the meaning of "deadman timeout".
But using a block cache could definitely help (or hurt).
Thanks for the quick responses! I'll have a look at the block cache settings today. I'm hesitant to mess around too much with the deadman tunables, as I've got other local storage that I don't want to disturb.
After a good amount of testing today, it appears that the deadman warnings may be difficult to avoid, but adding the directIO + cache settings helps to prevent catastrophic data loss!
Before adding new arguments, I triggered a lot of IO, waited until the deadman warnings started to appear, then hard-powered off the machine. Invariably, this led to the zpool corrupting and more or less hosing all zfs/zpool commands.
I added --directIO --blockCacheSize=1 --blockCacheThreads=4, which caused transfers to decay faster from 1 Gbps down to their actual speed of around 40 Mbps. Interestingly, it has also caused my repeated hard power resets to stop immediately corrupting the pool! I'm not 100% sure the issue is resolved, but I think I'm on the right track.
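For reference, a mount invocation with those flags might look something like this (the bucket name and mount point are placeholders, not from this thread):

```shell
# Hypothetical sketch: --blockCacheSize=1 shrinks the cache to a single
# block and --blockCacheThreads=4 caps the concurrent cache writer threads.
s3backer --directIO --blockCacheSize=1 --blockCacheThreads=4 \
    --size=10T --blockSize=1M \
    my-bucket /mnt/s3backer
```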
Are there any other settings I should look to for making s3backer extremely consistent? Performance isn't really a concern in my use-case.
Wanting consistency but not performance makes you an outlier :)
If you add --blockCacheSync then the kernel does not receive "success" until the block has actually been written to S3 and a successful HTTP reply received back. You probably also want --blockCacheWriteDelay=0 if you do this. Then you can make your block cache as big as you want, because it's only going to contain clean data.
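A synchronous-write configuration along those lines might look roughly like this (bucket, mount point, and cache size are illustrative placeholders):

```shell
# Hypothetical sketch: every write is acknowledged only after S3 confirms
# it, so the cache holds only clean blocks and can safely be made large.
s3backer --blockCacheSync --blockCacheWriteDelay=0 \
    --blockCacheSize=10000 \
    my-bucket /mnt/s3backer
```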
You could also just get rid of the block cache altogether via --blockCacheSize=0. But this would mean every read requires network I/O.
Ah, I assumed that the tunables would be per ZFS pool, but sadly they're for the whole kernel module :(
If you made sure that the total size of blocks inside the block cache was less than the size of a ZFS transaction group (TXG), you'd likely be able to enable multiple workers with non-abysmal write performance and quite good on-storage consistency. ZFS expects TXGs to be atomic. You'd likely lose at most one TXG.
Or at worst, you'd have maybe one or two TXGs that needed fscking if a particular block of TXG (x - 2) hung and TXG (x-1) was fully written meanwhile.
Just a thought. I'm not sure how to determine the size of a TXG, I think the only guarantee is the suggested record size, but even that may be ignored.
Did you get a decent recovery when you fsck'ed after pulling the power plug?
If you're on Linux, you could use bcache as a large local write-back cache for excellent performance plus a small s3backer cache for some parallelism, or set --blockCacheSync to be safest.
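A rough sketch of that bcache layering, assuming the s3backer backing file is exposed through a loop device and a local SSD partition serves as the cache (all device paths are hypothetical):

```shell
# Expose the s3backer backing file as a block device (paths hypothetical).
losetup /dev/loop0 /mnt/s3backer/file

# Create a bcache backing device on the loop device and a cache device
# on a local SSD partition, then register both with the kernel.
make-bcache -B /dev/loop0
make-bcache -C /dev/sdb1
echo /dev/loop0 > /sys/fs/bcache/register
echo /dev/sdb1  > /sys/fs/bcache/register

# Attach the cache set to the backing device using its UUID
# (obtainable from bcache-super-show /dev/sdb1), then enable write-back.
# echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
```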
> You probably also want --blockCacheWriteDelay=0 if you do this.
@archiecobbs the man page says regarding --blockCacheSync that it requires --blockCacheWriteDelay=0. Are there some exceptions to that?
Sadly, no fsck exists for ZFS afaik, and the one installed by ZFS is just /bin/true ;)
After some testing today with --blockCacheSync --blockCacheWriteDelay=0, it appears these settings don't play nicely with ZFS' multihost protections. Invariably, my pools would suspend because MMP writes hit their timeout without completing. Adding --directIO allowed me to get the pool importable, but once I pushed a modicum of I/O through it, around 45 MB, the pool would immediately suspend. MMP writes originally reported delays of around 30 seconds, but no matter what reasonable value I set the timeout to (no more than a minute), I couldn't get it to be happy and the pool would suspend. Unfortunately I don't have the liberty of turning multihost off in my deployment :(
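For what it's worth, the MMP knobs live in the zfs kernel module's parameters; a sketch of inspecting and loosening them (the written values are purely illustrative, and as noted above they apply to the whole module, not per pool):

```shell
# Inspect the current multihost (MMP) settings.
cat /sys/module/zfs/parameters/zfs_multihost_interval
cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals

# Illustrative values only: issue MMP writes less often (milliseconds)
# and tolerate more missed intervals before suspending the pool.
echo 5000 > /sys/module/zfs/parameters/zfs_multihost_interval
echo 20   > /sys/module/zfs/parameters/zfs_multihost_fail_intervals
```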
With regards to matching TXG size to cache size, I think you'd have to look at ZFS' configurations around dirty data maximums, as the only txg settings I'm familiar with all center around time and not space which probably wouldn't be of much help here.
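As a back-of-the-envelope check along those lines, you can compare the block cache's worst-case footprint against zfs_dirty_data_max, the module-wide cap on dirty data per TXG; the numbers below are purely illustrative:

```shell
# Illustrative numbers: 1 MiB s3backer blocks, 512-entry block cache.
BLOCK_SIZE=$((1024 * 1024))
CACHE_BLOCKS=512
CACHE_BYTES=$((BLOCK_SIZE * CACHE_BLOCKS))
echo "cache holds at most $CACHE_BYTES bytes"   # 536870912

# Compare against the dirty-data cap (requires the zfs module loaded):
#   cat /sys/module/zfs/parameters/zfs_dirty_data_max
```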
> @archiecobbs the man page says regarding --blockCacheSync that it requires --blockCacheWriteDelay=0. Are there some exceptions to that?
No you're correct - I forgot about that check.
s3backer now supports NBD mode, and others have been doing some testing with ZFS. Please try out the current master branch with the --nbd flag and see if things are working better.
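In case it helps, an NBD-mode invocation might look roughly like this (the device path and bucket are placeholders; check the current README for the exact syntax):

```shell
# Hypothetical sketch of NBD mode from the master branch: load the
# kernel NBD driver, then attach the bucket to an NBD device.
modprobe nbd
s3backer --nbd --size=10T --blockSize=1M my-bucket /dev/nbd0
```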
Closing old issue. Feel free to add more comments if new information becomes available.
Howdy!
I've got a bit of an odd one for you!
I'm doing something like:
zfs send -L -c -i local_pool/fs@snap1 local_pool/fs@snap2 | zfs receive -s -F -d s3pool
Where I've got the s3pool mounted locally with these parameters (access key, etc. redacted):
--size=10T,--blockSize=1M,--listBlocks,--ssl,--debug,--debug-http,--directIO
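Spelled out as a command line, that mount would look something like this (the bucket name and mount point are placeholders, not from the original report):

```shell
# Same options as above, written as a direct invocation.
s3backer --size=10T --blockSize=1M --listBlocks \
    --ssl --debug --debug-http --directIO \
    my-bucket /mnt/s3backer
```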
I'm experiencing pretty massive iowait within ZFS when sending bulk data through the bucket. The write speed appears to be about 100 MB/s actual, but according to tools like pv, the stream between the two programs is about 1 GB/s. When I write more than 30 GB of data (the amount that I can write in <6 minutes, which will become apparent shortly), zed starts blowing up the logs with deadman timeouts.
I'm probably barking up the wrong tree here, but is there a config setting I can use to limit the number of on-the-fly writes s3backer allows, or do I need to look to getting ZFS to back off here?
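On the ZFS side, the write throttle and per-vdev queue depths are the usual knobs for reining in outstanding writes; these are real module parameters, but the values below are only illustrative:

```shell
# Cap concurrent async writes issued to each vdev (illustrative value).
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active

# Lower the dirty-data ceiling so TXGs stay small (512 MiB here).
echo $((512 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
```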