archiecobbs / s3backer

FUSE/NBD single file backing store via Amazon S3

[enhancement] make holes in cache file for zero blocks with fallocate #200

Closed: xmrk-btc closed this issue 5 months ago

xmrk-btc commented 1 year ago

I noticed that while zero blocks are not written to remote storage, they are written to the block cache. This is not optimal: for example, I am using s3backer as one mirror vdev of my ZFS pool, with the cache file on an SSD. Those writes shorten the life of the SSD and slow down the ZFS pool, especially during the initial sync (when doing zpool attach).

So I would like to punch holes in the cache file using fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, ...). I have already made the minimal changes and am testing them; I have a few issues and questions (see the sketch after the list below).

  1. fallocate() only works on Linux, so this would have to be handled in configure, but I am not familiar with the autoconf machinery.
  2. I guess we can make zero_cache optional, and perhaps turn it off when using fallocate and caching the whole remote bucket. Note that I am caching the whole bucket, so I am not concerned about evicting non-zero blocks, but other users may want to avoid that. Basically, with zero_cache you cache N non-zero blocks plus all zero blocks; with fallocate and no zero_cache you would cache N blocks, zero or non-zero. The cache file would occupy less space (no or little space needed for zero blocks), but it would also cover less data.
  3. fallocate could be turned on automatically (if available). The only drawback I see is that each block would be tested twice for being all zeros, once in zero_cache and then again in block_cache.
  4. How likely is this to be merged?
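
A minimal sketch of the punch-hole idea, assuming Linux fallocate(2) and a configure-generated HAVE_FALLOCATE macro (e.g. from AC_CHECK_FUNCS([fallocate])); the function name and parameters below are illustrative, not s3backer's actual API:

/*
 * Deallocate the cache-file range backing an all-zero block instead of
 * writing zeros. FALLOC_FL_PUNCH_HOLE must be combined with
 * FALLOC_FL_KEEP_SIZE, and the punched range reads back as zeros.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
punch_zero_block(int fd, off_t block_num, off_t block_size)     /* illustrative names */
{
#ifdef HAVE_FALLOCATE
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
      block_num * block_size, block_size) == -1) {
        perror("fallocate");
        return -1;
    }
    return 0;
#else
    /* No fallocate() on this platform: the caller should fall back to writing zeros */
    (void)fd; (void)block_num; (void)block_size;
    return -1;
#endif
}
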
archiecobbs commented 1 year ago

See if 3b2c19a works...

xmrk-btc commented 1 year ago

Going to test your version; just a few notes.

  1. I had already tried adding fallocate and got errors when reading the last block (something like "invalid argument"), and s3backer got stuck. I added ftruncate to ensure that the cache file can contain the currently written block (only ever enlarging the cache file, never shrinking it), and the error did not repeat. I think this is because pwrite silently enlarges the file when writing past its end, while fallocate with KEEP_SIZE doesn't (see the sketch after this list).
  2. We can do better and scan the written block for sub-blocks of zeros. I actually implemented this and will push my version to GitHub if my git skills permit.
  3. It does not solve the problem of slow initial synchronization of the ZFS mirror. Some numbers as reported by zpool iostat: writing 500 kB/s while synchronizing (when the block cache is not fully initialized yet), and 5 MB/s when the cache is full. Block size 512 kB.
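
A minimal sketch of the ftruncate workaround from item 1, assuming the helper name and parameters are illustrative rather than s3backer's actual code:

/*
 * Grow (never shrink) the cache file so the block about to be punched or read
 * lies within the file size, since fallocate() with FALLOC_FL_KEEP_SIZE does
 * not extend the file the way pwrite() past EOF does.
 */
#include <sys/stat.h>
#include <unistd.h>

static int
ensure_file_size(int fd, off_t required_size)
{
    struct stat sb;

    if (fstat(fd, &sb) == -1)
        return -1;
    if (sb.st_size >= required_size)        /* only enlarge, never shrink */
        return 0;
    return ftruncate(fd, required_size);
}
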
xmrk-btc commented 1 year ago

Please see cac2938ea814c0a2519df5343828a51cd3597286 for holes for zero sub-blocks - just to show the idea; my diff may need some cleanup. It conceptually splits the block into sub-blocks of MIN_HOLE_SIZE bytes and does either pwrite or fallocate for each sub-block as appropriate. But it joins consecutive sub-blocks into one request, so if we have, say, 5 consecutive zero-filled sub-blocks (and the next sub-block is non-zero), it does one fallocate covering those 5 sub-blocks. You can also see s3b_dcache_ensure_file_size, which resolved the read errors I mentioned.
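
A rough sketch of that idea (not the commit itself; MIN_HOLE_SIZE, is_zero_region() and write_block_with_holes() are illustrative names, and the block size is assumed to be a multiple of MIN_HOLE_SIZE):

/*
 * Walk a block in MIN_HOLE_SIZE sub-blocks, coalescing runs of all-zero
 * sub-blocks into one fallocate(PUNCH_HOLE) and runs of non-zero sub-blocks
 * into one pwrite(). Short writes are ignored for brevity.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

#define MIN_HOLE_SIZE 4096                      /* assumed sub-block granularity */

static bool
is_zero_region(const char *buf, size_t len)
{
    return buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
}

static int
write_block_with_holes(int fd, const char *buf, size_t block_size, off_t block_off)
{
    size_t start = 0;

    while (start < block_size) {
        const bool zero = is_zero_region(buf + start, MIN_HOLE_SIZE);
        size_t end = start + MIN_HOLE_SIZE;

        /* Extend the run while the next sub-block has the same zero/non-zero state */
        while (end < block_size && is_zero_region(buf + end, MIN_HOLE_SIZE) == zero)
            end += MIN_HOLE_SIZE;

        if (zero) {
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
              block_off + (off_t)start, end - start) == -1)
                return -1;
        } else if (pwrite(fd, buf + start, end - start, block_off + (off_t)start) == -1)
            return -1;
        start = end;
    }
    return 0;
}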

xmrk-btc commented 1 year ago

Just got the error as I predicted:

error reading cached block! Invalid argument

I was running zpool trim to see whether it would happen.

# strace -f -e trace=fallocate,pread64 -p $(pgrep s3backer)
strace: Process 2803849 attached with 15 threads
[pid 2803853] pread64(3, "\3\0\0\0\0\0\0\200\205\310\273\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288, 11765092352) = 282624
[pid 2803853] pread64(3, "", 241664, 11765374976) = 0
[pid 2803853] pread64(3, "\3\0\0\0\0\0\0\200\205\310\273\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288, 11765092352) = 282624
[pid 2803853] pread64(3, "", 241664, 11765374976) = 0

# ls -l /mnt/dreamhost_cache/cachefile
-rw-r--r-- 1 root root 11765374976 Jan 15 13:21 /mnt/dreamhost_cache/cachefile

So it is reading right at the end of the cache file: the read offset equals the file size, so pread64() returns 0.
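
One hypothetical way to tolerate that short read (as an alternative to growing the file up front; the function name is illustrative) would be to treat bytes past EOF as implicit zeros:

/*
 * Read a block, zero-filling whatever part of the requested range lies past
 * the current end of the cache file.
 */
#include <string.h>
#include <unistd.h>

static ssize_t
read_block_tolerant(int fd, void *buf, size_t len, off_t offset)
{
    size_t total = 0;

    while (total < len) {
        const ssize_t r = pread(fd, (char *)buf + total, len - total, offset + (off_t)total);

        if (r == -1)
            return -1;
        if (r == 0) {                           /* hit EOF: remainder reads as zeros */
            memset((char *)buf + total, 0, len - total);
            break;
        }
        total += (size_t)r;
    }
    return (ssize_t)len;
}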

archiecobbs commented 1 year ago

OK - the above comments, except for splitting writes into sub-ranges, are addressed in 394c842.

I will look at that next.

archiecobbs commented 1 year ago

See if c8343a2 looks right.

jjaakkol commented 7 months ago

I think this feature led to corruption in my use case. If the block cache is on an NFS filesystem, accessing the freshly deallocated blocks leads to I/O errors and corruption. I noticed this by running zpool trim on my filesystem. Also, I don't think this is useful, since:

  1. Freeing blocks is a metadata operation, so you get metadata writes elsewhere on your SSD.
  2. All modern SSDs can handle all-zero blocks in firmware and don't actually write the zero bytes anywhere. This is something that can be, and is, handled by the SSD.
  3. In 2024 it is likely that your filesystem already handles zero blocks with compression or deduplication.
  4. A busy cache will refill the block after a while anyway, which causes more metadata operations when the block is reused and reallocated. So instead of just reusing a block (before it even gets flushed to disk, if you are lucky), you still get the same amount of I/O plus some excess metadata I/O somewhere else.
  5. The excess I/O is metadata I/O, which is likely to cause synchronous I/O flushes, whereas plain data I/O is likely to stay in write caches.
  6. This causes corruption, at least when the block cache is on a Linux NFS filesystem. Corruption is bad, even if it is an NFS bug.

I suggest making this feature optional and turning it off by default. Or, even better, track zero blocks inside the s3backer block cache with a separate zero-block bitmap or a journal, so the block-cache I/O could be skipped completely. That would be portable, too.
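
A very rough sketch of the bitmap idea (the names and layout are illustrative, not an s3backer API): keep one bit per cached block, set it when an all-zero block is written, and skip the cache-file I/O entirely whenever the bit is set.

#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)

struct zero_map {
    unsigned long *bits;                        /* one bit per cached block */
    size_t num_blocks;
};

static struct zero_map *
zero_map_create(size_t num_blocks)
{
    struct zero_map *zm;

    if ((zm = malloc(sizeof(*zm))) == NULL)
        return NULL;
    zm->bits = calloc((num_blocks + BITS_PER_WORD - 1) / BITS_PER_WORD, sizeof(*zm->bits));
    if (zm->bits == NULL) {
        free(zm);
        return NULL;
    }
    zm->num_blocks = num_blocks;
    return zm;
}

static void
zero_map_set(struct zero_map *zm, size_t block, bool zero)
{
    const unsigned long mask = 1UL << (block % BITS_PER_WORD);

    if (zero)
        zm->bits[block / BITS_PER_WORD] |= mask;    /* block reads back as all zeros */
    else
        zm->bits[block / BITS_PER_WORD] &= ~mask;
}

static bool
zero_map_is_zero(const struct zero_map *zm, size_t block)
{
    return (zm->bits[block / BITS_PER_WORD] >> (block % BITS_PER_WORD)) & 1UL;
}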

archiecobbs commented 7 months ago

I think this feature led to corruption in my use case. If the block cache is on an NFS filesystem...

Hmm... where do you think this corruption is coming from? If you think it's coming from s3backer, then it shouldn't matter whether NFS is being used or not.

Do you have a minimal reproducible test case, e.g., not involving NFS?

Thanks.