Closed by xmrk-btc 5 months ago
See if 3b2c19a works...
Going to test your version; just a few notes.
zpool iostat: writing 500 kB/s when synchronizing (and the block cache is not fully initialized yet), and 5 MB/s when the cache is full. Block size is 512 kB.
Please see cac2938ea814c0a2519df5343828a51cd3597286 for punching holes for zero sub-blocks - just to show the idea; my diff may need some cleanup. It conceptually splits the block into sub-blocks of MIN_HOLE_SIZE bytes and does either pwrite or fallocate for each sub-block as appropriate. But it joins consecutive sub-blocks into one request, so if we have, say, 5 consecutive zero-filled sub-blocks (and the next sub-block is non-zero), it does one fallocate covering those 5 sub-blocks. You can also see s3b_dcache_ensure_file_size, which resolved the read errors I mentioned.
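A minimal sketch of that sub-block scheme, just to make the run-joining explicit (MIN_HOLE_SIZE's value, the helper names, and the assumption that the block size is a multiple of MIN_HOLE_SIZE are illustrative here, not taken from the actual diff):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MIN_HOLE_SIZE 4096              /* assumed sub-block size */

/* Returns nonzero if the buffer contains only zero bytes */
static int
chunk_is_zero(const char *buf, size_t len)
{
    return buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
}

/*
 * Write one cache block: punch holes for runs of zero sub-blocks and
 * pwrite() runs of non-zero sub-blocks, merging consecutive sub-blocks
 * of the same kind into a single request.
 */
static int
write_block_sparse(int fd, const char *buf, size_t block_size, off_t offset)
{
    size_t start = 0;

    while (start < block_size) {
        const int zero = chunk_is_zero(buf + start, MIN_HOLE_SIZE);
        size_t end = start + MIN_HOLE_SIZE;

        /* Extend the run while the next sub-block is the same kind */
        while (end < block_size
          && chunk_is_zero(buf + end, MIN_HOLE_SIZE) == zero)
            end += MIN_HOLE_SIZE;

        if (zero) {
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
              offset + (off_t)start, (off_t)(end - start)) == -1)
                return -1;
        } else {
            /* A real version would retry partial writes */
            if (pwrite(fd, buf + start, end - start, offset + (off_t)start)
              != (ssize_t)(end - start))
                return -1;
        }
        start = end;
    }
    return 0;
}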
Just got the error as I predicted:
error reading cached block! Invalid argument
I was running zpool trim to see if it happens.
# strace -f -e trace=fallocate,pread64 -p $(pgrep s3backer)
strace: Process 2803849 attached with 15 threads
[pid 2803853] pread64(3, "\3\0\0\0\0\0\0\200\205\310\273\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288, 11765092352) = 282624
[pid 2803853] pread64(3, "", 241664, 11765374976) = 0
[pid 2803853] pread64(3, "\3\0\0\0\0\0\0\200\205\310\273\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288, 11765092352) = 282624
[pid 2803853] pread64(3, "", 241664, 11765374976) = 0
# ls -l /mnt/dreamhost_cache/cachefile
-rw-r--r-- 1 root root 11765374976 Jan 15 13:21 /mnt/dreamhost_cache/cachefile
So it is reading just at the end of the cache file.
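The numbers line up with the file simply being too short for the last block: 11765092352 + 282624 = 11765374976, exactly the file size shown by ls, and fallocate with FALLOC_FL_KEEP_SIZE never grows the file. A rough sketch of what an "ensure file size" helper could do (the real s3b_dcache_ensure_file_size may differ):

#include <sys/stat.h>
#include <unistd.h>

/* Grow the cache file (sparsely, via ftruncate) so it covers required_size */
static int
ensure_file_size(int fd, off_t required_size)
{
    struct stat sb;

    if (fstat(fd, &sb) == -1)
        return -1;
    if (sb.st_size < required_size && ftruncate(fd, required_size) == -1)
        return -1;
    return 0;
}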
OK - the above comments, except for splitting up writes into sub-ranges, are addressed in 394c842.
I will look at that next.
See if c8343a2 looks right.
I think this feature led to corruption in my use case. If the block cache is on an NFS filesystem, accessing the freshly unallocated blocks leads to I/O errors and corruption. I noticed this by running zpool trim on my filesystem. Also, I don't think this is useful, since:
I suggest making this feature optional and turning it off by default. Or, even better, do the zero-block tracking inside the s3backer block cache with a separate zero-block bitmap or a journal, so block cache I/O could be skipped completely. That would be portable too.
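A rough sketch of what such an in-cache zero-block bitmap could look like (names and layout are purely illustrative, not s3backer's actual cache metadata format):

#include <stdint.h>
#include <stdlib.h>

/* One bit per cached block: set means "known to be all zeroes" */
struct zero_bitmap {
    uint8_t *bits;
    size_t nblocks;
};

static int
zero_bitmap_init(struct zero_bitmap *zb, size_t nblocks)
{
    zb->nblocks = nblocks;
    zb->bits = calloc((nblocks + 7) / 8, 1);
    return zb->bits != NULL ? 0 : -1;
}

static void
zero_bitmap_set(struct zero_bitmap *zb, size_t blkno, int is_zero)
{
    if (is_zero)
        zb->bits[blkno / 8] |= (uint8_t)(1 << (blkno % 8));
    else
        zb->bits[blkno / 8] &= (uint8_t)~(1 << (blkno % 8));
}

static int
zero_bitmap_get(const struct zero_bitmap *zb, size_t blkno)
{
    return (zb->bits[blkno / 8] >> (blkno % 8)) & 1;
}

On write, an all-zero block would just set its bit and skip the cache-file write; on read, a set bit would mean memset the caller's buffer and skip the pread. The bitmap itself would still need to be persisted (or journaled) alongside the existing cache metadata to survive restarts.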
I think this feature led to corruption in my use case. If the block cache is on an NFS filesystem...
Hmm... where do you think this corruption is coming from? If you think it's coming from s3backer, then it shouldn't matter whether NFS is being used or not.
Do you have a minimal reproducible test case, e.g., not involving NFS?
Thanks.
I noticed that while zero blocks are not written to remote storage, they are written to the block cache. This is not optimal. For example, I am using s3backer as one mirror vdev of my ZFS pool, with the cache file on an SSD. Those writes shorten the life of the SSD and slow down the ZFS pool, especially the initial sync (doing zpool attach).
So I would like to punch holes in the cache file using fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, ...). I already made the minimal changes and am testing them; I have a few issues or questions.
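Roughly what I have in mind, in whole-block form (a sketch only; the function name dcache_write_block is hypothetical, not the actual patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write one block to the cache file, punching a hole instead of writing zeroes */
static int
dcache_write_block(int fd, const char *buf, size_t block_size, off_t offset)
{
    const int all_zero = buf[0] == 0
      && memcmp(buf, buf + 1, block_size - 1) == 0;

    if (all_zero)
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
          offset, (off_t)block_size);
    return pwrite(fd, buf, block_size, offset) == (ssize_t)block_size ? 0 : -1;
}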