[enhancement] zero_cache should write a full block when writing into all-zero block, thus avoiding read-modify-write cycle

xmrk-btc commented 1 year ago

When zero_cache is doing write_block_part into all-zero block, it passes the request to block cache, which possibly does not have the full block, so it needs to read it from remote storage. Which is unnecessary, zero_cache can reconstruct the whole block, and avoid the reading and delay.

A little test: zero the first 200 block (blocksize of s3backer is 64k) $ dd if=/dev/zero of=mount/file bs=64k count=200

Write into the middle of 2nd block and see the log file

$ dd if=/dev/urandom of=mount/file bs=1k count=1 seek=100
$ journalctl -f -t s3backer
Jan 15 16:13:12 ubuntu s3backer[3679793]: GET https://objects-us-east-1.dream.io/test3/00000001
Jan 15 16:13:12 ubuntu s3backer[3679793]: rec'd 404 response: GET https://objects-us-east-1.dream.io/test3/00000001
Jan 15 16:13:12 ubuntu s3backer[3679793]: PUT https://objects-us-east-1.dream.io/test3/00000001
Jan 15 16:13:13 ubuntu s3backer[3679793]: success: PUT https://objects-us-east-1.dream.io/test3/00000001

And I want to get rid of that GET request. (Sometimes it does not appear, probably because I did various tests on the same bucket, so there may be stale data which get DELETEd and cached in block_cache.)

The real-world motivation here is latency - when resilvering ZFS after implementing sub-block hole punching, the resilvering was still slow (like 500 kB/s), without much disk activity, and I saw http_zero_blocks_read stat incrementing by 1 or 2 per second. So I guess the latency of the read-modify-write cycle was killing the performance.

archiecobbs commented 1 year ago

Good idea! Should be fixed in e0d21a9.

Regarding locking, this is not a problem. The race condition is between two filesystem threads accessing the same s3backer block at the same time; that's now absorbed at the top layer; see d949846e.

archiecobbs commented 1 year ago

On second thought, I think that this idea is not actually safe.

Instead the safer (and simpler) fix is just to eliminate the function zero_cache_write_block_part(). Then the function block_part_write_block_part() will perform a read-modify-write, but the read will be intercepted by the zero cache, and therefore not result in any network traffic.

Updated in bb65ec2.

archiecobbs commented 1 year ago

Thinking even more, I'm still a little unsure about this. I think my second patch, while safe from race conditions, might also bring back the write amplification problem.

This needs more thought. E.g., consolidation of the block cache and the zero cache into a single entity, etc.

archiecobbs / s3backer

[enhancement] zero_cache should write a full block when writing into all-zero block, thus avoiding read-modify-write cycle #201