Closed: jeromerobert closed this issue 2 years ago.
Oops. With version 1.5.6:
# dd if=/dev/zero of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.166752 s, 25.2 MB/s
So yes, this is a duplicate of #158.
--listBlock improves things a bit, but it's still not as good as 1.5.6 (this is with master):
# dd if=/dev/zero of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.964937 s, 4.3 MB/s
This is with Amazon Linux 2.
Please review #158 for some suggestions for narrowing this down.
A few notes... In the -d (FUSE debug) output I see write[0] 4096 bytes to 0, which indicates the kernel is writing 4K blocks rather than 4M blocks. That is going to result in fragmentation effects, which muddies the picture. I think (?) --directIO may fix this.

OK, --directIO does fix the issue with 4MiB blocks. But performance with file systems (XFS or ext4) on top is still bad; I guess that's because they use much smaller blocks. Here are tests showing the influence of block size. I ran s3backer --erase before each test (so the non-zero block survey is almost instant):
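The block-size sweep below can be scripted. This is a hedged sketch: TARGET is a placeholder that defaults to a throwaway local file so the script runs anywhere; pointing it at the s3backer mount (e.g. /mnt/s3b/file) would reproduce the kind of measurements shown here.

```shell
#!/bin/sh
# Sweep dd block sizes, writing the same 4 MiB each pass, and print the
# throughput line dd reports. TARGET is a placeholder: it defaults to a
# temp file so the sketch runs anywhere; set TARGET=/mnt/s3b/file to
# exercise the s3backer mount instead.
TARGET="${TARGET:-$(mktemp)}"
TOTAL=4194304                                  # 4 MiB per pass
for BS in 4194304 262144 65536 16384 4096; do  # 4M, 256k, 64k, 16k, 4k
    COUNT=$((TOTAL / BS))
    dd if=/dev/urandom of="$TARGET" bs="$BS" count="$COUNT" 2>&1 | tail -n 1
done
```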
# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.0701948 s, 59.8 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.150693 s, 27.8 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.0454611 s, 92.3 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0711782 s, 58.9 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 1.97789 s, 2.1 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 1.86789 s, 2.2 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.142277 s, 29.5 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=256k count=16
16+0 records in
16+0 records out
4194304 bytes (4.2 MB) copied, 0.0971339 s, 43.2 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=64k count=64
64+0 records in
64+0 records out
4194304 bytes (4.2 MB) copied, 0.313716 s, 13.4 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=16k count=256
256+0 records in
256+0 records out
4194304 bytes (4.2 MB) copied, 0.531904 s, 7.9 MB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 1.71244 s, 2.4 MB/s
So my understanding is that to get decent performance with 1.6.x I should rerun mkfs.xfs with a block size greater than 64kiB. Is that expected?
OK, next question... are we trying to test the bandwidth of the network or the bandwidth of the cache file here?

Are you still configuring with --blockCacheFile=/var/cache/s3backer --blockCacheWriteDelay=60000 --blockCacheSize=4000 during these tests?

If so, then one thing to do would be to repeat the tests without using the block cache at all (i.e., --blockCacheSize=0) to see if the block cache is the source of the apparent slowdown.
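For concreteness, the A/B comparison might look like this (the bucket name and mount point are placeholders; the flags are the ones quoted above):

```shell
# With the block cache, as in the tests above (bucket/mountpoint hypothetical):
s3backer --blockSize=4M --directIO \
    --blockCacheFile=/var/cache/s3backer \
    --blockCacheWriteDelay=60000 --blockCacheSize=4000 \
    mybucket /mnt/s3b

# Same mount with the block cache disabled entirely:
s3backer --blockSize=4M --directIO --blockCacheSize=0 mybucket /mnt/s3b
```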
OK next question... are we trying to test the bandwidth of the network or the bandwidth of the cache file here?
Bandwidth of the cache only.
Are you still configuring with --blockCacheFile=/var/cache/s3backer --blockCacheWriteDelay=60000 --blockCacheSize=4000 during these tests?
Yes
If so then one thing to do would be repeat the tests without using the block cache at all (i.e., --blockCacheSize=0) to see if the block cache is the source of the apparent slowdown.
1.6.x is much faster if we do not consider caching:
1.5.6 directIO nocache
# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 20.8271 s, 201 kB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=32k count=128
128+0 records in
128+0 records out
4194304 bytes (4.2 MB) copied, 90.1539 s, 46.5 kB/s
master directIO nocache listblock
# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 7.52153 s, 558 kB/s
# dd if=/dev/urandom of=/mnt/s3b/file bs=32k count=128
128+0 records in
128+0 records out
4194304 bytes (4.2 MB) copied, 28.1459 s, 149 kB/s
Another dumb question... are you running these two tests on the same machine, where the cache files are stored on the same physical disk?
Yes, all tests ran on the same EC2 instance and the cache file was on the root EBS in all cases.
If I disable the zero_cache_store in s3b_config.c, the performance is good. Also, I can see that without zero_cache_store, a dd with 4kiB blocks calls block_cache_write_block_part in the block_cache_store. With zero_cache_store enabled, a dd with 4kiB blocks calls block_cache_write_block (without the _part) in the block_cache_store.

The performance issue comes from block_cache_write_block_part. It reads the whole block, modifies it, and writes it back. That's probably the right approach between the block cache layer and the S3 layer, but when used between the zero_cache layer and the block_cache layer it leads to a lot of useless I/O.
Ah, that makes sense. Fortunately, that problem should be gone now after commit d1bce951, which simplifies the partial block writing logic. Can you try again with the latest version from the master branch?
Sadly, this is still as slow.
Sadly, this is still as slow.
Hmm, now that I think about it, yes it will still be slow, and for the same reason as before. Sorry for the distraction.
Basically, the zero_cache is a new layer created to fix #139. It definitely improves performance in some cases but hurts it in the scenario you're testing.
I'll need to think about how to address this downside...
Just to recap... we have this:
s3backer --blockSize=4M ...
and this:
# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
and this:
...
unique: 2076, opcode: WRITE (16), nodeid: 2, insize: 4176, pid: 21805
write[0] 4096 bytes to 4190208 flags: 0x8001
write[0] 4096 bytes to 4190208
unique: 2076, success, outsize: 24
...
So the kernel is writing 4K chunks, but your s3backer block size is 4M, so that means each kernel write is amplified by a factor of 1024.
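The factor quoted here is just the block-size ratio; a trivial check, with the sizes taken from the mount options and the -d log above:

```shell
# Each 4 KiB kernel write forces a full 4 MiB block write-back, so the
# write amplification is blockSize / writeSize.
BLOCK_SIZE=$((4 * 1024 * 1024))   # --blockSize=4M
WRITE_SIZE=4096                   # "write[0] 4096 bytes" in the -d log
FACTOR=$((BLOCK_SIZE / WRITE_SIZE))
echo "$FACTOR"
```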
In this test, in effect what's happening is this: each 4K kernel write forces s3backer to read the whole 4M block, patch in the 4K, and write the whole 4M block back.
So we can try to tweak performance of the code, but the code is not the real problem... instead, it's what we're asking the code to do that is the real problem.
In the older code there was no zero_cache, so the block cache was the top layer of the stack (just under FUSE), and it was able to "patch" data directly into the cache without copying it around. This was understandably much faster.

Now, with the zero_cache, partial writes are handled above the block cache, so they have to be done in the dumb "copy, patch, and write back" fashion, which is much slower. Previously, an under-appreciated part of the block cache's job was to absorb some of the thrashing that occurs when the kernel performs a bunch of partial-block writes.
Probably the right way to fix this is to just merge the zero cache logic into the block cache.
OK - hopefully this should be fixed in d949846. Please try another test.
works like a charm now from my side :)
On my side too:
# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0771566 s, 54.4 MB/s
Great - thanks, guys, for all the help with reports & testing.
I think I have a performance issue because the write bandwidth of my s3backer file is very slow. This is the same with s3backer 1.6.1 and current master.
It's mounted like this:
I do the following performance test.
My cache write delay is long and the cache file is big enough, so, as expected, --debug shows that there is no S3 access during the test. The bandwidth remains very low even if I increase count. With the -d flag (which enables the FUSE logs) I see 4kiB writes; I guess that is expected because my cache file system has a 4kiB block size.
If I increase the dd count to 100 to make the test longer, I can see that s3backer is CPU bound (100% in top), and perf top -p PID shows that it spends most of its time in copy_user_generic_string (a kernel symbol). Without the fuse/s3backer layer, the performance is:
Any idea how I should track this down? I can build with debug, add tracing, etc.
This might be related to #158, but as I'm not sure, I preferred to open a new issue.