archiecobbs / s3backer

FUSE/NBD single file backing store via Amazon S3

Cache bandwidth much lower in version 1.6.x than in 1.5.6 #185

Closed jeromerobert closed 2 years ago

jeromerobert commented 2 years ago

I think I have a performance issue: the write bandwidth of my s3backer file is very slow. It's the same with s3backer 1.6.1 and current master.

It's mounted like this:

s3backer -f --debug --prefix=myprefix/ --storageClass=INTELLIGENT_TIERING \
 --blockCacheFile=/var/cache/s3backer --blockCacheWriteDelay=60000 \
 --blockCacheSize=4000 --size=50G --accessEC2IAM=xxx --region=eu-west-1 --no-vhost --ssl \
 --blockSize=4M mybucket /mnt/s3b

I run the following performance test:

$ dd if=/dev/zero of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 1.89873 s, 2.2 MB/s

My cache write delay is long and the cache file is big enough, so, as expected, --debug shows that there is no S3 access during the test. The bandwidth remains very low even if I increase count. With the -d flag (which enables FUSE logs) I get this:

   write[0] 4096 bytes to 0
   unique: 10, success, outsize: 24
unique: 11, opcode: WRITE (16), nodeid: 2, insize: 4176, pid: 21801
...
   write[0] 4096 bytes to 4186112
   unique: 2075, success, outsize: 24
unique: 2076, opcode: WRITE (16), nodeid: 2, insize: 4176, pid: 21805
write[0] 4096 bytes to 4190208 flags: 0x8001
   write[0] 4096 bytes to 4190208
   unique: 2076, success, outsize: 24
unique: 2077, opcode: FLUSH (25), nodeid: 2, insize: 64, pid: 21805
flush[0]
   unique: 2077, success, outsize: 16
unique: 2078, opcode: RELEASE (18), nodeid: 2, insize: 64, pid: 0
release[0] flags: 0x8001
   unique: 2078, success, outsize: 16

I guess that this is expected because my cache file system has a 4kiB block size.
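
For reference, a quick way to confirm that guess about the cache file system's block size is a statvfs call. This is just a standalone check, nothing to do with s3backer itself:

/* check_bsize.c - print the block size of the filesystem holding the cache file */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs st;

    if (statvfs("/var/cache", &st) == -1) {     /* path on the cache file's filesystem */
        perror("statvfs");
        return 1;
    }
    printf("block size: %lu bytes\n", (unsigned long)st.f_bsize);
    return 0;
}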

If I increase the dd count to 100 to make the test longer, I can see that s3backer is CPU bound (100% in top) and perf top -p PID shows that it spends most of its time in copy_user_generic_string (a kernel symbol).

Without the FUSE/s3backer layer, the performance is:

$ dd if=/dev/urandom of=/var/cache/file bs=4M count=2
2+0 records in
2+0 records out
8388608 bytes (8.4 MB) copied, 0.173998 s, 48.2 MB/s

Any idea how I should track this down? I can build with debug symbols, add tracing, etc.

This might be related to #158, but since I'm not sure, I preferred to open a new issue.

jeromerobert commented 2 years ago

Oops. With version 1.5.6:

# dd if=/dev/zero of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.166752 s, 25.2 MB/s

So yes, this is a duplicate of #158.

jeromerobert commented 2 years ago

--listBlocks improves things a bit, but it's still not as good as 1.5.6 (this is with master):

# dd if=/dev/zero of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.964937 s, 4.3 MB/s

jeromerobert commented 2 years ago

This is with Amazon Linux 2.

archiecobbs commented 2 years ago

Please review #158 for some suggestions for narrowing this down.

A few notes...

jeromerobert commented 2 years ago

OK, --directIO does fix the issue with 4 MiB blocks, but performance with file systems on top (XFS or ext4) is still bad. I guess that's because they use much smaller blocks. Here are tests which show the influence of block size. I ran s3backer --erase before each test (so the non-zero block survey is almost instant):

1.5.6 nodirectIO

# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.0701948 s, 59.8 MB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.150693 s, 27.8 MB/s

1.5.6 directIO

# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.0454611 s, 92.3 MB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0711782 s, 58.9 MB/s

master nodirectIO listBlocks

# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 1.97789 s, 2.1 MB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 1.86789 s, 2.2 MB/s

master directIO listBlocks

# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.142277 s, 29.5 MB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=256k count=16
16+0 records in
16+0 records out
4194304 bytes (4.2 MB) copied, 0.0971339 s, 43.2 MB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=64k count=64
64+0 records in
64+0 records out
4194304 bytes (4.2 MB) copied, 0.313716 s, 13.4 MB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=16k count=256
256+0 records in
256+0 records out
4194304 bytes (4.2 MB) copied, 0.531904 s, 7.9 MB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 1.71244 s, 2.4 MB/s

So my understanding is that to get decent performance with 1.6.x I should rerun mkfs.xfs with a block size greater than 64 kiB. Is that expected?

archiecobbs commented 2 years ago

OK next question... are we trying to test the bandwidth of the network or the bandwidth of the cache file here?

Are you still configuring with --blockCacheFile=/var/cache/s3backer --blockCacheWriteDelay=60000 --blockCacheSize=4000 during these tests?

If so, then one thing to do would be to repeat the tests without using the block cache at all (i.e., --blockCacheSize=0) to see if the block cache is the source of the apparent slowdown.

jeromerobert commented 2 years ago

OK next question... are we trying to test the bandwidth of the network or the bandwidth of the cache file here?

Bandwidth of the cache only.

Are you still configuring with --blockCacheFile=/var/cache/s3backer --blockCacheWriteDelay=60000 --blockCacheSize=4000 during these tests?

Yes

If so, then one thing to do would be to repeat the tests without using the block cache at all (i.e., --blockCacheSize=0) to see if the block cache is the source of the apparent slowdown.

1.6.x is much faster if we do not consider caching:

1.5.6 directIO nocache

# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 20.8271 s, 201 kB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=32k count=128
128+0 records in
128+0 records out
4194304 bytes (4.2 MB) copied, 90.1539 s, 46.5 kB/s

master directIO nocache listBlocks

# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 7.52153 s, 558 kB/s

# dd if=/dev/urandom of=/mnt/s3b/file bs=32k count=128
128+0 records in
128+0 records out
4194304 bytes (4.2 MB) copied, 28.1459 s, 149 kB/s

archiecobbs commented 2 years ago

Another dumb question... are you running these two tests on the same machine, where the cache files are stored on the same physical disk?

jeromerobert commented 2 years ago

Yes, all tests ran on the same EC2 instance and the cache file was on the root EBS in all cases.

jeromerobert commented 2 years ago

If I disable the zero_cache_store in s3b_config.c, the performance is good. I can also see that without zero_cache_store, a dd with 4 kiB writes calls block_cache_write_block_part in the block_cache_store. With zero_cache_store enabled, a dd with 4 kiB writes calls block_cache_write_block (without the _part) in the block_cache_store.

jeromerobert commented 2 years ago

The performance issue comes from block_part_write_block_part. It reads the whole block, modifies it, and writes it back. That's probably the right approach between the block cache layer and the S3 layer, but when used between the zero_cache layer and the block_cache it leads to a lot of useless I/O.
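
In other words, the partial-write path looks roughly like this. This is a simplified sketch with made-up names and types, not the actual s3backer code:

#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct layer {                          /* stand-in for an s3backer I/O layer */
    int (*read_block)(struct layer *l, size_t block_num, void *dst);
    int (*write_block)(struct layer *l, size_t block_num, const void *src);
    size_t block_size;                  /* e.g. 4 MiB */
};

/* Writing `len` bytes at offset `off` within one block costs a full block
 * read plus a full block write, no matter how small `len` is. */
static int
write_block_part_sketch(struct layer *lower, size_t block_num,
    size_t off, size_t len, const void *src)
{
    char *buf = malloc(lower->block_size);
    int r;

    if (buf == NULL)
        return ENOMEM;
    r = (*lower->read_block)(lower, block_num, buf);        /* read whole block */
    if (r == 0) {
        memcpy(buf + off, src, len);                        /* patch the small piece */
        r = (*lower->write_block)(lower, block_num, buf);   /* write whole block back */
    }
    free(buf);
    return r;
}

When the layer below is the block cache on a local disk, that is still a lot of copying, and when the kernel sends 4 kiB writes it happens once per 4 kiB.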

archiecobbs commented 2 years ago

Ah, that makes sense. Fortunately that problem should be gone now after commit d1bce951, which simplifies the partial block writing logic. Can you try again with the latest version from the master branch?

jeromerobert commented 2 years ago

Sadly, this is still as slow.

archiecobbs commented 2 years ago

Sadly, this is still as slow.

Hmm, now that I think about it, yes it will still be slow, and for the same reason as before. Sorry for the distraction.

Basically the zero_cache is a new layer created to fix #139. It definitely improves performance in some cases but hurts performance in the scenario you're testing.

I'll need to think about how to address this downside...

archiecobbs commented 2 years ago

Just to recap... we have this:

s3backer --blockSize=4M ...

and this:

# dd if=/dev/urandom of=/mnt/s3b/file bs=4M count=1

and this:

...
unique: 2076, opcode: WRITE (16), nodeid: 2, insize: 4176, pid: 21805
write[0] 4096 bytes to 4190208 flags: 0x8001
   write[0] 4096 bytes to 4190208
   unique: 2076, success, outsize: 24
...

So the kernel is writing 4K chunks, but your s3backer block size is 4M, so that means each kernel write is amplified by a factor of 1024.

In this test, in effect what's happening is:

  1. Read the first 4M block from the block cache
  2. Patch 4K of it with new data
  3. Write it back to the cache
  4. Repeat 1023 more times
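
Roughly: 1024 partial writes × (4 MiB read + 4 MiB write) ≈ 8 GiB copied through memory just to store 4 MiB of new data, which lines up with all the time spent in copy_user_generic_string.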

So we can try to tweak the performance of the code, but the code is not the real problem... instead, it's what we're asking the code to do that is the real problem.

In the older code there was no zero_cache, so the block cache was the top layer of the stack (just under FUSE) and it was able to "patch" data directly into the cache without copying it around. This was understandably much faster.

Now the zero_cache handles partial writes above the block cache, so they have to be done in the dumb "copy, patch, and write back" fashion, which is much slower. Previously, an under-appreciated part of the block cache's job was to absorb some of the thrashing that occurs when the kernel performs a bunch of partial-block writes.

Probably the right way to fix this is to just merge the zero cache logic into the block cache.
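
For comparison, patching a block that is already resident in the cache is just a memcpy plus a dirty flag. Here is a minimal sketch of that idea, with made-up types, and not the actual fix:

#include <stddef.h>
#include <string.h>

struct cache_entry {
    char *data;                     /* block_size bytes, or NULL if not resident */
    int dirty;                      /* needs write-back after the write delay */
};

struct cache {
    struct cache_entry *entries;    /* one slot per block, for simplicity */
    size_t block_size;
};

/* If the block is resident, a partial write touches only `len` bytes. */
static int
cache_write_part_sketch(struct cache *c, size_t block_num,
    size_t off, size_t len, const void *src)
{
    struct cache_entry *e = &c->entries[block_num];

    if (e->data == NULL)
        return -1;                  /* miss: the block must be loaded first */
    memcpy(e->data + off, src, len);
    e->dirty = 1;                   /* flushed later, honoring --blockCacheWriteDelay */
    return 0;
}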

archiecobbs commented 2 years ago

OK - hopefully this should be fixed in d949846. Please try another test.

DEvmIb commented 2 years ago

works like a charm now from my side :)

jeromerobert commented 2 years ago

On my side too:

# dd if=/dev/urandom of=/mnt/s3b/file bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0771566 s, 54.4 MB/s

archiecobbs commented 2 years ago

Great, thanks guys for all the help with reports & testing.