s3backer will certainly MD5 the entire 1M block each time the kernel writes a 4k piece of it IF you have blockCacheWriteDelay set to zero. However, when blockCacheWriteDelay is large enough that all the 4k pieces of a 1M block are written by the kernel before s3backer makes its first attempt to write that block to Amazon, each 1M block should only be MD5'ed once.
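To illustrate the idea (just a rough sketch, not the actual block cache code; the struct and function names here are made up for the example): with a non-zero write delay the cache entry only accumulates the kernel's 4k writes, and the MD5 is computed a single time when the block is finally flushed:

#include <string.h>
#include <openssl/md5.h>

#define BLOCK_SIZE      (1024 * 1024)   /* s3backer block size (--blockSize=1m) */

struct cache_entry {
    unsigned char data[BLOCK_SIZE];
    int dirty;
};

/* Called for every 4k piece the kernel writes: just copy it in, no digest. */
static void
cache_write(struct cache_entry *entry, size_t off, const void *buf, size_t len)
{
    memcpy(entry->data + off, buf, len);
    entry->dirty = 1;
}

/* Called once, after blockCacheWriteDelay has elapsed: digest the whole
 * 1M block a single time before uploading it. */
static void
cache_flush(struct cache_entry *entry, unsigned char md5[MD5_DIGEST_LENGTH])
{
    MD5(entry->data, BLOCK_SIZE, md5);
    entry->dirty = 0;
}

With blockCacheWriteDelay=0 the flush (and therefore the MD5 over the full 1M block) would instead happen once per 4k write.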
Just to confirm: you get the same slowness with, e.g., --blockCacheWriteDelay=1000?
What if you run it in test mode? (If running on EC2, don't forget to run it twice to initialize the EC2 backing store.)
Original comment by archie.c...@gmail.com
on 29 Sep 2008 at 1:28
Yes, I get the 100% CPU usage even with --blockCacheWriteDelay=30000 (30 seconds).
I'll try running it in test mode, but it will probably be the same (because the problem also occurs with 100MB all-zeroes files, which cause almost no HTTP traffic).
Original comment by onestone
on 29 Sep 2008 at 1:33
Confirmed with --test too (with an initialized backing store). CPU is at 100% again, while the speeds with JFS are the same (1 MB/s - 2 MB/s). Here are the options I used:
s3backer --test --size=10g --blockCacheWriteDelay=1000 --blockSize=1m /mnt/temp/s3b /mnt/temp/mnt
Original comment by onestone
on 29 Sep 2008 at 1:59
I did some --test profiling on my local machine (the 100% CPU problem exists on it too). Most of the CPU time is spent inside ec_protect_write_block(). Here are the complete results from OProfile:
Profiling through timer interrupt
samples % image name symbol name
175 59.1216 s3backer ec_protect_write_block
32 10.8108 s3backer test_io_write_block
28 9.4595 s3backer fuse_op_write
22 7.4324 [vdso] (tgid:6163 range:0xb80aa000-0xb80ab000) (no symbols)
14 4.7297 s3backer s3b_hash_get
13 4.3919 s3backer block_cache_write_block
8 2.7027 s3backer block_cache_do_read_block
3 1.0135 s3backer .plt
1 0.3378 s3backer block_cache_worker_main
The test was writing a 100 MB all-zeroes file to a "dirty" Ext2 filesystem on top of s3backer storing its blocks on a JFS partition. If the Ext2 filesystem is clean, the time spent inside ec_protect_write_block() is closer to 90%. I ran s3backer with these options:
s3backer --test --size=1g --blockCacheWriteDelay=1000 --blockCacheSize=32 --blockSize=1m /home/pesho/temp/s3b/ /home/pesho/temp/mnt/
Note that the problem still exists if --blockCacheSize is large (>200) on a clean filesystem.
As I understand it, ec_protect_write_block() is part of the MD5 caching layer. But as you said, it shouldn't be called so often when --blockCacheWriteDelay>=1000. So something seems wrong here.
Original comment by onestone
on 29 Sep 2008 at 6:31
One thing to be careful of with these tests is the MD5 cache (the "ec_protect") layer. E.g., if you try to write N blocks quickly, and the size of your MD5 cache is M < N, then when you write the (M+1)th block s3backer will block until the first block written expires. This won't happen until "md5CacheTimeout" milliseconds after it was originally written.
To turn off the ec_protect layer, add these flags: --md5CacheSize=0 --md5CacheTime=0 --minWriteDelay=0.
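For example, your test invocation from above with the ec_protect layer disabled would look something like this (same paths and flags as your earlier test, untested here, just combining the options):
s3backer --test --size=1g --blockCacheWriteDelay=1000 --blockCacheSize=32 --blockSize=1m --md5CacheSize=0 --md5CacheTime=0 --minWriteDelay=0 /home/pesho/temp/s3b/ /home/pesho/temp/mnt/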
However, even when I do this there still seems to be some weird slowness... still investigating.
Original comment by archie.c...@gmail.com
on 29 Sep 2008 at 7:57
Yes, but in this case N < M, because N is about 100 (100 blocks of 1 MB each) and M is 1000 (the default value). Besides, if the MD5 cache is full, it should sleep on a pthread mutex or condition variable, not consume CPU cycles.
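Something along these lines is what I would expect the "cache full" path to look like (just a sketch; the struct and function names are made up, I haven't checked the actual s3backer code for this):

#include <pthread.h>

struct md5_cache {
    pthread_mutex_t mutex;
    pthread_cond_t  space_avail;    /* signaled when an entry expires */
    unsigned int    num_entries;
    unsigned int    max_entries;    /* md5CacheSize */
};

/* A writer that finds the cache full should sleep here, not spin;
 * whatever expires old entries would signal space_avail. */
static void
md5_cache_wait_for_space(struct md5_cache *cache)
{
    pthread_mutex_lock(&cache->mutex);
    while (cache->num_entries >= cache->max_entries)
        pthread_cond_wait(&cache->space_avail, &cache->mutex);  /* no CPU used while waiting */
    cache->num_entries++;
    pthread_mutex_unlock(&cache->mutex);
}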
Original comment by onestone
on 29 Sep 2008 at 8:07
OK, I think I understand what is happening now. It's the large block size that is causing a lot of shuffling of data back and forth, because the kernel is writing 4k blocks at a time.
Running the exact same test but changing --blockSize=1m to --blockSize=4k seems to make the problem go away (more or less) for me... can you confirm?
I think there are some simple ways to handle block size mismatches better than what we're currently doing. I'll look into it.
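To make the mismatch concrete, here's a rough sketch of the naive path (not the actual s3backer code; the names and the in-memory "store" are just for illustration). With a 1m block size, every 4k write from the kernel costs a pass over the whole 1m block, so filling one block means roughly 256 full-block copies:

#include <string.h>

#define S3B_BLOCK_SIZE  (1024 * 1024)   /* --blockSize=1m */

/* Stand-in for the underlying 1m block held by the lower layer. */
static unsigned char stored_block[S3B_BLOCK_SIZE];

/* Naive handling of one 4k write from the kernel. */
static void
write_4k_piece(size_t off, const void *buf, size_t len)
{
    unsigned char block[S3B_BLOCK_SIZE];

    memcpy(block, stored_block, S3B_BLOCK_SIZE);    /* "read" the whole 1m block */
    memcpy(block + off, buf, len);                  /* patch in the 4k piece */
    memcpy(stored_block, block, S3B_BLOCK_SIZE);    /* "write" the whole 1m block back */
    /* ...plus any per-block work (e.g. an MD5) that also scans all 1m bytes */
}

With --blockSize=4k each kernel write maps to exactly one s3backer block, so none of that extra copying happens.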
Original comment by archie.c...@gmail.com
on 29 Sep 2008 at 9:03
I confirm, no excessive CPU usage with --blockSize=4k. Strangely, the file copy was still slow, with big pauses for some reason, and 'htop' showed most of the time being spent in the io-wait state. Perhaps some operations were delayed by the block cache or the md5 cache.
Original comment by onestone
on 29 Sep 2008 at 9:59
Does the I/O blocking go away if you set the write delay to, say, 50ms? IIRC, I maxed out a fiber line at 800 KB/s - 1 MB/s with a write delay between 25 and 50 ms and 20 cache threads on version 1.2.0-RC1. I think my loadavg was just over 0.1, FWIW. If you're over 1, you're creating writes faster than they can be serviced, I believe.
You may also get a little more mileage by adjusting your kernel's dirty_background_ratio to shift more of the work onto the pdflush processes. By default, after about 10% of RAM is dirty, the process creating the writes has to use its own time slice to write dirty blocks to disk.
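For example, lowering it from the default of 10 makes pdflush kick in earlier; something like this should do it (the value 2 is just an illustration, tune to taste):
sysctl -w vm.dirty_background_ratio=2
or equivalently:
echo 2 > /proc/sys/vm/dirty_background_ratio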
I'm wondering if you've got every cache thread sleeping until its next allowed 4K block write (FS level) to the same S3Backer block.
I've found that I have to add more threads at write delays higher than ~250ms, IIRC.
Original comment by scott.lo...@gmail.com
on 4 Oct 2008 at 2:51
@scott: No, the I/O blocking doesn't go away. It seems you are talking about a different problem. You need to have a large blockSize in order to observe the high CPU utilization.
@archie: Does s3backer currently set FUSE's '-o direct_io' option? If it doesn't, shouldn't the kernel be performing write-ahead caching and passing larger chunks to s3backer, thus lessening the problem?
Original comment by onestone
on 11 Oct 2008 at 3:52
As described in the man page, s3backer does not set 'direct_io' automatically, but you can set it yourself by adding it to the command line.
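For example, adding the FUSE option to your earlier test command would look something like this (untested here, but it should be accepted like any other FUSE option):
s3backer --test --size=1g --blockSize=1m -o direct_io /home/pesho/temp/s3b/ /home/pesho/temp/mnt/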
As for write-ahead caching, the kernel always writes in chunks of 4096 bytes as far as I can tell. This, I believe, is simply a side effect of the fact that VM pages are 4096 bytes and your filesystem is mounted via a loopback mount; therefore, the thing the kernel is writing when it sends FUSE a write request is a page of a file in the file cache.
I have a patch that should help this problem, and should be ready to commit it in a couple of days.
Original comment by archie.c...@gmail.com
on 11 Oct 2008 at 4:07
This issue should be fixed in r297. Please give it a whirl.
Original comment by archie.c...@gmail.com
on 13 Oct 2008 at 6:01
Confirming as fixed. S3backer is amazingly fast now. Thanks :)
Original comment by onestone
on 13 Oct 2008 at 6:15
Original comment by archie.c...@gmail.com
on 23 Oct 2008 at 4:42
Original issue reported on code.google.com by onestone
on 29 Sep 2008 at 1:18