Rudde / s3backer

Automatically exported from code.google.com/p/s3backer
GNU General Public License v2.0

Writing to s3backer uses too much CPU #5

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. On a small EC2 instance, start s3backer with blockSize=1M. Optionally try
blockCacheWriteDelay>0, although it doesn't solve this specific problem.
2. Format the backing file and mount it. I tested several filesystems: ZFS (FUSE)
caused s3backer to use the most CPU, JFS the least, and Ext2 was in the middle.
3. Using 'dd', create a 100MB test file, initializing it from /dev/zero (or
optionally from /dev/urandom for slightly different results).
4. Copy the 100MB test file to the filesystem mounted on s3backer. Watch the
s3backer process use 100% CPU for more than a minute, while throughput stays quite
modest (1MB/s - 5MB/s, depending on the filesystem). A command sketch of these
steps follows the list.

What is the expected output?

s3backer should be I/O-bound, not CPU-bound; at such low speeds it shouldn't use
100% CPU. The only two CPU-intensive operations it performs are, as far as I know,
MD5 calculation and zero-block checking. Both of them should be considerably faster
than 1-5 MB/s, which makes me think there is a bug somewhere. For example, could
s3backer be calculating the MD5 hash of the entire 1MB block each time a 4096-byte
sector is written?
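
(For scale: if that were happening, copying the 100MB file in 4096-byte writes
would mean roughly 100MB / 4KB = 25,600 writes, each hashing a full 1MB block,
i.e. on the order of 25GB fed through MD5 just to store 100MB.)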

What version of the product are you using? On what operating system?

r277 from SVN. OS is Ubuntu Intrepid (8.10) 32-bit on a small EC2 instance.

Original issue reported on code.google.com by onestone on 29 Sep 2008 at 1:18

GoogleCodeExporter commented 9 years ago
s3backer will certainly MD5 the entire 1M block each time a 4k piece of it is
written by the kernel IF you have blockCacheWriteDelay set to zero.

However, when blockCacheWriteDelay is large enough that all 4k pieces of a 1M block
get written by the kernel before s3backer begins its first attempt to write that
block to Amazon, each 1M block should only be MD5'ed once.
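
(In that case the 100MB test file should cost only about 100 MD5 passes, one per
1MB block, i.e. roughly 100MB hashed in total, rather than the ~25GB implied by
re-hashing the block on every 4k write.)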

Just to confirm: you get the same slowness with e.g. --blockCacheWriteDelay=1000?

What about if you run in test mode? (If running on EC2, don't forget to run it
twice to initialize the EC2 backing store.)

Original comment by archie.c...@gmail.com on 29 Sep 2008 at 1:28

GoogleCodeExporter commented 9 years ago
Yes, I get the 100% CPU usage even with --blockCacheWriteDelay=30000 (30 seconds).

I'll try running it in test mode, but it will probably be the same (because the
problem also occurs with 100MB all-zeroes files, which cause almost no HTTP
traffic).

Original comment by onestone on 29 Sep 2008 at 1:33

GoogleCodeExporter commented 9 years ago
Confirmed with --test too (with an initialized backing store). CPU is at 100% again
while the speeds with JFS are the same (1MB/s - 2MB/s). Here are the options I used:

s3backer --test --size=10g --blockCacheWriteDelay=1000 --blockSize=1m /mnt/temp/s3b /mnt/temp/mnt

Original comment by onestone on 29 Sep 2008 at 1:59

GoogleCodeExporter commented 9 years ago
I did some --test profiling on my local machine (the 100% CPU problem exists on it
too). Most of the CPU time is spent inside ec_protect_write_block(). Here are the
complete results from OProfile:

Profiling through timer interrupt
samples  %        image name               symbol name
175      59.1216  s3backer                 ec_protect_write_block
32       10.8108  s3backer                 test_io_write_block
28        9.4595  s3backer                 fuse_op_write
22        7.4324  [vdso] (tgid:6163 range:0xb80aa000-0xb80ab000) (no symbols)
14        4.7297  s3backer                 s3b_hash_get
13        4.3919  s3backer                 block_cache_write_block
8         2.7027  s3backer                 block_cache_do_read_block
3         1.0135  s3backer                 .plt
1         0.3378  s3backer                 block_cache_worker_main

The test was writing a 100 MB all-zeroes file to a "dirty" Ext2 filesystem on top
of s3backer, which stores its blocks on a JFS partition. If the Ext2 filesystem is
clean, the time spent inside ec_protect_write_block() is closer to 90%. I ran
s3backer with these options:

s3backer --test --size=1g --blockCacheWriteDelay=1000 --blockCacheSize=32 --blockSize=1m /home/pesho/temp/s3b/ /home/pesho/temp/mnt/

Note that the problem still exists if --blockCacheSize is large (>200) on a clean
filesystem.

As I understand it, ec_protect_write_block() is part of the MD5 caching layer. But
as you said, it shouldn't be called so often when --blockCacheWriteDelay>=1000.
So, something seems wrong here.

Original comment by onestone on 29 Sep 2008 at 6:31

GoogleCodeExporter commented 9 years ago
One thing to be careful of with these tests is the MD5 cache (the "ec_protect"
layer).

E.g., if you try to write N blocks quickly, and the size of your MD5 cache is
M < N, then when you write the (M+1)st block s3backer will block until the first
block written expires. This won't happen until "md5CacheTimeout" milliseconds after
it was originally written.

To turn off the ec_protect layer, add these flags: --md5CacheSize=0
--md5CacheTime=0 --minWriteDelay=0.
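
For example, folding those into the test invocation quoted earlier in this thread
(paths are that reporter's and purely illustrative):

s3backer --test --size=1g --blockCacheWriteDelay=1000 --blockCacheSize=32 --blockSize=1m --md5CacheSize=0 --md5CacheTime=0 --minWriteDelay=0 /home/pesho/temp/s3b/ /home/pesho/temp/mnt/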

However, even when I do this there still seems to be some weird slowness... still
investigating.

Original comment by archie.c...@gmail.com on 29 Sep 2008 at 7:57

GoogleCodeExporter commented 9 years ago
Yes, but in this case N < M, because N is about 100 (100 blocks of 1 MB each) and
M is 1000 (the default value). Besides, if the MD5 cache is full, it should sleep
on a pthread mutex or condition variable, not consume CPU cycles.

Original comment by onestone on 29 Sep 2008 at 8:07

GoogleCodeExporter commented 9 years ago
OK, I think I understand what is happening now. It's the large block size that is
causing a bunch of back-and-forth of the data, because the kernel is writing 4k
blocks at a time.

Running the exact same test but changing --blockSize=1m to --blockSize=4k seems to
make the problem go away (more or less) for me... can you confirm?
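
For example, the same command you posted above with just the block size swapped:

s3backer --test --size=1g --blockCacheWriteDelay=1000 --blockCacheSize=32 --blockSize=4k /home/pesho/temp/s3b/ /home/pesho/temp/mnt/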

I think there are some simple ways to handle block size mismatches better than what
we're currently doing. I'll look into it.

Original comment by archie.c...@gmail.com on 29 Sep 2008 at 9:03

GoogleCodeExporter commented 9 years ago
I confirm: no excessive CPU usage with --blockSize=4k. Strangely, the file copying
was still slow, with big pauses for some reason, and 'htop' showed most of the time
being spent in the io-wait state. Perhaps some operations were delayed by the block
cache or the MD5 cache.

Original comment by onestone on 29 Sep 2008 at 9:59

GoogleCodeExporter commented 9 years ago
Does the I/O blocking go away if you set the write delay to, say, 50ms? IIRC, I
maxed out a fiber line at 800 KB/s - 1 MB/s with a write delay between 25 and 50 ms
and 20 cache threads on version 1.2.0-RC1. I think my load average was just over
0.1, FWIW. If it's over 1, you're creating writes faster than they can be serviced,
I believe.

You may also get a little more mileage by adjusting your kernel's
dirty_background_ratio to shift more of the work onto the pdflush processes. By
default, once about 10% of RAM is dirty, the process creating the writes has to use
its own time slice to write dirty blocks to disk.
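
For instance, the threshold can be lowered at runtime (the value 5 here is just an
illustration, not a recommendation):

sysctl -w vm.dirty_background_ratio=5
# or equivalently: echo 5 > /proc/sys/vm/dirty_background_ratio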

I'm wondering if you've got every cache thread sleeping until its next allowed 4K
block write (FS level) to the same s3backer block.

I've found that I have to add more threads at write delays higher than ~250ms,
IIRC.

Original comment by scott.lo...@gmail.com on 4 Oct 2008 at 2:51

GoogleCodeExporter commented 9 years ago
@scott: No, the I/O blocking doesn't go away. It seems you are talking about a
different problem; you need a large blockSize in order to observe the high CPU
utilization.

@archie: Does s3backer currently set FUSE's '-o direct_io' option? If it doesn't,
shouldn't the kernel be performing write-ahead caching and passing larger chunks to
s3backer, thus lessening the problem?

Original comment by onestone on 11 Oct 2008 at 3:52

GoogleCodeExporter commented 9 years ago
As described in the man page, s3backer does not set 'direct_io' automatically, but
you can set it yourself by adding it to the command line.
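
For example, taking the test invocation from earlier in this thread and appending
the standard FUSE option (mount points illustrative):

s3backer --test --size=10g --blockSize=1m -o direct_io /mnt/temp/s3b /mnt/temp/mnt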

As for write-ahead caching, the kernel always writes in chunks of 4096 bytes as far
as I can tell. I believe this is simply a side effect of the fact that VM pages are
4096 bytes and your filesystem is mounted via a loopback mount, so the thing the
kernel is writing, when it sends FUSE something to write, is a page of a file in
the file cache.

I have a patch that should help with this problem; it should be ready to commit in
a couple of days.

Original comment by archie.c...@gmail.com on 11 Oct 2008 at 4:07

GoogleCodeExporter commented 9 years ago
This issue should be fixed in r297. Please give it a whirl.

Original comment by archie.c...@gmail.com on 13 Oct 2008 at 6:01

GoogleCodeExporter commented 9 years ago
Confirming as fixed. S3backer is amazingly fast now. Thanks :)

Original comment by onestone on 13 Oct 2008 at 6:15

GoogleCodeExporter commented 9 years ago

Original comment by archie.c...@gmail.com on 23 Oct 2008 at 4:42