facebookarchive / flashcache

A general purpose, write-back block cache for Linux.
GNU General Public License v2.0

poor performance of KVM on flashcache #132

Closed pubyun closed 11 years ago

pubyun commented 11 years ago

i created a flashcache on md1 and an ssd, but the load and IO wait of the host are high.

status shows the read and write hit rates are low.

the OS is CentOS 6.4.

if i run fio on the same production system under high load, it gives me good performance.

cat /etc/redhat-release

CentOS release 6.4 (Final)

uname -a

Linux h172-16-0-9 2.6.32-358.14.1.el6.x86_64 #1 SMP Sat Jul 20 19:01:27 CST 2013 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/flashcache/flashcache_version

Flashcache Version : flashcache-2.0

dmsetup table

fc_md1: 0 3902860928 flashcache conf:
    ssd dev (/dev/sdc), disk dev (/dev/md1) cache mode(WRITE_BACK)
    capacity(114026M), associativity(512), data block size(4K) metadata block size(4096b)
    skip sequential thresh(512K)
    total blocks(29190656), cached blocks(2815563), cache percent(9)
    dirty blocks(619861), dirty percent(2)
    nr_queued(0)

dmsetup status

fc_md1: 0 3902860928 flashcache stats:
    reads(88096439), writes(272709908)
    read hits(2527409), read hit percent(2)
    write hits(28634512) write hit percent(10)
    dirty write hits(24087157) dirty write hit percent(8)
    replacement(2388), write replacement(85007)
    write invalidates(1916721), read invalidates(331602)
    pending enqueues(659098), pending inval(598604)
    metadata dirties(8930851), metadata cleans(8315979)
    metadata batch(11116725) metadata ssd writes(6130101)
    cleanings(8315975) fallow cleanings(347315)
    no room(21) front merge(6843701) back merge(451844)
    disk reads(85624857), disk writes(248067847) ssd reads(10843323) ssd writes(39773632)
    uncached reads(84999380), uncached writes(239751926), uncached IO requeue(5137)
    disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)
    uncached sequential reads(419488), uncached sequential writes(63376799)
    pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)

sysctl -a | grep flashcache

dev.flashcache.sdc+md1.io_latency_hist = 0
dev.flashcache.sdc+md1.do_sync = 0
dev.flashcache.sdc+md1.stop_sync = 0
dev.flashcache.sdc+md1.dirty_thresh_pct = 30
dev.flashcache.sdc+md1.max_clean_ios_total = 4
dev.flashcache.sdc+md1.max_clean_ios_set = 2
dev.flashcache.sdc+md1.do_pid_expiry = 0
dev.flashcache.sdc+md1.max_pids = 100
dev.flashcache.sdc+md1.pid_expiry_secs = 60
dev.flashcache.sdc+md1.reclaim_policy = 1
dev.flashcache.sdc+md1.zero_stats = 0
dev.flashcache.sdc+md1.fast_remove = 0
dev.flashcache.sdc+md1.cache_all = 1
dev.flashcache.sdc+md1.fallow_clean_speed = 2
dev.flashcache.sdc+md1.fallow_delay = 900
dev.flashcache.sdc+md1.skip_seq_thresh_kb = 512

iostat -xkN 5

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
sdc        0.00    0.00   22.20   18.80   88.80   75.20      8.00      0.01   0.30   0.12    0.48
sdb      153.60   68.60   24.00   73.20  349.20  313.70     13.64      0.64   6.73   4.63   45.02
sda        0.00   74.60    0.00   67.00    0.00  313.70      9.36      4.06  60.90  14.93  100.00
md1        0.00    0.00  177.60  123.00  349.20  266.20      4.09      0.00   0.00   0.00    0.00

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/home/kvm/instances001.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>

i run fio:

fio -filename=/home/kvm/test.iso -iodepth=64 -ioengine=libaio -direct=1 -rw=randread -bs=4k -size=5G -numjobs=64 -runtime=20 -group_reporting -name=test-rand-read

test-rand-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
...
test-rand-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 64 processes
Jobs: 64 (f=64): [rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr] [2.6% done] [151.8M/0K/0K /s] [38.9K/0 /0 iops] [eta 13m:57s]
test-rand-read: (groupid=0, jobs=64): err= 0: pid=8411: Sun Jul 28 11:09:46 2013
  read : io=8764.3MB, bw=421936KB/s, iops=105484, runt= 21270msec
    slat (usec): min=3, max=133025, avg=567.80, stdev=4786.41
    clat (usec): min=0, max=4819.2K, avg=38066.10, stdev=82945.67
     lat (usec): min=4, max=4819.2K, avg=38634.18, stdev=83316.49
    clat percentiles (usec):
     |  1.00th=[    270],  5.00th=[    370], 10.00th=[    438], 20.00th=[    548],
     | 30.00th=[    684], 40.00th=[   1320], 50.00th=[  37632], 60.00th=[  40192],
     | 70.00th=[  42752], 80.00th=[  46848], 90.00th=[  87552], 95.00th=[ 121344],
     | 99.00th=[ 166912], 99.50th=[ 197632], 99.90th=[1286144], 99.95th=[1531904],
     | 99.99th=[2768896]
    bw (KB/s)  : min=  507, max=14088, per=1.61%, avg=6808.31, stdev=4171.33
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.02%
    lat (usec) : 100=0.02%, 250=0.57%, 500=14.67%, 750=17.86%, 1000=5.08%
    lat (msec) : 2=2.91%, 4=0.22%, 10=0.20%, 20=0.16%, 50=40.89%
    lat (msec) : 100=10.60%, 250=6.50%, 500=0.05%, 750=0.02%, 1000=0.02%
    lat (msec) : 2000=0.18%, >=2000=0.03%
  cpu          : usr=0.40%, sys=2.23%, ctx=32528, majf=0, minf=5851
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=2243647/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=8764.3MB, aggrb=421936KB/s, minb=421936KB/s, maxb=421936KB/s, mint=21270msec, maxt=21270msec

Disk stats (read/write):
    dm-6: ios=776769/1100, merge=0/0, ticks=7857780/789095, in_queue=9088783, util=100.00%, aggrios=776777/2930, aggrmerge=0/0, aggrticks=8264371/833165, aggrin_queue=9104612, aggrutil=100.00%
  dm-0: ios=776777/2930, merge=0/0, ticks=8264371/833165, in_queue=9104612, util=100.00%, aggrios=388410/3080, aggrmerge=1719/45, aggrticks=1574723/1890, aggrin_queue=1576267, aggrutil=94.36%
  md1: ios=7631/2568, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3299/1339, aggrmerge=511/1192, aggrticks=575079/114131, aggrin_queue=689327, aggrutil=99.83%
    sdb: ios=3973/1387, merge=688/1132, ticks=264205/30142, in_queue=294624, util=58.97%
    sda: ios=2626/1291, merge=334/1253, ticks=885953/198120, in_queue=1084030, util=99.83%
  sdc: ios=769190/3593, merge=3439/90, ticks=3149447/3781, in_queue=3152535, util=94.36%

fio -filename=/home/kvm/test.iso -iodepth=64 -ioengine=libaio -direct=1 -rw=randwrite -bs=4k -size=5G -numjobs=64 -runtime=20 -group_reporting -name=test-rand-write

test-rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
...
test-rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 64 processes
Jobs: 3 (f=3): [_w__w____w__] [1.8% done] [0K/34393K/0K /s] [0 /8598 /0 iops] [eta 20m:19s]
test-rand-write: (groupid=0, jobs=64): err= 0: pid=8944: Sun Jul 28 11:10:55 2013
  write: io=2106.9MB, bw=102891KB/s, iops=25722, runt= 20968msec
    slat (usec): min=8, max=144353, avg=1262.01, stdev=7815.16
    clat (usec): min=1, max=9287.5K, avg=152480.13, stdev=205294.09
     lat (usec): min=57, max=9287.5K, avg=153742.87, stdev=205287.22
    clat percentiles (usec):
     |  1.00th=[     37],  5.00th=[     68], 10.00th=[   3824], 20.00th=[  66048],
     | 30.00th=[  77312], 40.00th=[  81408], 50.00th=[  96768], 60.00th=[ 119296],
     | 70.00th=[ 146432], 80.00th=[ 181248], 90.00th=[ 378880], 95.00th=[ 552960],
     | 99.00th=[ 675840], 99.50th=[ 757760], 99.90th=[1564672], 99.95th=[3555328],
     | 99.99th=[6520832]
    bw (KB/s)  : min=    7, max= 4536, per=1.62%, avg=1668.32, stdev=719.30
    lat (usec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=2.85%, 100=3.67%
    lat (usec) : 250=1.06%, 500=0.66%, 750=0.24%, 1000=0.17%
    lat (msec) : 2=0.52%, 4=0.92%, 10=1.91%, 20=1.60%, 50=4.01%
    lat (msec) : 100=33.23%, 250=33.03%, 500=9.79%, 750=5.79%, 1000=0.25%
    lat (msec) : 2000=0.21%, >=2000=0.08%
  cpu          : usr=0.27%, sys=1.64%, ctx=532549, majf=0, minf=1757
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=539357/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=2106.9MB, aggrb=102891KB/s, minb=102891KB/s, maxb=102891KB/s, mint=20968msec, maxt=20968msec

Disk stats (read/write):
    dm-6: ios=621/541050, merge=0/0, ticks=34875/11477976, in_queue=11597827, util=100.00%, aggrios=621/556542, aggrmerge=0/0, aggrticks=34873/12636219, aggrin_queue=12754158, aggrutil=100.00%
  dm-0: ios=621/556542, merge=0/0, ticks=34873/12636219, in_queue=12754158, util=100.00%, aggrios=781/345480, aggrmerge=233/1828, aggrticks=223/869199, aggrin_queue=869039, aggrutil=89.63%
  md1: ios=1069/9700, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=98/3299, aggrmerge=436/6352, aggrticks=2997/198338, aggrin_queue=201479, aggrutil=93.94%
    sdb: ios=162/3355, merge=609/6303, ticks=1439/128383, in_queue=129639, util=50.57%
    sda: ios=35/3244, merge=263/6402, ticks=4556/268293, in_queue=273319, util=93.94%
  sdc: ios=494/681261, merge=466/3656, ticks=447/1738398, in_queue=1738078, util=89.63%

and when i run fio, iostat shows:

iostat -xkN 5

Device:  rrqm/s  wrqm/s    r/s       w/s   rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc       40.40   95.40  32.80  31193.60  292.80  125156.00      8.03     55.64   1.79   0.03  92.28
sdb       11.60  147.00   1.20    204.80   25.60    1247.00     12.36     12.61  61.09   2.22  45.80
sda        0.00  150.00   0.00    191.20    0.00    1198.90     12.54     22.94  95.66   4.16  79.62
md1        0.00    0.00  12.80    356.80   25.60    1251.70      6.91      0.00   0.00   0.00   0.00

pubyun commented 11 years ago

and we have some KVM guests on ubuntu 12.04 that work fine. read and write hits are high.

dmsetup table

kvms: 0 2515083264 flashcache conf:
    ssd dev (/dev/sdc), disk dev (/dev/vg0/kvm) cache mode(WRITE_BACK)
    capacity(114026M), associativity(512), data block size(4K) metadata block size(4096b)
    skip sequential thresh(512K)
    total blocks(29190656), cached blocks(20572713), cache percent(70)
    dirty blocks(33318), dirty percent(0)
    nr_queued(0)

dmsetup status

kvms: 0 2515083264 flashcache stats:
    reads(9206002712), writes(4831478048)
    read hits(7991443995), read hit percent(86)
    write hits(3211363145) write hit percent(66)
    dirty write hits(1607865974) dirty write hit percent(33)
    replacement(268233182), write replacement(284900996)
    write invalidates(369567314), read invalidates(22945861)
    pending enqueues(32908547), pending inval(30687086)
    metadata dirties(2058104090), metadata cleans(2058110448)
    metadata batch(3222910314) metadata ssd writes(893304088)
    cleanings(2058107859) fallow cleanings(58166990)
    no room(3060419) front merge(1569469517) back merge(322274621)
    disk reads(1249800943), disk writes(3225295944) ssd reads(10048630266) ssd writes(5070401573)
    uncached reads(738581956), uncached writes(1167402012), uncached IO requeue(142)
    disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)
    uncached sequential reads(268882222), uncached sequential writes(795524924)
    pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)

dev.flashcache.sdc+kvm.do_sync = 0
dev.flashcache.sdc+kvm.stop_sync = 0
dev.flashcache.sdc+kvm.dirty_thresh_pct = 30
dev.flashcache.sdc+kvm.max_clean_ios_total = 4
dev.flashcache.sdc+kvm.max_clean_ios_set = 2
dev.flashcache.sdc+kvm.do_pid_expiry = 0
dev.flashcache.sdc+kvm.max_pids = 100
dev.flashcache.sdc+kvm.pid_expiry_secs = 60
dev.flashcache.sdc+kvm.reclaim_policy = 1
dev.flashcache.sdc+kvm.zero_stats = 0
dev.flashcache.sdc+kvm.fast_remove = 0
dev.flashcache.sdc+kvm.cache_all = 1
dev.flashcache.sdc+kvm.fallow_clean_speed = 2
dev.flashcache.sdc+kvm.fallow_delay = 900
dev.flashcache.sdc+kvm.skip_seq_thresh_kb = 512

garethbult commented 11 years ago

From memory, this won't work the way you want it to. You need to create a block device (I use LVM) and then put flashcache on the block device; trying to cache a filesystem like this doesn't fly.

Once you've sorted that:

a. in your grub config, add "elevator=noop"; this should double your throughput.
b. in your QEMU config, add cache="writeback" and io="native" to the driver element.

You should expect to get around 75% of native SSD speed inside your KVM instance.
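For illustration, the disk XML posted above would then look roughly like this; cache and io are standard libvirt driver attributes, but treat the exact combination as an untested suggestion rather than a verified fix:

<disk type='file' device='disk'>
  <!-- cache='writeback': let the host page cache absorb guest writes -->
  <!-- io='native': use Linux AIO instead of a thread pool -->
  <driver name='qemu' type='qcow2' cache='writeback' io='native'/>
  <source file='/home/kvm/instances001.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>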

pubyun commented 11 years ago

@garethbult, we are running openstack, so we use qcow2 instead of LVM. testing with fio on both a file and a raw device shows high throughput and IOPS.

most guest OS are Windows, so we can't set "elevator=noop". and it's dangerous to set cache="writeback" in the QEMU config.

we have another server with the same hardware config running ubuntu 12.04, and it works fine. the difference is that most of its guest OS are Linux.

mohans commented 11 years ago

Can you email me the "Size Hist" at the end of the dmsetup table output for the case where the performance is bad?

Nearly all your reads and writes are uncached. One reason for that might be that incoming IOs are being split into chunks smaller than 4KB; "Size Hist" will tell us that. Also, about 20% of your writes are being detected as sequential and therefore left uncached; this is probably OK and what you want.
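A quick way to grab just that histogram (a minimal sketch, using the fc_md1 device name from earlier in the thread):

# dump the flashcache config and keep only the request-size histogram line
dmsetup table fc_md1 | grep -o 'Size Hist:.*'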



garethbult commented 11 years ago

Mmm, you're using Windows and complaining performance is poor, the mind boggles ... ;)

If you don't set writeback, performance is likely to be far worse than you (or I) might expect; your choice. If you want to use QCOW2 (I do), I'd recommend using NBD: export your QCOW2 image with qemu-nbd, connect to it with nbd-client, then attach your flashcache to the nbd device (a rough sketch follows). This works fairly well, although from a management perspective it's a little tricky. I initially wrote a bunch of shell scripts to manage this for mass production; it sort of worked OK, but it's still not ideal.
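A minimal sketch of that chain, using the SSD /dev/sdc and qcow2 path from this thread; the cache name fc_nbd0 is hypothetical, and the exact qemu-nbd/nbd-client flags should be checked against your versions:

modprobe nbd
# serve the qcow2 image over NBD on a local TCP port
qemu-nbd -t -p 10809 /home/kvm/instances001.qcow2 &
# attach the export to a kernel nbd block device
nbd-client localhost 10809 /dev/nbd0
# create a write-back flashcache on top of the nbd device
flashcache_create -p back fc_nbd0 /dev/sdc /dev/nbd0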

Hopefully there will soon be a product available to do all this properly for KVM, ideally with network RAID10, compression, sparse storage, an LFU cache, etc. :)

(if you look in detail at writeback, and then at how the Linux page cache works, my impression and experience is that there is very little, if any, "additional" risk to using writeback)

pubyun commented 11 years ago

here is the data. how can i tune the system?

system with bad performance:

dmsetup table

fc_md1: 0 3902860928 flashcache conf:
    ssd dev (/dev/sdc), disk dev (/dev/md1) cache mode(WRITE_BACK)
    capacity(114026M), associativity(512), data block size(4K) metadata block size(4096b)
    skip sequential thresh(512K)
    total blocks(29190656), cached blocks(2953995), cache percent(10)
    dirty blocks(2203), dirty percent(0)
    nr_queued(0)
Size Hist: 512:134485435 1024:13527429 1536:8480106 2048:3336358 2560:6835273 3072:12769809 3584:127742570 4096:118668580

Size (bytes)  Count        Percent
512           134485435    31.58%
1024          13527429      3.18%
1536          8480106       1.99%
2048          3336358       0.78%
2560          6835273       1.61%
3072          12769809      3.00%
3584          127742570    30.00%
4096          118668580    27.87%

system with good performance:

dmsetup table

kvms: 0 2515083264 flashcache conf:
    ssd dev (/dev/sdc), disk dev (/dev/CentOS/kvm) cache mode(WRITE_BACK)
    capacity(114026M), associativity(512), data block size(4K) metadata block size(4096b)
    skip sequential thresh(512K)
    total blocks(29190656), cached blocks(19771545), cache percent(67)
    dirty blocks(29220), dirty percent(0)
    nr_queued(0)
Size Hist: 512:234035511 1024:75332725 1536:74596710 2048:71343806 2560:73538973 3072:71947783 3584:223519365 4096:13261865088

Size (bytes)  Count          Percent
512           234035511       1.66%
1024          75332725        0.53%
1536          74596710        0.53%
2048          71343806        0.51%
2560          73538973        0.52%
3072          71947783        0.51%
3584          223519365       1.59%
4096          13261865088    94.15%

mohans commented 11 years ago

Flashcache will only cache IOs that are exactly 4KB.

If you look at the system with the bad performance, notice that only 28% of all IOs are 4KB. 72% of all IOs are smaller, so all of those are uncached.

On the system with good performance, on the other hand, nearly all IOs are 4KB.

There could be any number of reasons why IOs coming into flashcache are broken up like this. The first thing I'd check is whether the start of the filesystem is aligned on a 4KB boundary.
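One way to check that from the host (a minimal sketch, assuming 512 B logical sectors, so a start sector divisible by 8 sits on a 4 KiB boundary):

# print each partition's start sector and whether it is 4 KiB aligned
for p in /sys/block/sda/sda*/start; do
    s=$(cat "$p")
    if [ $((s % 8)) -eq 0 ]; then echo "$p: $s aligned"; else echo "$p: $s NOT aligned"; fi
done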



garethbult commented 11 years ago

Most [99.9+%] NBD requests are multiples of 4KB, which I guess is why it works well with FlashCache ... :)

pubyun commented 11 years ago

i found many documents about disk partition alignment, but no clue how to handle it.

how can i check if my partitions are aligned correctly? and how can i realign them if they're wrong?

i use a kickstart file to partition the CentOS system:

/usr/sbin/parted -s -- /dev/$drive1 mklabel gpt
/usr/sbin/parted -s -- /dev/$drive2 mklabel gpt
/usr/sbin/parted -s -- /dev/$drive1 unit MB mkpart primary 1 5
/usr/sbin/parted -s -- /dev/$drive2 unit MB mkpart primary 1 5
/usr/sbin/parted -s -- /dev/$drive1 set 1 bios_grub on
/usr/sbin/parted -s -- /dev/$drive2 set 1 bios_grub on
/usr/sbin/parted -s -- /dev/$drive1 unit MB mkpart primary 5 2000
/usr/sbin/parted -s -- /dev/$drive2 unit MB mkpart primary 5 2000
/usr/sbin/parted -s -- /dev/$drive1 set 2 boot on
/usr/sbin/parted -s -- /dev/$drive2 set 2 boot on
/usr/sbin/parted -s -- /dev/$drive1 unit MB mkpart primary 2000 -0
/usr/sbin/parted -s -- /dev/$drive2 unit MB mkpart primary 2000 -0
/usr/sbin/parted -s -- /dev/$drive1 set 3 raid on
/usr/sbin/parted -s -- /dev/$drive2 set 3 raid on
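One caveat worth noting here: parted's MB unit is 10^6 bytes, so boundaries like 5 MB or 2000 MB do not necessarily fall on 4 KiB multiples. A hedged variant of the data-partition lines using power-of-two MiB units (untested, same $drive variables as above):

# MiB boundaries are exact multiples of 4 KiB; -a optimal requests aligned placement
/usr/sbin/parted -s -a optimal -- /dev/$drive1 mkpart primary 2048MiB 100%
/usr/sbin/parted -s -a optimal -- /dev/$drive2 mkpart primary 2048MiB 100%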

parted /dev/sda

GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit s
(parted) p
Model: ATA WDC WD2003FYYS-0 (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start     End          Size         File system  Name     Flags
 1      2048s     10239s       8192s                     primary
 2      10240s    3905535s     3895296s     ext4         primary  boot
 3      3905536s  3907028991s  3903123456s               primary  raid

is this disk aligned to a 4KB boundary?

then i have an md array on /dev/sda3 and /dev/sdb3:

cat /proc/mdstat

Personalities : [raid1]
md0 : active raid1 sda2[0] sdb2[1]
      1947584 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sda3[0] sdb3[1]
      1951430464 blocks super 1.1 [2/2] [UU]
      bitmap: 4/15 pages [16KB], 65536KB chunk

i have a flashcache device on md1:

dmsetup table

vg0-test: 0 209715200 linear 253:0 2525366272
vg0-home: 0 62914560 linear 253:0 29362176
vg0-home: 62914560 146800640 linear 253:0 2378565632
fc_md1: 0 3902860928 flashcache conf:
    ssd dev (/dev/sdc), disk dev (/dev/md1) cache mode(WRITE_BACK)
    capacity(114026M), associativity(512), data block size(4K) metadata block size(4096b)
    skip sequential thresh(512K)
    total blocks(29190656), cached blocks(2953082), cache percent(10)
    dirty blocks(1378), dirty percent(0)
    nr_queued(0)
Size Hist: 512:139363574 1024:14066635 1536:8798657 2048:3482262 2560:7094672 3072:13270678 3584:132310668 4096:122190926

then i create an LVM volume group and logical volumes:

pvdisplay

--- Physical volume ---
PV Name               /dev/mapper/fc_md1
VG Name               vg0
PV Size               1.82 TiB / not usable 29.81 MiB
Allocatable           yes
PE Size               32.00 MiB
Total PE              59552
Free PE               17818
Allocated PE          41734
PV UUID               ftvUj9-1SeY-69so-jqyI-hoAN-A0hE-nIGYet
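It may also be worth confirming that the LVM data area itself starts on a 4 KiB boundary on top of fc_md1; a minimal sketch using standard LVM2 reporting (pe_start in 512 B sectors should be divisible by 8):

# show where the first physical extent starts, in sectors
pvs -o +pe_start --units s /dev/mapper/fc_md1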

lvdisplay

--- Logical volume ---
LV Path                /dev/vg0/swap
LV Name                swap
VG Name                vg0
LV UUID                fmuFtp-NiV5-KNNC-G7i1-L2hf-gUAL-nQIgri
LV Write Access        read/write
LV Creation host, time install.bitcomm.cn, 2013-06-21 19:36:00 +0800
LV Status              available
# open                 1
LV Size                2.00 GiB
Current LE             64
Segments               1
Allocation             inherit
Read ahead sectors     auto

and i put all the KVM qcow2 files on the device /dev/vg0/nova

pubyun commented 11 years ago

i run the align-check command of parted on CentOS, and it gives me no message:

parted /dev/sda align-check opt 1

if i run it on Ubuntu, it gives me a "1 aligned" message:

parted /dev/sda align-check opt 1

1 aligned

the partition table type on the ubuntu system is msdos instead of gpt.

parted /dev/sda

GNU Parted 2.3
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit s
(parted) p
Model: ATA WDC WD2003FYYS-0 (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start    End          Size         Type     File system  Flags
 1      2048s    499711s      497664s      primary               raid
 2      499712s  3907028991s  3906529280s  primary               raid

mohans commented 11 years ago

If you are running the MSDOS filesystem on this, it is likely that the filesystem is issuing most of its IOs as < 4KB. If that is the case, flashcache cannot really do anything about it :(



pubyun commented 11 years ago

@mohans, thanks.

i created a qcow2 image, partitioned it with linux parted first, and then installed windows on it. it works, and the cache hit rate is OK now (roughly the workflow sketched below).
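A minimal sketch of that workflow; the image path and size are hypothetical, and while the qemu-nbd and parted invocations are standard, treat the sequence as illustrative rather than the exact commands used:

qemu-img create -f qcow2 /home/kvm/win.qcow2 40G               # hypothetical image and size
modprobe nbd max_part=8
qemu-nbd -c /dev/nbd0 /home/kvm/win.qcow2                      # expose the image as a block device
parted -s /dev/nbd0 mklabel msdos
parted -s -a optimal /dev/nbd0 mkpart primary ntfs 1MiB 100%   # 1 MiB start = 4 KiB aligned
parted /dev/nbd0 align-check opt 1
qemu-nbd -d /dev/nbd0                                          # detach, then install windows onto it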

ixp2xxx commented 8 years ago

what's the command to 4K-align a partition inside a qcow2 image with parted?