andrijapanic-dont-use-this-one closed this issue 5 years ago
@andrijapanic Hi!
You don't say which filesystem you are doing all this on, but a filesystem is well within its rights to fall back to buffered I/O on Linux. O_DIRECT on Linux is more of a hint - just because the I/O was accepted, you don't know for sure whether it will really be sent without buffering. In fact, we can see the telltale signs of buffered I/O in the output you correctly chose to provide:
lat (usec): min=2, max=2424.0k, avg=6176.00, stdev=32874.65
Look at that minimum latency of 2 microseconds - is that realistic for your setup? Also see this:
clat percentiles (nsec):
| 1.00th=[ 740], 5.00th=[ 788], 10.00th=[ 812],
| 20.00th=[ 836], 30.00th=[ 860], 40.00th=[ 900],
| 50.00th=[ 956], 60.00th=[ 1848], 70.00th=[ 6112],
At least 70% of I/Os completed in under 7000 nanoseconds (7 microseconds). Is this result obtainable on your hardware if the I/Os are actually going down to the real disk?
Hi there,
Thanks for this hint. I believe I mentioned it's an ext4 filesystem on this partition (mounted with -o nobarrier) - and of course the latency you mentioned is not realistic.
But again, with iostat I see barely 32MB/s, and before doing any reads I drop the caches (as per the commands above) - so I don't see how it's possible (though I might be wrong...) that any data is being read from a buffer (OS RAM in this case).
When I increased the test file size to 5TB (across 25TB of net space...), I was getting 500K read IOPS...
Any hints on what else to check?
FYI - I always use ext4 and direct=1, and have never seen this behavior before...
thx
@andrijapanic Wait a moment - are you writing anything into your file before reading it back? And/or are you dealing with SSDs? This is important because if you are reading back "unwritten" data your storage stack can choose to take all sorts of shortcuts. Try doing a write job to say 100G then reading just that 100G back...
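The write-then-read suggestion above can be expressed as a two-phase job file sketch (job names, file name, and sizes are illustrative, not from this report); the stonewall makes the read job wait until the fill job has finished:

```ini
; Hypothetical two-phase job file: fill the file first, then read
; back only that same region with O_DIRECT random reads.
[global]
filename=fio.tmp
size=100g
bs=4k
ioengine=libaio
direct=1

[fill]
rw=write
iodepth=32

[readback]
stonewall
rw=randread
iodepth=128
```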
No, these are HDDs: 14 x 4TB in RAID10 (software RAID, dm-raid). When you start a benchmark (read or write), fio first lays down a test file (writing those 100GB takes time), and only then does it start the benchmark job. I ran the script's read/write jobs in both random and sequential tests. On the next run, the same read "issues" appeared again (and random write and sequential write had been done before that anyway)...
Hmm, looks like you've got that angle covered off :-)
OK, I'm going to fall back to my standby of "perhaps your filesystem is buffering behind your back". Ted Ts'o has stated that a filesystem can choose to do this (see https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1279956.html ). Is your ext4 filesystem set to use special journalling options? Also, if you can afford to DESTROY YOUR FILESYSTEM, can you see whether the results look more sane going straight to your md device?
If it were an empty-device condition, vmstat/iostat would still show the same performance metrics as fio. For cases like this, where fio thinks it's doing far more IOPS than the device is, it is usually caching. What you are observing is that fio really is doing that many IOPS - there's no lying there - but for some reason the backend device(s) are seeing far fewer. Caching would explain that, or I/O merging. The latter would be observable in the disk stats output, however, and is usually only a real factor when you are doing sequential I/O.
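The device-level counters that iostat (and fio's own "Disk stats" section) report come from /proc/diskstats, where the reads-merged field makes block-layer merging visible. A minimal parsing sketch (the sample line is invented for illustration, loosely modeled on the sda line further down this thread):

```python
# Sketch: spot I/O merging at the block layer from a /proc/diskstats line.
# Fields after major/minor/name: reads completed, reads merged,
# sectors read, ms spent reading, then the write-side equivalents.
sample = "   8       0 sda 153002 14 1224016 250008 0 0 0 0 0 180000 250008"

fields = sample.split()
reads_completed = int(fields[3])
reads_merged = int(fields[4])

# If fio's reported IOPS far exceed the growth of these counters on the
# device, the extra I/Os were satisfied by a cache, not merged away.
print(reads_completed, reads_merged)  # 153002 14
```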
I will try to reproduce; I have to reinstall the whole system anyway... The problem is I've never had issues like this before - but again, let me reproduce it on a brand-new OS setup...
thx
Besides that, I find the output confusing with regard to the following 2 lines:
read: IOPS=82.8k, BW=324MiB/s (339MB/s)(19.0GiB/60144msec) and iops : min=14713, max=24863, avg=20825.56, stdev=1576.35, samples=480
One result is a total calculation (the total amount of I/O or IOPS done divided by time) and the other is produced by averaging samples. See things like the bw section of https://fio.readthedocs.io/en/latest/fio_doc.html#interpreting-the-output for another explanation.
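A quick sanity check on why the two figures needn't match (numbers reused from the output quoted above; it assumes the "iops :" samples are per-job when numjobs=4 is combined with group_reporting, which is my reading of the linked docs rather than something fio prints explicitly):

```python
# Sketch: reconciling fio's two IOPS figures from the output above.
total_iops = 82.8e3            # "read: IOPS=82.8k" - total I/O / wall time
per_job_sample_avg = 20825.56  # "iops : ... avg=20825.56" over 480 samples
numjobs = 4                    # 480 samples = 4 jobs x ~120 one-second samples

# If the samples are per-job, scaling by the job count should roughly
# recover the group total:
print(per_job_sample_avg * numjobs)  # ~83302, close to 82.8k
```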
Thanks for the clarification - I've just read the link you sent. In that example, the counted/calculated IOPS and the sampled average iops are equal (88 IOPS), while my calculated value is 82K and the sampled one is 20K - quite a difference. Does this make sense? Sorry if this is a silly question...
@andrijapanic In regard to the different stat results, I think you might be seeing the same issue mentioned over in https://github.com/axboe/fio/issues/519#issuecomment-360721747 ...
So... I reinstalled the OS from scratch: identical kernel and fio versions, same ext4 fs. The only difference is that this time I made the mdraid on top of the same set of disks, but on the raw disks (no partitions); the rest is the same (ext4 on the mdraid10 device, etc.) - and I can NOT reproduce this anymore...
I'm not sure what was happening previously...
Is there any possibility that previously (due to my forcefully killing fio processes) the test file (100GB) was not fully written/prepared/laid out, just allocated on the FS, and that this later led to fio reading from it (during the benchmark) with such high values?
Yes, that's a good point: if you have a sparse file you'd see exactly this kind of behavior - most of the I/Os would just return zeroes, and only the subset that didn't land in a hole would actually need to hit the disk.
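The holes-read-as-zeroes behavior is easy to demonstrate outside fio; a minimal Python sketch (file name and size are arbitrary):

```python
import os

# Sketch: reads from a hole in a sparse file return zeroes without any
# disk I/O, which is what can inflate apparent read performance.
path = "sparse.tmp"
size = 10 * 1024 * 1024  # 10 MiB logical size

with open(path, "wb") as f:
    f.truncate(size)  # extend the file without writing: one big hole

st = os.stat(path)
# On most filesystems st_blocks stays at (or near) zero for a pure hole.
print(st.st_size, st.st_blocks)

with open(path, "rb") as f:
    data = f.read(4096)
print(data == b"\x00" * 4096)  # holes read back as zeroes

os.remove(path)
```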
Confirmed the issue artificially:
Jobs: 4 (f=4): [r(4)][2.0%][r=3146MiB/s,w=0KiB/s][r=805k,w=0 IOPS][eta 09m:49s]
Although iostat/iotop shows zero I/O (since the file is 100% empty, versus the previous case where some parts of the file had actually been written by the fio process...)
Well, my apologies for wasting your time here :) I thought I was going mad...
Thanks a lot for the clarification.
@andrijapanic I think that's what I was trying to get at when I wrote "are you writing anything into your file before reading it back? [...] Try doing a write job to say 100G then reading just that 100G back" - I assumed you were writing to all 100G :-)
At any rate, I'm glad you got to the bottom of the issue! I'll close it.
@andrijapanic Oops I got fooled by an old open tab but I'll leave my longer answer here for posterity.
You're correct that the problem is due to unwritten data being read back, but it's not due to what we usually think of as a sparse file...
Some background: fio only writes to every part of a file during pre-layout for certain types of jobs, when it knows the file doesn't exist or isn't big enough (in which case it can do an extending allocation). If your file doesn't exist AND you do a random write job AND you interrupt that random write job before it has written to every region of the file, THEN you can end up with unwritten parts of the file (though technically the file is not sparse)...
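A back-of-envelope sketch of how little of the file an interrupted random-write job may have touched (the write count here is an invented example, not taken from this report):

```python
# Sketch: expected fraction of a file's blocks ever written by k
# uniformly random 4k writes into an n-block file: 1 - (1 - 1/n)**k.
n = 100 * 1024**3 // 4096   # 100 GiB file in 4 KiB blocks
k = 2_000_000               # writes completed before the job was killed

covered = 1 - (1 - 1 / n) ** k
print(f"{covered:.1%}")  # ~7.3% written; the rest reads back as zeroes
```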
OK so let's check some stuff:
$ # Sparse file
$ rm -f fio.tmp; dd if=/dev/zero of=fio.tmp bs=1 count=0 seek=10G
0+0 records in
0+0 records out
0 bytes copied, 0.000222564 s, 0.0 kB/s
$ echo "---"; du -h fio.tmp; echo "--"; stat fio.tmp | grep Block; echo "--"; filefrag -e fio.tmp | head; echo "---"
---
0 fio.tmp
--
Size: 10737418240 Blocks: 0 IO Block: 4096 regular file
--
Filesystem type is: 58465342
File size of fio.tmp is 10737418240 (2621440 blocks of 4096 bytes)
fio.tmp: 0 extents found
---
$ # Fully written file
$ rm fio.tmp; ./fio --name=pureread --iodepth=128 --bs=4k --size=10G --ioengine=libaio --filename=fio.tmp --direct=1 --runtime=2 --time_based --rw=randread
pureread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.12-20-g4cf3
Starting 1 process
pureread: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [r(1)][-.-%][r=300MiB/s][r=76.8k IOPS][eta 00m:00s]
pureread: (groupid=0, jobs=1): err= 0: pid=19933: Mon Dec 10 06:24:50 2018
read: IOPS=76.4k, BW=299MiB/s (313MB/s)(598MiB/2002msec)
slat (usec): min=2, max=320, avg= 6.65, stdev= 4.84
clat (usec): min=264, max=4248, avg=1665.54, stdev=711.49
lat (usec): min=271, max=4256, avg=1672.38, stdev=711.51
clat percentiles (usec):
| 1.00th=[ 437], 5.00th=[ 553], 10.00th=[ 685], 20.00th=[ 938],
| 30.00th=[ 1172], 40.00th=[ 1418], 50.00th=[ 1663], 60.00th=[ 1909],
| 70.00th=[ 2147], 80.00th=[ 2376], 90.00th=[ 2638], 95.00th=[ 2769],
| 99.00th=[ 2966], 99.50th=[ 3064], 99.90th=[ 3195], 99.95th=[ 3261],
| 99.99th=[ 3458]
bw ( KiB/s): min=302200, max=309648, per=100.00%, avg=305778.00, stdev=3318.86, samples=4
iops : min=75550, max=77412, avg=76444.50, stdev=829.72, samples=4
lat (usec) : 500=2.84%, 750=9.80%, 1000=9.96%
lat (msec) : 2=41.14%, 4=36.27%, 10=0.01%
cpu : usr=10.24%, sys=61.72%, ctx=17382, majf=0, minf=141
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=153016,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=299MiB/s (313MB/s), 299MiB/s-299MiB/s (313MB/s-313MB/s), io=598MiB (627MB), run=2002-2002msec
Disk stats (read/write):
dm-1: ios=151064/0, merge=0/0, ticks=246612/0, in_queue=246632, util=95.04%, aggrios=153002/0, aggrmerge=14/0, aggrticks=250008/0, aggrin_queue=249984, aggrutil=92.98%
sda: ios=153002/0, merge=14/0, ticks=250008/0, in_queue=249984, util=92.98%
$ echo "---"; du -h fio.tmp; echo "--"; stat fio.tmp | grep Block; echo "--"; filefrag -e fio.tmp | head; echo "---"
---
10G fio.tmp
--
Size: 10737418240 Blocks: 20971520 IO Block: 4096 regular file
--
Filesystem type is: 58465342
File size of fio.tmp is 10737418240 (2621440 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 2097150: 9844404.. 11941554: 2097151:
1: 2097151.. 2621439: 11941555.. 12465843: 524289: last,eof
fio.tmp: 1 extent found
---
$ # Partially written file (fio write job is cut short)
$ rm fio.tmp; ./fio --name=partialwrite --iodepth=128 --bs=4k --size=10G --ioengine=libaio --filename=fio.tmp --direct=1 --runtime=2 --time_based --rw=randwrite
partialwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.12-20-g4cf3
Starting 1 process
partialwrite: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1)
partialwrite: (groupid=0, jobs=1): err= 0: pid=20436: Mon Dec 10 06:27:22 2018
write: IOPS=58.8k, BW=230MiB/s (241MB/s)(460MiB/2001msec); 0 zone resets
slat (usec): min=3, max=8786, avg=15.79, stdev=52.52
clat (usec): min=89, max=12701, avg=2156.85, stdev=661.89
lat (usec): min=869, max=12706, avg=2172.74, stdev=664.38
clat percentiles (usec):
| 1.00th=[ 1663], 5.00th=[ 1876], 10.00th=[ 1909], 20.00th=[ 1942],
| 30.00th=[ 1975], 40.00th=[ 1991], 50.00th=[ 2008], 60.00th=[ 2040],
| 70.00th=[ 2073], 80.00th=[ 2278], 90.00th=[ 2573], 95.00th=[ 2737],
| 99.00th=[ 3589], 99.50th=[ 4228], 99.90th=[11600], 99.95th=[11863],
| 99.99th=[12387]
bw ( KiB/s): min=201400, max=255024, per=99.94%, avg=235208.00, stdev=25556.36, samples=4
iops : min=50350, max=63756, avg=58802.00, stdev=6389.09, samples=4
lat (usec) : 100=0.01%, 1000=0.01%
lat (msec) : 2=45.08%, 4=54.31%, 10=0.24%, 20=0.36%
cpu : usr=8.00%, sys=83.05%, ctx=9781, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,117732,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
WRITE: bw=230MiB/s (241MB/s), 230MiB/s-230MiB/s (241MB/s-241MB/s), io=460MiB (482MB), run=2001-2001msec
Disk stats (read/write):
dm-1: ios=0/114784, merge=0/0, ticks=0/18148, in_queue=18176, util=93.49%, aggrios=0/120804, aggrmerge=0/3, aggrticks=0/20584, aggrin_queue=20568, aggrutil=91.68%
sda: ios=0/120804, merge=0/3, ticks=0/20584, in_queue=20568, util=91.68%
$ echo "---"; du -h fio.tmp; echo "--"; stat fio.tmp | grep Block; echo "--"; filefrag -e fio.tmp | head; echo "---"
---
11G fio.tmp
--
Size: 10737418240 Blocks: 20980048 IO Block: 4096 regular file
--
Filesystem type is: 58465342
File size of fio.tmp is 10737418240 (2621440 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 3: 9844404.. 9844407: 4: unwritten
1: 4.. 4: 9844408.. 9844408: 1:
2: 5.. 21: 9844409.. 9844425: 17: unwritten
3: 22.. 23: 9844426.. 9844427: 2:
4: 24.. 30: 9844428.. 9844434: 7: unwritten
5: 31.. 31: 9844435.. 9844435: 1:
6: 32.. 38: 9844436.. 9844442: 7: unwritten
---
So:
When it can, fio will ask the filesystem to reserve the blocks needed to create the file, because this leads to a better layout than allocating them on the fly (see https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-fallocate ). When a filesystem supports pre-allocation, it can take the attitude that it should be "smart" enough not to let you read back data you haven't written (note: not all filesystems on all operating systems take this attitude). This in turn can be an optimization where the filesystem returns zeroes for data you never wrote - and technically the file is not sparse!
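One way to take preallocation out of the picture is the fallocate option linked above; disabling it makes fio lay out the file by writing it rather than by reserving unwritten extents. A hypothetical job-file sketch (names and sizes are illustrative):

```ini
; Sketch: fallocate=none disables preallocation, so the layout phase
; writes the file instead of leaving unwritten extents behind
; (see the fallocate option in the docs linked above).
[readback]
filename=fio.tmp
size=10g
rw=randread
bs=4k
ioengine=libaio
direct=1
iodepth=128
fallocate=none
```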
At any rate, the general warning is: make sure you have fully and entirely written (ideally with sufficiently random data) the regions you later go on to read back, or you may see artificially fast results. Shortcuts can be taken by filesystems, block devices, disks, etc., and unfortunately a lot of us get tripped up by this.
@andrijapanic Hopefully this explains what you were seeing.
[...] if you are reading back "unwritten" data your storage stack can choose to take all sorts of shortcuts
Hi there,
I guessed so (partially written file / extents).
Thanks a lot for reproducing it - I understand these are FS issues (I recently even learned of some kernel issues when making an FS on a partition on top of an LV, instead of making the FS directly on the whole LV). Thanks again!
Btw, already closed :) thx
CentOS 7 (updated), enabled the EPEL repo, installed fio 3.1 - details of the setup and fio output are given below. Issue: iostat reports ~38MB/s (i.e. ~9200 x 4K IOPS), while fio reports 82,000 IOPS (incorrect). (This is a random read test - but I also saw a doubling of results on a sequential read test, while random/sequential WRITE seems correct.)
Details of setup and issue:
Physical setup of hardware:
Fio command:
echo 3 > /proc/sys/vm/drop_caches; fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread --iodepth=128 --numjobs=4 --rw=randread --bs=4k --size=100g --ioengine=libaio --filename=fio.tmp --direct=1 --group_reporting
FIO output (sorry for the verbose dump - let me know if I should move it to pastebin)
Besides that, I find the output confusing with regard to the following 2 lines:
I'm assuming the first one is the result (as in any fio test); I'm not sure what the other line means at all...
Thanks