andrijapanic-dont-use-this-one closed this issue 5 years ago
@andrijapanic Hi!
You don't say which filesystem you are doing all this on, but a filesystem is well within its rights to fall back to buffered I/O on Linux. O_DIRECT on Linux is more of a hint - just because the I/O was accepted, you don't know for sure whether it will really be sent without buffering. In fact, we can see the telltale signs of buffered I/O in the output you correctly chose to provide:
lat (usec): min=2, max=2424.0k, avg=6176.00, stdev=32874.65
Look at that minimum latency of 2 microseconds - is that realistic for your setup? Also see this:
clat percentiles (nsec):
| 1.00th=[ 740], 5.00th=[ 788], 10.00th=[ 812],
| 20.00th=[ 836], 30.00th=[ 860], 40.00th=[ 900],
| 50.00th=[ 956], 60.00th=[ 1848], 70.00th=[ 6112],
At least 70% of I/Os completed in under 7000 nanoseconds (7 microseconds). Is this result obtainable on your hardware if the I/Os are actually going down to the real disk?
Hi there,
Thanks for this hint. I believe I mentioned it's an ext4 filesystem on this partition (mounted with -o nobarrier) - and of course the latency you mentioned is not realistic.
But again, with iostat I see barely 32MB/s, and before doing any reads I drop the caches (as per the commands above) - so I don't see how it's possible (though I might be wrong...) that any data is being read from a buffer (OS RAM in this case).
When I increased the test file size to 5TB (across 25TB of net space...), I was getting 500K read IOPS...
Any hints on what else to check?
FYI - I always use ext4 and direct=1, and have never seen this behavior before...
thx
@andrijapanic Wait a moment - are you writing anything into your file before reading it back? And/or are you dealing with SSDs? This is important because if you are reading back "unwritten" data your storage stack can choose to take all sorts of shortcuts. Try doing a write job to say 100G then reading just that 100G back...
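The write-then-read suggestion above can be expressed as a two-phase job file sketch (job names, file name, and sizes are illustrative, not from this report); the stonewall makes the read job wait until the fill job has finished:

```ini
; Hypothetical two-phase job file: fill the file first, then read
; back only that same region with O_DIRECT random reads.
[global]
filename=fio.tmp
size=100g
bs=4k
ioengine=libaio
direct=1

[fill]
rw=write
iodepth=32

[readback]
stonewall
rw=randread
iodepth=128
```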
No, these are HDDs: 14 x 4TB in RAID10 (software RAID, dm-raid). When you start a benchmark (read or write), fio first lays down a test file (writing those 100GB takes time), and only then does it start the benchmark job. I ran the script's read/write jobs in both random and sequential tests. On the next run, the same read "issues" appeared again (and random write and sequential write had been done before that anyway)...
Hmm, looks like you've got that angle covered off :-)
OK, I'm going to fall back to my standby of "perhaps your filesystem is buffering behind your back". Ted Ts'o has stated that a filesystem can choose to do this (see https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1279956.html ). Is your ext4 filesystem set to use special journalling options? Also, if you can afford to DESTROY YOUR FILESYSTEM, can you see whether the results look more sane going straight to your md device?
If it were an empty-device condition, vmstat/iostat would still show the same performance metrics as fio. For cases like this, where fio thinks it's doing far more IOPS than the device is, it is usually caching. What you are observing is that fio really is doing that many IOPS - there's no lying there - but for some reason the backend device(s) are seeing far fewer. Caching would explain that, or I/O merging. The latter would be observable in the disk stats output, however, and is usually only a real factor when you are doing sequential I/O.
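The device-level counters that iostat (and fio's own "Disk stats" section) report come from /proc/diskstats, where the reads-merged field makes block-layer merging visible. A minimal parsing sketch (the sample line is invented for illustration, loosely modeled on the sda line further down this thread):

```python
# Sketch: spot I/O merging at the block layer from a /proc/diskstats line.
# Fields after major/minor/name: reads completed, reads merged,
# sectors read, ms spent reading, then the write-side equivalents.
sample = "   8       0 sda 153002 14 1224016 250008 0 0 0 0 0 180000 250008"

fields = sample.split()
reads_completed = int(fields[3])
reads_merged = int(fields[4])

# If fio's reported IOPS far exceed the growth of these counters on the
# device, the extra I/Os were satisfied by a cache, not merged away.
print(reads_completed, reads_merged)  # 153002 14
```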
I will try to reproduce; I have to reinstall the whole system anyway... The problem is I've never had issues like this before - but again, let me reproduce it on a brand-new OS setup...
thx
Besides that, I find the output confusing with regard to the following 2 lines:
read: IOPS=82.8k, BW=324MiB/s (339MB/s)(19.0GiB/60144msec) and iops : min=14713, max=24863, avg=20825.56, stdev=1576.35, samples=480
One result is a total calculation (the total amount of I/O or IOPS done divided by time) and the other is produced by averaging samples. See things like the bw section of https://fio.readthedocs.io/en/latest/fio_doc.html#interpreting-the-output for another explanation.
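A quick sanity check on why the two figures needn't match (numbers reused from the output quoted above; it assumes the "iops :" samples are per-job when numjobs=4 is combined with group_reporting, which is my reading of the linked docs rather than something fio prints explicitly):

```python
# Sketch: reconciling fio's two IOPS figures from the output above.
total_iops = 82.8e3            # "read: IOPS=82.8k" - total I/O / wall time
per_job_sample_avg = 20825.56  # "iops : ... avg=20825.56" over 480 samples
numjobs = 4                    # 480 samples = 4 jobs x ~120 one-second samples

# If the samples are per-job, scaling by the job count should roughly
# recover the group total:
print(per_job_sample_avg * numjobs)  # ~83302, close to 82.8k
```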
Thanks for the clarification - I've just read the link you sent. In that example, the counted/calculated IOPS and the sampled average iops are equal (88 IOPS), while my calculated value is 82K and the sampled one is 20K - quite a difference. Does this make sense? Sorry if this is a silly question...
@andrijapanic In regard to the different stat results, I think you might be seeing the same issue mentioned over in https://github.com/axboe/fio/issues/519#issuecomment-360721747 ...
So... I reinstalled the OS from scratch: identical kernel and fio versions, same ext4 fs. The only difference is that this time I made the mdraid on top of the same set of disks, but on the raw disks (no partitions); the rest is the same (ext4 on the mdraid10 device, etc.) - and I can NOT reproduce this anymore...
I'm not sure what was happening previously...
Is there any possibility that previously (due to my forcefully killing fio processes) the test file (100GB) was not fully written/prepared/laid out, just allocated on the FS, and that this later led to fio reading from it (during the benchmark) with such high values?
Yes, that's a good point: if you have a sparse file you'd see exactly this kind of behavior - most of the I/Os would just return zeroes, and only the subset that didn't land in a hole would actually need to hit the disk.
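The holes-read-as-zeroes behavior is easy to demonstrate outside fio; a minimal Python sketch (file name and size are arbitrary):

```python
import os

# Sketch: reads from a hole in a sparse file return zeroes without any
# disk I/O, which is what can inflate apparent read performance.
path = "sparse.tmp"
size = 10 * 1024 * 1024  # 10 MiB logical size

with open(path, "wb") as f:
    f.truncate(size)  # extend the file without writing: one big hole

st = os.stat(path)
# On most filesystems st_blocks stays at (or near) zero for a pure hole.
print(st.st_size, st.st_blocks)

with open(path, "rb") as f:
    data = f.read(4096)
print(data == b"\x00" * 4096)  # holes read back as zeroes

os.remove(path)
```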
Confirmed the issue artificially:
Jobs: 4 (f=4): [r(4)][2.0%][r=3146MiB/s,w=0KiB/s][r=805k,w=0 IOPS][eta 09m:49s]
Although iostat/iotop shows zero I/O (since the file is 100% empty, versus the previous case where some parts of the file had actually been written by the fio process...)
Well, my apologies for wasting your time here :) I thought I was going mad...
Thanks a lot for the clarification.
@andrijapanic I think that's what I was trying to get at when I wrote "are you writing anything into your file before reading it back? [...] Try doing a write job to say 100G then reading just that 100G back" - I assumed you were writing to all 100G :-)
At any rate, I'm glad you got to the bottom of the issue! I'll close it.
@andrijapanic Oops I got fooled by an old open tab but I'll leave my longer answer here for posterity.
You're correct that the problem is due to unwritten data being read back, but it's not due to what we usually think of as a sparse file...
Some background: fio only writes to every part of a file during pre-layout for certain types of jobs, when it knows the file doesn't exist or isn't big enough (in which case it can do an extending allocation). If your file doesn't exist AND you do a random write job AND you interrupt that random write job before it has written to every region of the file, THEN you can end up with unwritten parts of the file (though technically the file is not sparse)...
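A back-of-envelope sketch of how little of the file an interrupted random-write job may have touched (the write count here is an invented example, not taken from this report):

```python
# Sketch: expected fraction of a file's blocks ever written by k
# uniformly random 4k writes into an n-block file: 1 - (1 - 1/n)**k.
n = 100 * 1024**3 // 4096   # 100 GiB file in 4 KiB blocks
k = 2_000_000               # writes completed before the job was killed

covered = 1 - (1 - 1 / n) ** k
print(f"{covered:.1%}")  # ~7.3% written; the rest reads back as zeroes
```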
OK so let's check some stuff:
$ # Sparse file
$ rm -f fio.tmp; dd if=/dev/zero of=fio.tmp bs=1 count=0 seek=10G
0+0 records in
0+0 records out
0 bytes copied, 0.000222564 s, 0.0 kB/s
$ echo "---"; du -h fio.tmp; echo "--"; stat fio.tmp | grep Block; echo "--"; filefrag -e fio.tmp | head; echo "---"
---
0 fio.tmp
--
Size: 10737418240 Blocks: 0 IO Block: 4096 regular file
--
Filesystem type is: 58465342
File size of fio.tmp is 10737418240 (2621440 blocks of 4096 bytes)
fio.tmp: 0 extents found
---
$ # Fully written file
$ rm fio.tmp; ./fio --name=pureread --iodepth=128 --bs=4k --size=10G --ioengine=libaio --filename=fio.tmp --direct=1 --runtime=2 --time_based --rw=randread
pureread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.12-20-g4cf3
Starting 1 process
pureread: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [r(1)][-.-%][r=300MiB/s][r=76.8k IOPS][eta 00m:00s]
pureread: (groupid=0, jobs=1): err= 0: pid=19933: Mon Dec 10 06:24:50 2018
read: IOPS=76.4k, BW=299MiB/s (313MB/s)(598MiB/2002msec)
slat (usec): min=2, max=320, avg= 6.65, stdev= 4.84
clat (usec): min=264, max=4248, avg=1665.54, stdev=711.49
lat (usec): min=271, max=4256, avg=1672.38, stdev=711.51
clat percentiles (usec):
| 1.00th=[ 437], 5.00th=[ 553], 10.00th=[ 685], 20.00th=[ 938],
| 30.00th=[ 1172], 40.00th=[ 1418], 50.00th=[ 1663], 60.00th=[ 1909],
| 70.00th=[ 2147], 80.00th=[ 2376], 90.00th=[ 2638], 95.00th=[ 2769],
| 99.00th=[ 2966], 99.50th=[ 3064], 99.90th=[ 3195], 99.95th=[ 3261],
| 99.99th=[ 3458]
bw ( KiB/s): min=302200, max=309648, per=100.00%, avg=305778.00, stdev=3318.86, samples=4
iops : min=75550, max=77412, avg=76444.50, stdev=829.72, samples=4
lat (usec) : 500=2.84%, 750=9.80%, 1000=9.96%
lat (msec) : 2=41.14%, 4=36.27%, 10=0.01%
cpu : usr=10.24%, sys=61.72%, ctx=17382, majf=0, minf=141
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=153016,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=299MiB/s (313MB/s), 299MiB/s-299MiB/s (313MB/s-313MB/s), io=598MiB (627MB), run=2002-2002msec
Disk stats (read/write):
dm-1: ios=151064/0, merge=0/0, ticks=246612/0, in_queue=246632, util=95.04%, aggrios=153002/0, aggrmerge=14/0, aggrticks=250008/0, aggrin_queue=249984, aggrutil=92.98%
sda: ios=153002/0, merge=14/0, ticks=250008/0, in_queue=249984, util=92.98%
$ echo "---"; du -h fio.tmp; echo "--"; stat fio.tmp | grep Block; echo "--"; filefrag -e fio.tmp | head; echo "---"
---
10G fio.tmp
--
Size: 10737418240 Blocks: 20971520 IO Block: 4096 regular file
--
Filesystem type is: 58465342
File size of fio.tmp is 10737418240 (2621440 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 2097150: 9844404.. 11941554: 2097151:
1: 2097151.. 2621439: 11941555.. 12465843: 524289: last,eof
fio.tmp: 1 extent found
---
$ # Partially written file (fio write job is cut short)
$ rm fio.tmp; ./fio --name=partialwrite --iodepth=128 --bs=4k --size=10G --ioengine=libaio --filename=fio.tmp --direct=1 --runtime=2 --time_based --rw=randwrite
partialwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.12-20-g4cf3
Starting 1 process
partialwrite: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1)
partialwrite: (groupid=0, jobs=1): err= 0: pid=20436: Mon Dec 10 06:27:22 2018
write: IOPS=58.8k, BW=230MiB/s (241MB/s)(460MiB/2001msec); 0 zone resets
slat (usec): min=3, max=8786, avg=15.79, stdev=52.52
clat (usec): min=89, max=12701, avg=2156.85, stdev=661.89
lat (usec): min=869, max=12706, avg=2172.74, stdev=664.38
clat percentiles (usec):
| 1.00th=[ 1663], 5.00th=[ 1876], 10.00th=[ 1909], 20.00th=[ 1942],
| 30.00th=[ 1975], 40.00th=[ 1991], 50.00th=[ 2008], 60.00th=[ 2040],
| 70.00th=[ 2073], 80.00th=[ 2278], 90.00th=[ 2573], 95.00th=[ 2737],
| 99.00th=[ 3589], 99.50th=[ 4228], 99.90th=[11600], 99.95th=[11863],
| 99.99th=[12387]
bw ( KiB/s): min=201400, max=255024, per=99.94%, avg=235208.00, stdev=25556.36, samples=4
iops : min=50350, max=63756, avg=58802.00, stdev=6389.09, samples=4
lat (usec) : 100=0.01%, 1000=0.01%
lat (msec) : 2=45.08%, 4=54.31%, 10=0.24%, 20=0.36%
cpu : usr=8.00%, sys=83.05%, ctx=9781, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,117732,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
WRITE: bw=230MiB/s (241MB/s), 230MiB/s-230MiB/s (241MB/s-241MB/s), io=460MiB (482MB), run=2001-2001msec
Disk stats (read/write):
dm-1: ios=0/114784, merge=0/0, ticks=0/18148, in_queue=18176, util=93.49%, aggrios=0/120804, aggrmerge=0/3, aggrticks=0/20584, aggrin_queue=20568, aggrutil=91.68%
sda: ios=0/120804, merge=0/3, ticks=0/20584, in_queue=20568, util=91.68%
$ echo "---"; du -h fio.tmp; echo "--"; stat fio.tmp | grep Block; echo "--"; filefrag -e fio.tmp | head; echo "---"
---
11G fio.tmp
--
Size: 10737418240 Blocks: 20980048 IO Block: 4096 regular file
--
Filesystem type is: 58465342
File size of fio.tmp is 10737418240 (2621440 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 3: 9844404.. 9844407: 4: unwritten
1: 4.. 4: 9844408.. 9844408: 1:
2: 5.. 21: 9844409.. 9844425: 17: unwritten
3: 22.. 23: 9844426.. 9844427: 2:
4: 24.. 30: 9844428.. 9844434: 7: unwritten
5: 31.. 31: 9844435.. 9844435: 1:
6: 32.. 38: 9844436.. 9844442: 7: unwritten
---
So:
When it can, fio will ask the filesystem to reserve the blocks needed to create the file, because this leads to a better layout than allocating them on the fly (see https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-fallocate ). When a filesystem supports pre-allocation, it can take the attitude that it should be "smart" enough not to let you read back data you haven't written (note: not all filesystems on all operating systems take this attitude). This in turn can be an optimization where the filesystem returns zeroes for data you never wrote - and technically the file is not sparse!
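One way to take preallocation out of the picture is the fallocate option linked above; disabling it makes fio lay out the file by writing it rather than by reserving unwritten extents. A hypothetical job-file sketch (names and sizes are illustrative):

```ini
; Sketch: fallocate=none disables preallocation, so the layout phase
; writes the file instead of leaving unwritten extents behind
; (see the fallocate option in the docs linked above).
[readback]
filename=fio.tmp
size=10g
rw=randread
bs=4k
ioengine=libaio
direct=1
iodepth=128
fallocate=none
```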
At any rate, the general warning is: make sure you have fully and entirely written (ideally with sufficiently random data) the regions you later go on to read back, or you may see artificially fast results. Shortcuts can be taken by filesystems, block devices, disks, etc., and unfortunately a lot of us get tripped up by this.
@andrijapanic Hopefully this explains what you were seeing.
[...] if you are reading back "unwritten" data your storage stack can choose to take all sorts of shortcuts
Hi there,
I guessed so (partially written file / extents).
Thanks a lot for reproducing it - I understand these are FS issues (I recently even learned of some kernel issues when making an FS on a partition on top of an LV, instead of making the FS directly on the whole LV). Thanks again!
Btw, already closed :) thx
CentOS 7 (updated), enabled the EPEL repo, installed fio 3.1 - details of the setup and fio output are given below. Issue: iostat reports ~38MB/s (i.e. ~9200 x 4K IOPS), while fio reports 82,000 IOPS (incorrect). (This is a random read test - but I also saw a doubling of results on a sequential read test, while random/sequential WRITE seems correct.)
Details of setup and issue:
Physical setup of hardware:
Fio command:
echo 3 > /proc/sys/vm/drop_caches; fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread --iodepth=128 --numjobs=4 --rw=randread --bs=4k --size=100g --ioengine=libaio --filename=fio.tmp --direct=1 --group_reporting
FIO output (sorry for the verbose dump - let me know if I should move it to pastebin)
Besides that, I find the output confusing with regard to the following 2 lines:
I'm assuming the first one is the result (as in any fio test); I'm not sure what the other line means at all...
Thanks