axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

Requested queue depths on Windows are different to those achieved #875

Open Gt3pccb opened 4 years ago

Gt3pccb commented 4 years ago

As the title suggests, there is a difference between the requested queue depth and the queue depth actually observed.

The storage subsystem configs are:

  1. 32x 10TB NVMe Gen4 rulers on a PCIe 4.0 bus with an Intel PCIe 4.0 CPU (engineering sample).
  2. 32x 10TB NVMe Gen3 rulers on a PCIe 3.0 bus.
  3. 64x 8TB SSDs.

From our xperf traces on our filesystem we have observed that in (1) we are not the bottleneck, so we suspect that fio might not be submitting the appropriate requests. When using DiskSpd and CFStest we have no problem sustaining an equivalent load.

For example

  1. fio --thread --direct=1 --ioengine=windowsaio --time_based --ramp_time=60 --runtime=1200 --directory=v\:\FIO_Directory_1 --size=4000GB --rw=write --iodepth=64 --bs=1024K --nrfiles=1 --name=4xXR2 --numjobs=10 --fallocate=truncate --unlink=1
     IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

  2. fio --thread --direct=1 --ioengine=windowsaio --time_based --ramp_time=60 --runtime=1200 --directory=v\:\FIO_Directory_1 --size=4000GB --rw=write --iodepth=64 --bs=1024K --nrfiles=1 --name=4xXR2 --numjobs=1 --fallocate=truncate --unlink=1
     IO depths : 1=0.0%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

  3. fio --thread --direct=1 --ioengine=windowsaio --time_based --ramp_time=60 --runtime=1200 --directory=v\:\FIO_Directory_1 --size=4000GB --rw=write --iodepth=64 --bs=1024K --nrfiles=10 --name=4xXR2 --numjobs=1 --fallocate=truncate --unlink=1
     IO depths : 1=0.1%, 2=0.1%, 4=0.6%, 8=99.3%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%

None of these runs yields a volume/filesystem queue depth anywhere close to 100 queued I/Os:

  1. a 10-minute average of 1-second samples yields 150 queued I/Os
  2. a 10-minute average of 1-second samples yields 64 queued I/Os
  3. a 10-minute average of 1-second samples yields 64 queued I/Os

The same symptom occurs for --iodepth=16,32,64,128,256,512,1024 on all storage subsystems. *We only test up to --iodepth=256 for the SSDs.

Responding to Sitsofe's questions:

How much CPU is being taken up by fio alone? Is there any spare? CPU usage is negligible, at around 2-5%.

Does xperf also confirm a max outstanding depth of only 5-8 I/Os? Xperf and the controller firmware confirm the observations with regard to the queue depths received, and latency is not an issue.

If memory serves, DiskSpd also sets the watermark (if need be it will write the whole file once if it can't just mark all unwritten data as valid). Do things change if you use a pre-existing file that's been entirely written at least once?

fio --thread --direct=1 --ioengine=windowsaio --directory=v\:\ --size=40GB --rw=write --iodepth=64 --bs=1024K --nrfiles=1 --name=4xXR2 --numjobs=10 --fallocate=truncate --group_reporting=1 --overwrite=1 (or running the same test twice)

IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%
submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete  : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%

As for overwrite, this is one of the reasons I suggested set-valid-data: not only to fix the very slow first allocations, but because in-place writes behave differently than first allocations. The overwrite approach is painfully slow; when testing with large (>1TB) and many (100s of) files it would take an unreasonable amount of time to set up the tests.

sitsofe commented 4 years ago

@Gt3pccb Are you saying that you see the desired performance in the overwrite case (i.e. when the file has been fully written at least once already)? I understand that doing the first full set of writes is painfully slow but that's a slightly different thing to what I'm asking.

Gt3pccb commented 4 years ago

In-place writes (overwrites) perform as expected.

iodepth_batch_submit and iodepth_batch_complete_max do not change the behavior.

sitsofe commented 4 years ago

So just to summarise, on Windows:

  1. It's good to use SetEndOfFile() to help Windows allocate contiguous storage (see https://superuser.com/a/274867/ ) for the file
  2. But the amount of in-flight parallel I/O achieved will still be low on files whose Valid Data Length (VDL) (essentially) isn't the same as the End-Of-File (perhaps because of zero backfilling - see https://devblogs.microsoft.com/oldnewthing/20110922-00/?p=9573 )
  3. The SetFileValidData() call can quickly set the VDL but is a potential security risk because it exposes old data that was previously written to the disk, and thus it is a privileged call
  4. If you can't call SetFileValidData(), the only way to move the VDL to the end of the file is to write the whole file (or just the end of the file, which will trigger backfill)

DiskSpd essentially does 1. and tries to do 3., but if it can't it falls back to 4. (writing the whole file); see https://github.com/microsoft/diskspd/blob/3c154ffdc34a44d51fa000c6fa61b9a826738e5f/IORequestGenerator/IORequestGenerator.cpp#L1894 .
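To make the sequence concrete, here is a minimal C sketch of steps 1, 3 and 4 above: extend the file with SetEndOfFile(), then try to move the VDL forward with SetFileValidData(), falling back to pre-writing the file when the privilege isn't available. This is only an illustrative outline under assumptions (the path and size are made up, most error handling is omitted, and it is not fio's or DiskSpd's actual code); the token calls need advapi32.

#include <windows.h>
#include <stdio.h>

/* Try to enable SeManageVolumePrivilege; SetFileValidData() fails without it. */
static BOOL enable_manage_volume_privilege(void)
{
    HANDLE tok;
    TOKEN_PRIVILEGES tp;
    BOOL ok;

    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &tok))
        return FALSE;

    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    if (!LookupPrivilegeValueW(NULL, L"SeManageVolumePrivilege",
                               &tp.Privileges[0].Luid)) {
        CloseHandle(tok);
        return FALSE;
    }

    AdjustTokenPrivileges(tok, FALSE, &tp, 0, NULL, NULL);
    /* AdjustTokenPrivileges() can "succeed" without granting the privilege. */
    ok = (GetLastError() == ERROR_SUCCESS);
    CloseHandle(tok);
    return ok;
}

int main(void)
{
    const LONGLONG size = 4LL * 1024 * 1024 * 1024;  /* example size: 4 GiB */
    LARGE_INTEGER eof;
    HANDLE h;

    /* Hypothetical target path; a real test would use the job's own file. */
    h = CreateFileW(L"V:\\FIO_Directory_1\\prealloc.bin",
                    GENERIC_READ | GENERIC_WRITE, 0, NULL,
                    CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* 1. Set the end-of-file up front so the filesystem can allocate space. */
    eof.QuadPart = size;
    if (!SetFilePointerEx(h, eof, NULL, FILE_BEGIN) || !SetEndOfFile(h)) {
        CloseHandle(h);
        return 1;
    }

    if (enable_manage_volume_privilege() && SetFileValidData(h, size)) {
        /* 3. VDL now equals EOF, so unbuffered writes anywhere in the file
         *    are not held up by zero backfilling. Note this exposes whatever
         *    stale data was previously on those disk blocks.                 */
        printf("VDL moved to EOF with SetFileValidData()\n");
    } else {
        /* 4. Without the privilege, the only way to move the VDL is to
         *    write the whole file once (what DiskSpd falls back to).         */
        printf("SetFileValidData() unavailable; pre-write the file instead\n");
    }

    CloseHandle(h);
    return 0;
}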

(This follows on from what was discussed in https://github.com/axboe/fio/issues/833#issuecomment-538628196 )

diogin commented 4 years ago

I got similar results: io submit and io complete do not seem to work; it always submits and completes at iodepth=4, even though I set them to 32.

Job file:

[global]
rw=randread
direct=1
ioengine=windowsaio
size=1G
bs=4k
iodepth=32
iodepth_batch_submit=32
iodepth_batch_complete_min=32
iodepth_batch_complete_max=32
iodepth_low=32
thread
numjobs=2
filename=\\.\PhysicalDrive1
group_reporting=1

[4kRandRead]

Results:

PS E:\temp> fio .\pm981.fio
4kRandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=32
...
fio-3.16
Starting 2 threads
Jobs: 2 (f=2): [r(2)][-.-%][r=610MiB/s][r=156k IOPS][eta 00m:00s]
4kRandRead: (groupid=0, jobs=2): err= 0: pid=3648: Tue Feb 4 14:25:18 2020
  read: IOPS=155k, BW=607MiB/s (636MB/s)(2048MiB/3374msec)
    slat (usec): min=3, max=108, avg= 5.16, stdev= 2.41
    clat (usec): min=85, max=771, avg=324.53, stdev=83.71
     lat (usec): min=90, max=776, avg=329.69, stdev=83.21
    clat percentiles (usec):
     |  1.00th=[  155],  5.00th=[  194], 10.00th=[  219], 20.00th=[  253],
     | 30.00th=[  277], 40.00th=[  302], 50.00th=[  322], 60.00th=[  343],
     | 70.00th=[  367], 80.00th=[  392], 90.00th=[  433], 95.00th=[  469],
     | 99.00th=[  545], 99.50th=[  570], 99.90th=[  635], 99.95th=[  668],
     | 99.99th=[  725]
   bw (  KiB/s): min=613888, max=628480, per=100.00%, avg=622421.33, stdev=2926.66, samples=12
   iops        : min=153472, max=157120, avg=155605.33, stdev=731.67, samples=12
  lat (usec)   : 100=0.01%, 250=19.14%, 500=78.22%, 750=2.63%, 1000=0.01%
  cpu          : usr=0.00%, sys=29.68%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=97.0%, 8=0.0%, 16=0.0%, 32=3.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=607MiB/s (636MB/s), 607MiB/s-607MiB/s (636MB/s-636MB/s), io=2048MiB (2147MB), run=3374-3374msec

Gt3pccb commented 2 years ago

Sitsofe, the queue-depth differential is consistent across both AMD and Intel platforms, and for all device types (raw, initialized, and with a file system).