Open Gt3pccb opened 4 years ago
@Gt3pccb Are you saying that you see the desired performance in the overwrite case (i.e. when the file has been fully written at least once already)? I understand that doing the first full set of writes is painfully slow but that's a slightly different thing to what I'm asking.
In-place writes (overwrites) behave as expected.
iodepth_batch_submit and iodepth_batch_complete_max do not change the behavior.
So just to summarise, on Windows:
• Calling SetEndOfFile() helps Windows allocate contiguous storage for the file (see https://superuser.com/a/274867/ ).
• A SetFileValidData() call can quickly set the VDL (valid data length), but it is a potential security risk because it exposes old data that was previously written to the disk, and thus it is a privileged call.
• Without SetFileValidData(), the only way you can change the VDL to be the end of the file is by writing the whole file (or just the end of the file, which will trigger backfill).
DiskSpd essentially does the SetEndOfFile() step and tries SetFileValidData(), but if it can't then it falls back to writing the whole file; see https://github.com/microsoft/diskspd/blob/3c154ffdc34a44d51fa000c6fa61b9a826738e5f/IORequestGenerator/IORequestGenerator.cpp#L1894 .
(This follows on from what was discussed in https://github.com/axboe/fio/issues/833#issuecomment-538628196 )
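The sequence summarised above can be sketched in a few lines. This is only an illustration in portable Python, not DiskSpd's or fio's code: truncate() stands in for SetEndOfFile(), and try_fast_path is a hypothetical callable standing in for the privileged SetFileValidData() call (assumed to return True on success). prepare_file and CHUNK are likewise made-up names.

```python
import os

CHUNK = 1024 * 1024  # write the fallback data in 1 MiB pieces


def prepare_file(path, size, try_fast_path):
    """Extend `path` to `size` bytes, then make the whole range valid.

    try_fast_path(fd, size) is a HYPOTHETICAL stand-in for the privileged
    SetFileValidData() call; it must return True if it succeeded.
    """
    with open(path, "wb") as f:
        # Set the file length up front (the SetEndOfFile() analogue).
        f.truncate(size)
        # Try the quick, privileged route first.
        if try_fast_path(f.fileno(), size):
            return "fast"
        # Otherwise fall back to writing the whole file once.
        zeros = bytes(CHUNK)
        remaining = size
        f.seek(0)
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
    return "written"
```

Calling prepare_file(path, size, lambda fd, size: False) models an unprivileged run: the fast path is refused and every byte gets written, which is the slow first pass discussed in this issue.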
I have got similar results: the I/O submit and complete batch settings don't seem to work. fio always reports submit and complete in the 4 bucket, even though I set them to 32.
Job file:
[global]
rw=randread
direct=1
ioengine=windowsaio
size=1G
bs=4k
iodepth=32
iodepth_batch_submit=32
iodepth_batch_complete_min=32
iodepth_batch_complete_max=32
iodepth_low=32
thread
numjobs=2
filename=\\.\PhysicalDrive1
group_reporting=1
[4kRandRead]
Results:
PS E:\temp> fio .\pm981.fio
4kRandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=windowsaio, iodepth=32
...
fio-3.16
Starting 2 threads
Jobs: 2 (f=2): [r(2)][-.-%][r=610MiB/s][r=156k IOPS][eta 00m:00s]
4kRandRead: (groupid=0, jobs=2): err= 0: pid=3648: Tue Feb 4 14:25:18 2020
read: IOPS=155k, BW=607MiB/s (636MB/s)(2048MiB/3374msec)
slat (usec): min=3, max=108, avg= 5.16, stdev= 2.41
clat (usec): min=85, max=771, avg=324.53, stdev=83.71
lat (usec): min=90, max=776, avg=329.69, stdev=83.21
clat percentiles (usec):
| 1.00th=[ 155], 5.00th=[ 194], 10.00th=[ 219], 20.00th=[ 253],
| 30.00th=[ 277], 40.00th=[ 302], 50.00th=[ 322], 60.00th=[ 343],
| 70.00th=[ 367], 80.00th=[ 392], 90.00th=[ 433], 95.00th=[ 469],
| 99.00th=[ 545], 99.50th=[ 570], 99.90th=[ 635], 99.95th=[ 668],
| 99.99th=[ 725]
bw ( KiB/s): min=613888, max=628480, per=100.00%, avg=622421.33, stdev=2926.66, samples=12
iops : min=153472, max=157120, avg=155605.33, stdev=731.67, samples=12
lat (usec) : 100=0.01%, 250=19.14%, 500=78.22%, 750=2.63%, 1000=0.01%
cpu : usr=0.00%, sys=29.68%, ctx=0, majf=0, minf=0
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=97.0%, 8=0.0%, 16=0.0%, 32=3.0%, 64=0.0%, >=64=0.0%
issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=607MiB/s (636MB/s), 607MiB/s-607MiB/s (636MB/s-636MB/s), io=2048MiB (2147MB), run=3374-3374msec
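To compare runs like the one above without eyeballing the percentages, the distribution lines can be parsed. A minimal sketch, assuming the line format shown in this output; parse_distribution and dominant_bucket are hypothetical helper names, not part of fio. As far as I understand the fio documentation, each submit/complete entry covers that amount and below, down to the previous entry, so 4=100.0% means between 1 and 4 I/Os per submit call.

```python
import re


def parse_distribution(line):
    """Parse an fio distribution line into (label, {bucket: percent})."""
    label, _, rest = line.partition(":")
    buckets = {}
    # Keys look like "4" or ">=64"; values look like "100.0%".
    for key, value in re.findall(r"(>=\d+|\d+)=(\d+(?:\.\d+)?)%", rest):
        buckets[key] = float(value)
    return label.strip(), buckets


def dominant_bucket(buckets):
    """Return the bucket key holding the largest percentage."""
    return max(buckets, key=buckets.get)


submit = "submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%"
label, buckets = parse_distribution(submit)
print(label, dominant_bucket(buckets))  # prints "submit 4"
```

The same helper works on the "IO depths" and "complete" lines, which makes it easier to line up the requested iodepth against what fio reports.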
Sitsofe, the queue differential is consistent across both AMD and Intel platforms, for all device types RAW/Initialized/File System.
As the title suggests, there is a difference between the indicated (requested) queue depth and the observed one.
The storage subsystem configs are:
• 32 x 10TB NVMe Gen4 rulers on a PCIe-4 bus with an Intel PCIe-4 CPU (engineering).
• 32 x 10TB NVMe Gen3 rulers on a PCIe-3 bus.
• 64 x 8TB SSDs.
From our xperf traces on our filesystem we have observed that in (1) we are not the bottleneck, so we suspect that fio might not be submitting the appropriate requests. When using DiskSpd and CFStest, we have no problem sustaining an equivalent load.
For example:
fio --thread --direct=1 --ioengine=windowsaio --time_based --ramp_time=60 --runtime=1200 --directory=v\:\FIO_Directory_1 --size=4000GB --rw=write --iodepth=64 --bs=1024K --nrfiles=1 --name=4xXR2 --numjobs=10 --fallocate=truncate --unlink=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
fio --thread --direct=1 --ioengine=windowsaio --time_based --ramp_time=60 --runtime=1200 --directory=v\:\FIO_Directory_1 --size=4000GB --rw=write --iodepth=64 --bs=1024K --nrfiles=1 --name=4xXR2 --numjobs=1 --fallocate=truncate --unlink=1
IO depths : 1=0.0%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
fio --thread --direct=1 --ioengine=windowsaio --time_based --ramp_time=60 --runtime=1200 --directory=v\:\FIO_Directory_1 --size=4000GB --rw=write --iodepth=64 --bs=1024K --nrfiles=10 --name=4xXR2 --numjobs=1 --fallocate=truncate --unlink=1
IO depths : 1=0.1%, 2=0.1%, 4=0.6%, 8=99.3%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
None of these specs yields a volume/fs queue depth anywhere close to 100 queued I/Os.
The same symptom appears for --iodepth=16, 32, 64, 128, 256, 512 and 1024 on all storage subsystems (we only test up to --iodepth=256 for SSDs).
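For anyone wanting to reproduce the sweep, here is a rough sketch that builds the command line for each iodepth. The flags are the ones used in the commands above; fio_command and requested_outstanding are hypothetical helpers, and iodepth x numjobs is only an assumed upper bound on the I/Os fio is asked to keep in flight across all jobs, not something measured.

```python
IODEPTHS = [16, 32, 64, 128, 256, 512, 1024]


def fio_command(iodepth, numjobs=1, nrfiles=1):
    """Build the argument list for one run of the sweep."""
    return [
        "fio", "--thread", "--direct=1", "--ioengine=windowsaio",
        "--time_based", "--ramp_time=60", "--runtime=1200",
        r"--directory=v\:\FIO_Directory_1", "--size=4000GB", "--rw=write",
        f"--iodepth={iodepth}", "--bs=1024K", f"--nrfiles={nrfiles}",
        "--name=4xXR2", f"--numjobs={numjobs}",
        "--fallocate=truncate", "--unlink=1",
    ]


def requested_outstanding(iodepth, numjobs):
    """Assumed upper bound on in-flight I/Os: each job may queue up to iodepth."""
    return iodepth * numjobs


for depth in IODEPTHS:
    # Print the command and the depth it asks for, to compare with xperf.
    print(requested_outstanding(depth, 1), " ".join(fio_command(depth)))
```

Under that assumption, the first example (iodepth=64, numjobs=10) asks for up to 640 outstanding I/Os, which makes the reported depths above look even further from what was requested.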
Responding to Sitsofe's questions:
Q: How much of a CPU is being taken up by fio alone? Is there any spare?
A: CPU usage is "negligible", at around 2-5%.
Q: Does xperf also confirm a max outstanding depth of only 5-8 I/Os?
A: Xperf and the controller firmware both confirm the observations regarding the queue depths received, and that latencies are not an issue.
Q: If memory serves DiskSpd also sets the watermark (if needs be it will write the whole file once if it can't just mark all unwritten data as valid). Do things change if you use a pre-existing file that's been entirely written at least once?
A: fio --thread --direct=1 --ioengine=windowsaio --directory=v\:\ --size=40GB --rw=write --iodepth=64 --bs=1024K --nrfiles=1 --name=4xXR2 --numjobs=10 --fallocate=truncate --group_reporting=1 --overwrite=1 (or running the same test twice)
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
As for overwrite, this is one of the reasons why I suggested the set valid data (SetFileValidData()) approach: not only to fix the very slow first allocations, but also because in-place writes behave differently from first allocations. The overwrite option is painfully slow; especially when testing with large (>1TB) and multiple (>100s) files, it would take an unreasonable time to set up the tests.