YanLi-SFX closed this issue 1 month ago.
It looks like there is a problem with the display of the table I posted when I opened this bug:
For each write operation, fio by default places a header at the beginning of the buffer that includes the checksum and some other fields; the rest of the buffer holds random data. What happened here is that fio calculated an md5 checksum over the data it read back from your file, compared it to the value stored in the header, and found a mismatch.
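As a rough illustration of the scheme described above (a simplified sketch, not fio's actual code or header layout — fio's real header carries more fields than just the digest), verification amounts to recomputing the checksum over the payload read back and comparing it to the value stored in the header:

```python
import hashlib
import os

HDR_LEN = 16  # md5 digest length; fio's real verify header has more fields

def make_buffer(payload_len: int) -> bytes:
    """Build a write buffer: [md5 of payload][random payload]."""
    payload = os.urandom(payload_len)
    return hashlib.md5(payload).digest() + payload

def verify_buffer(buf: bytes) -> bool:
    """Recompute md5 over the payload and compare to the stored header value."""
    stored, payload = buf[:HDR_LEN], buf[HDR_LEN:]
    return hashlib.md5(payload).digest() == stored

buf = make_buffer(4096 - HDR_LEN)
assert verify_buffer(buf)                         # clean read-back verifies
corrupted = buf[:-1] + bytes([buf[-1] ^ 0xFF])    # flip one bit of the payload
assert not verify_buffer(corrupted)               # ...and verification fails
```

In this picture, the "Expected CRC" is the value stored in the header at write time and the "Received CRC" is the digest recomputed from the read-back data.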
My advice is to pare your job down to the minimum set of options that still triggers this issue, then find a way to reproduce it reliably. You may also wish to run with --debug=io,verify to see if the debug messages shed any light on the problem. Finally, try the latest version of fio.
Many thanks for your advice! I will try it and update if I find something new.
Description of the bug: Hi, I've encountered a few fio errors similar to the following while testing SSDs:
sudo fio -filename=/mnt/nvme0n1/mode_write-bs_1024K-qd_128-size_29G-loop_2325 -size=29G -iodepth=128 -rw=write -bs=1024K -numjobs=1 -name=fiotest -direct=1 -ioengine=libaio -group_reporting -do_verify=1 -verify=md5 -buffer_compress_chunk=4k -buffer_compress_percentage=6

fiotest: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
fio-3.13
Starting 1 process
fiotest: Laying out IO file (1 file / 29696MiB)
md5: verify failed at file /mnt/nvme0n1/mode_write-bs_1024K-qd_128-size_29G-loop_2325 offset 2694840320, length 1048576 (requested block: offset=2694840320, length=1048576)
       Expected CRC: 5109c53320c043225ef276d4535a6b32
       Received CRC: 59c686884a95622702a18f57a494c5a4
fio: pid=3719168, err=84/file:io_u.c:2074, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character
fiotest: (groupid=0, jobs=1): err=84 (file:io_u.c:2074, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character): pid=3719168: Thu May 9 15:24:05 2024
  read: IOPS=899, BW=857MiB/s (899MB/s)(2571MiB/3000msec)
    slat (nsec): min=11748, max=32619, avg=12298.16, stdev=1280.36
    clat (usec): min=1773, max=153247, avg=144411.52, stdev=18433.31
     lat (usec): min=1802, max=153262, avg=144423.85, stdev=18433.00
    clat percentiles (msec):
     |  1.00th=[   32],  5.00th=[  148], 10.00th=[  148], 20.00th=[  148],
     | 30.00th=[  148], 40.00th=[  148], 50.00th=[  148], 60.00th=[  148],
     | 70.00th=[  148], 80.00th=[  148], 90.00th=[  148], 95.00th=[  148],
     | 99.00th=[  148], 99.50th=[  148], 99.90th=[  150], 99.95th=[  153],
     | 99.99th=[  155]
  write: IOPS=767, BW=768MiB/s (805MB/s)(29.0GiB/38685msec); 0 zone resets
    slat (usec): min=1269, max=1645, avg=1294.18, stdev=29.28
    clat (usec): min=176, max=181838, avg=165087.18, stdev=6610.13
     lat (usec): min=1459, max=183120, avg=166381.43, stdev=6613.11
    clat percentiles (msec):
     |  1.00th=[  165],  5.00th=[  165], 10.00th=[  165], 20.00th=[  165],
     | 30.00th=[  165], 40.00th=[  165], 50.00th=[  165], 60.00th=[  165],
     | 70.00th=[  165], 80.00th=[  165], 90.00th=[  167], 95.00th=[  169],
     | 99.00th=[  178], 99.50th=[  180], 99.90th=[  182], 99.95th=[  182],
     | 99.99th=[  182]
   bw (  KiB/s): min=530432, max=792576, per=99.19%, avg=779669.95, stdev=39673.92, samples=78
   iops        : min=  518, max=  774, avg=761.38, stdev=38.74, samples=78
  lat (usec)   : 250=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.05%, 50=0.15%
  lat (msec)   : 100=0.25%, 250=99.11%
  cpu          : usr=98.21%, sys=1.35%, ctx=7855, majf=0, minf=717
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=2698,29696,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
   READ: bw=857MiB/s (899MB/s), 857MiB/s-857MiB/s (899MB/s-899MB/s), io=2571MiB (2696MB), run=3000-3000msec
  WRITE: bw=768MiB/s (805MB/s), 768MiB/s-768MiB/s (805MB/s-805MB/s), io=29.0GiB (31.1GB), run=38685-38685msec
Disk stats (read/write):
  nvme0n1: ios=5439/59899, merge=0/479, ticks=2256/5536, in_queue=7791, util=86.85%
But when I read the raw data back from the SSD NAND, I found that it matches the expected CRC in the fio output:
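One way to cross-check the on-disk data independently of fio is to read the failing region back and hash it yourself. This is only a sketch (the path, offset, and length come from the error message above); note that a raw md5 of the whole block will not equal fio's logged CRC directly, since fio's digest excludes the verify header at the start of each interval:

```python
import hashlib

def md5_of_region(path: str, offset: int, length: int) -> str:
    """Read `length` bytes at `offset` from `path` and return the md5 hex digest."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return hashlib.md5(data).hexdigest()

# Values taken from the failure report above:
# md5_of_region("/mnt/nvme0n1/mode_write-bs_1024K-qd_128-size_29G-loop_2325",
#               2694840320, 1048576)
```

Hashing the same region twice, some time apart, can also tell you whether the on-disk data is stable or whether reads are returning different bytes each time.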
But this error is not encountered often, and it could not be reproduced even with the same pattern. So I have collected all the situations where this error occurred:
| FIO Version | Kernel | FIO pattern |
| -- | -- | -- |
| fio-3.13 | 4.18.0-348.el8.x86_64 | `sudo fio -filename=/mnt/nvme0n1/mode_write-bs_1024K-qd_128-size_29G-loop_2325 -size=29G -iodepth=128 -rw=write -bs=1024K -numjobs=1 -name=fiotest -direct=1 -ioengine=libaio -group_reporting -do_verify=1 -verify=md5 -buffer_compress_chunk=4k -buffer_compress_percentage=6` |
| fio-3.36 | 5.14.0-362.18.1.el9_3.x86_64 | `sudo fio -filename=/mnt/nvme0n1/mode_randwrite-job_1-bs_322k-qd_64-size_5G -size=5G -iodepth=64 -rw=randwrite -bs=322k -numjobs=1 -name=unalign_ns_nvme0n1 -direct=1 -ioengine=libaio -group_reporting -do_verify=1 -verify=md5 -verify_interval=4096 -output=fioTest-mode_randwrite-job_1-bs_322k-qd_64-size_5G.sum -random_generator=tausworthe64 -buffer_compress_chunk=512 -buffer_compress_percentage=6` |
| fio-3.13 | 4.18.0-348.el8.x86_64 | `sudo fio -directory=/mnt/nvme1n1p1/disk_0-partitons-1-mode_randwrite-job_64-bs_987K-qd_32-size_7G -size=7G -iodepth=32 -rw=randwrite -bs=987K -numjobs=64 -name=disk_0_partitons_1_1716445886 -direct=1 -ioengine=posixaio -group_reporting -do_verify=1 -verify=md5 -verify_interval=4096 -output=fioTest-disk_0-partitons-1-mode_randwrite-job_64-bs_987K-qd_32-size_7G-26.sum -nrfiles=500 -random_generator=tausworthe64 -buffer_compress_chunk=512 -buffer_compress_percentage=100` |
| fio-3.13 | Linux ubuntu-201-117 5.4.0-150-generic | Write first: `sudo fio --readwrite=randwrite --bs=4k --numjobs=1 --iodepth=32 --runtime=16h -time_based --group_reporting --ioengine=libaio --verify_backlog=1024 --do_verify=1 --size=100% --verify=md5 --filename=/dev/nvme0n1 --buffer_compress_percentage=1 --buffer_compress_chunk=4k --name=stress_randwrite` Then read: `sudo fio --readwrite=read --bs=4k --numjobs=1 --iodepth=32 --runtime=16h -time_based --group_reporting --ioengine=libaio --verify_backlog=1024 --do_verify=1 --size=100% --verify=md5 --filename=/dev/nvme0n1 --buffer_compress_percentage=1 --buffer_compress_chunk=4k --name=stress_read` |

So now I'm wondering whether the SSD stored the data incorrectly, or whether something went wrong when fio verified it. I'm asking for your advice:

1. How does fio obtain the "Received CRC"?
2. Could you please give me an idea of how to continue debugging?

Many thanks!
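Note that some of the jobs above pass -verify_interval=4096, which makes fio verify at 4 KiB granularity inside each block rather than per full write. A minimal sketch of that chunking (a simplification, not fio's actual code):

```python
def intervals(block: bytes, interval: int = 4096):
    """Yield (offset, chunk) pairs: the granularity at which fio verifies
    when verify_interval is set. Each chunk carries its own verify header."""
    for off in range(0, len(block), interval):
        yield off, block[off:off + interval]

block = bytes(1048576)  # one 1024 KiB write, as in the jobs above
chunks = list(intervals(block))
assert len(chunks) == 256                       # 1 MiB / 4 KiB
assert chunks[0][0] == 0 and chunks[-1][0] == 1044480
```

This matters for debugging because a mismatch reported for a 1 MiB block may actually originate in a single 4 KiB interval inside it.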