axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

md5 verify failed but raw data in SSD is same with expected CRC #1771

Closed YanLi-SFX closed 1 month ago

YanLi-SFX commented 3 months ago


Description of the bug: Hi, I've encountered a few fio errors like this one when testing SSDs:

```
sudo fio -filename=/mnt/nvme0n1/mode_write-bs_1024K-qd_128-size_29G-loop_2325 -size=29G -iodepth=128 -rw=write -bs=1024K -numjobs=1 -name=fiotest -direct=1 -ioengine=libaio -group_reporting -do_verify=1 -verify=md5 -buffer_compress_chunk=4k -buffer_compress_percentage=6

fiotest: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
fio-3.13
Starting 1 process
fiotest: Laying out IO file (1 file / 29696MiB)
md5: verify failed at file /mnt/nvme0n1/mode_write-bs_1024K-qd_128-size_29G-loop_2325 offset 2694840320, length 1048576 (requested block: offset=2694840320, length=1048576)
       Expected CRC: 5109c53320c043225ef276d4535a6b32
       Received CRC: 59c686884a95622702a18f57a494c5a4
fio: pid=3719168, err=84/file:io_u.c:2074, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character
```

```
fiotest: (groupid=0, jobs=1): err=84 (file:io_u.c:2074, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character): pid=3719168: Thu May 9 15:24:05 2024
  read: IOPS=899, BW=857MiB/s (899MB/s)(2571MiB/3000msec)
    slat (nsec): min=11748, max=32619, avg=12298.16, stdev=1280.36
    clat (usec): min=1773, max=153247, avg=144411.52, stdev=18433.31
     lat (usec): min=1802, max=153262, avg=144423.85, stdev=18433.00
    clat percentiles (msec):
     |  1.00th=[   32],  5.00th=[  148], 10.00th=[  148], 20.00th=[  148],
     | 30.00th=[  148], 40.00th=[  148], 50.00th=[  148], 60.00th=[  148],
     | 70.00th=[  148], 80.00th=[  148], 90.00th=[  148], 95.00th=[  148],
     | 99.00th=[  148], 99.50th=[  148], 99.90th=[  150], 99.95th=[  153],
     | 99.99th=[  155]
  write: IOPS=767, BW=768MiB/s (805MB/s)(29.0GiB/38685msec); 0 zone resets
    slat (usec): min=1269, max=1645, avg=1294.18, stdev=29.28
    clat (usec): min=176, max=181838, avg=165087.18, stdev=6610.13
     lat (usec): min=1459, max=183120, avg=166381.43, stdev=6613.11
    clat percentiles (msec):
     |  1.00th=[  165],  5.00th=[  165], 10.00th=[  165], 20.00th=[  165],
     | 30.00th=[  165], 40.00th=[  165], 50.00th=[  165], 60.00th=[  165],
     | 70.00th=[  165], 80.00th=[  165], 90.00th=[  167], 95.00th=[  169],
     | 99.00th=[  178], 99.50th=[  180], 99.90th=[  182], 99.95th=[  182],
     | 99.99th=[  182]
   bw (  KiB/s): min=530432, max=792576, per=99.19%, avg=779669.95, stdev=39673.92, samples=78
   iops        : min=  518, max=  774, avg=761.38, stdev=38.74, samples=78
  lat (usec)   : 250=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.05%, 50=0.15%
  lat (msec)   : 100=0.25%, 250=99.11%
  cpu          : usr=98.21%, sys=1.35%, ctx=7855, majf=0, minf=717
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=2698,29696,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=857MiB/s (899MB/s), 857MiB/s-857MiB/s (899MB/s-899MB/s), io=2571MiB (2696MB), run=3000-3000msec
  WRITE: bw=768MiB/s (805MB/s), 768MiB/s-768MiB/s (805MB/s-805MB/s), io=29.0GiB (31.1GB), run=38685-38685msec

Disk stats (read/write):
  nvme0n1: ios=5439/59899, merge=0/479, ticks=2256/5536, inqueue=7791, util=86.85%
```

But when I read the raw data back from the SSD NAND, I found that it matches the expected CRC in the fio output (see the attached screenshot).

However, this error occurs only rarely, and I could not reproduce it even with the same pattern. So I collected all the situations in which it has occurred:

| fio version | Kernel | fio command |
| --- | --- | --- |
| fio-3.13 | 4.18.0-348.el8.x86_64 | `sudo fio -filename=/mnt/nvme0n1/mode_write-bs_1024K-qd_128-size_29G-loop_2325 -size=29G -iodepth=128 -rw=write -bs=1024K -numjobs=1 -name=fiotest -direct=1 -ioengine=libaio -group_reporting -do_verify=1 -verify=md5 -buffer_compress_chunk=4k -buffer_compress_percentage=6` |
| fio-3.36 | 5.14.0-362.18.1.el9_3.x86_64 | `sudo fio -filename=/mnt/nvme0n1/mode_randwrite-job_1-bs_322k-qd_64-size_5G -size=5G -iodepth=64 -rw=randwrite -bs=322k -numjobs=1 -name=unalign_ns_nvme0n1 -direct=1 -ioengine=libaio -group_reporting -do_verify=1 -verify=md5 -verify_interval=4096 -output=fioTest-mode_randwrite-job_1-bs_322k-qd_64-size_5G.sum -random_generator=tausworthe64 -buffer_compress_chunk=512 -buffer_compress_percentage=6` |
| fio-3.13 | 4.18.0-348.el8.x86_64 | `sudo fio -directory=/mnt/nvme1n1p1/disk_0-partitons-1-mode_randwrite-job_64-bs_987K-qd_32-size_7G -size=7G -iodepth=32 -rw=randwrite -bs=987K -numjobs=64 -name=disk_0_partitons_1_1716445886 -direct=1 -ioengine=posixaio -group_reporting -do_verify=1 -verify=md5 -verify_interval=4096 -output=fioTest-disk_0-partitons-1-mode_randwrite-job_64-bs_987K-qd_32-size_7G-26.sum -nrfiles=500 -random_generator=tausworthe64 -buffer_compress_chunk=512 -buffer_compress_percentage=100` |
| fio-3.13 | Linux ubuntu-201-117 5.4.0-150-generic | Write first: `sudo fio --readwrite=randwrite --bs=4k --numjobs=1 --iodepth=32 --runtime=16h -time_based --group_reporting --ioengine=libaio --verify_backlog=1024 --do_verify=1 --size=100% --verify=md5 --filename=/dev/nvme0n1 --buffer_compress_percentage=1 --buffer_compress_chunk=4k --name=stress_randwrite`<br>Then read: `sudo fio --readwrite=read --bs=4k --numjobs=1 --iodepth=32 --runtime=16h -time_based --group_reporting --ioengine=libaio --verify_backlog=1024 --do_verify=1 --size=100% --verify=md5 --filename=/dev/nvme0n1 --buffer_compress_percentage=1 --buffer_compress_chunk=4k --name=stress_read` |

So now I'm wondering whether the SSD stored the data incorrectly, or whether something went wrong when fio verified it. I'd appreciate your advice on two points:

1. How does fio obtain the "Received CRC"?
2. Could you suggest an approach for further debugging?

Many thanks!

YanLi-SFX commented 3 months ago

It looks like the table in my original report did not display correctly, so here it is again as a screenshot:

vincentkfu commented 3 months ago

For each write operation, fio by default places a header at the beginning of the buffer that includes the checksum and some other fields; the rest of the buffer holds random data. What happened here is that fio calculated an md5 checksum over the data it read back from your file, compared it to the value stored in the header, and found a mismatch.

My advice is to reduce your job to the minimum set of options that triggers this issue, then try to find a way to reproduce it reliably. You may also wish to run with --debug=io,verify to see whether the debug messages shed any light on it. Finally, try the latest version of fio.
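To make the mechanism concrete, here is a rough sketch of the write-then-verify flow in plain Python. This is not fio's actual code (fio implements it in C in `verify.c`, and its real verify header has more fields and a different layout); the 16-byte "header is just the md5 digest" layout below is a simplification for illustration:

```python
import hashlib
import os

HEADER_LEN = 16  # simplification: header holds only the 16-byte md5 digest

def make_buffer(payload: bytes) -> bytes:
    """At write time, store a checksum of the payload in a header
    prepended to the buffer (conceptually like fio's verify header)."""
    digest = hashlib.md5(payload).digest()
    return digest + payload

def verify_buffer(buf: bytes) -> bool:
    """At verify time, recompute md5 over the data read back and
    compare it with the stored checksum. The stored value is the
    'Expected CRC' and the recomputed one the 'Received CRC'."""
    expected = buf[:HEADER_LEN]                        # written at write time
    received = hashlib.md5(buf[HEADER_LEN:]).digest()  # from data read back
    return expected == received

payload = os.urandom(4096)
buf = make_buffer(payload)
assert verify_buffer(buf)          # clean round trip verifies OK

# Flip one bit to simulate corruption on the medium or in transit:
corrupted = bytearray(buf)
corrupted[HEADER_LEN + 100] ^= 0x01
assert not verify_buffer(bytes(corrupted))  # verify now fails
```

The point of the sketch: a mismatch means the bytes fio read back differ from the bytes it wrote, wherever that difference was introduced (drive, filesystem, DMA, or a buffer reused before the write completed).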

YanLi-SFX commented 3 months ago

Many thanks for your advice! I will try it and update if I find something new.