fio 3.16 doesn't emit error message for corrupted data/header

donny372 commented 4 years ago

Tested with 3.15: md5: verify failed at file /dev/hdi offset 2097152, length 4096 (requested block: offset=2097152, length=4096) Expected CRC: e2c1cad45e5f670ee41147bb177ee79f Received CRC: dd3183fbaff8b9f71864e2bf678d6afa 3;fio-3.15;read;0;84;...

Tested with 3.16: 3;fio-3.16;read;0;0;...

axboe commented 4 years ago

Please include the job file you are running, and any necessary details on how to reproduce.

axboe commented 4 years ago

I've tried both header and data failures with 3.16 and current -git, and I'm getting verify failures logged just fine. So you'll need to provide more information here. Also see:

https://git.kernel.dk/cgit/fio/tree/REPORTING-BUGS

for how to write a good bug report.

sitsofe commented 4 years ago

@donny372 is it possible for you follow up to @axboe's request? Just out of interest, you weren't running a rw=rw/randrw job with verification were you?

donny372 commented 4 years ago

My steps to run the test for corrupt HEADER:

Write data to disk: ./fio --minimal --name=write --rw=write --filename=/dev/hdm --bs=4k --size=2m --offset=0x1000 --verify=md5 --verify_pattern=0xff
Corrupt header: use some tool to writesector 0x1000 /dev/hdm
Read data from disk: ./fio --minimal --name=read --rw=read --filename=/dev/hdm --bs=4k --size=2m --offset=0x1000 --verify=md5 --verify_pattern=0xff

We should see some error message like: verify: bad magic header 0, wanted acca at file /dev/hdm offset 2097152, length 4096

Corrupt DATA is similar to corrupt header, just change step 2: use some tool to writesector 0x1010 /dev/hdm And the error message should be: md5: verify failed at file /dev/hdm offset 2097152, length 4096 (requested block: offset=2097152, length=4096) Expected CRC: e2c1cad45e5f670ee41147bb177ee79f Received CRC: dd3183fbaff8b9f71864e2bf678d6afa

sitsofe commented 4 years ago

@donny372

$ ./fio --minimal --name=write --rw=write --filename=fio.tmp --bs=4k --size=8k --offset=0x1000 --verify=md5 --verify_pattern=0xff
3;fio-3.16-65-gfd1ed-dirty;write;0;0;8;4000;1000;2;0;0;0.000000;0.000000;14;803;409.079000;557.546626;1.000000%=14;5.000000%=14;10.000000%=14;20.000000%=14;30.000000%=14;40.000000%=14;50.000000%=14;60.000000%=806;70.000000%=806;80.000000%=806;90.000000%=806;95.000000%=806;99.000000%=806;99.500000%=806;99.900000%=806;99.950000%=806;99.990000%=806;0%=0;0%=0;0%=0;16;804;410.043000;557.248227;0;0;0.000000%;0.000000;0.000000;8;8000;2000;1;0;0;0.000000;0.000000;10;65;37.746500;38.980676;1.000000%=10;5.000000%=10;10.000000%=10;20.000000%=10;30.000000%=10;40.000000%=10;50.000000%=10;60.000000%=65;70.000000%=65;80.000000%=65;90.000000%=65;95.000000%=65;99.000000%=65;99.500000%=65;99.900000%=65;99.950000%=65;99.990000%=65;0%=0;0%=0;0%=0;59;132;96.022500;51.343730;0;0;0.000000%;0.000000;0.000000;0.000000%;0.000000%;1;0;46;100.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.00%;0.00%;0.00%;50.00%;0.00%;25.00%;0.00%;0.00%;0.00%;25.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%
$ hexdump -C fio.tmp 
00000000  c3 bf 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  ca ac 02 00 00 10 00 00  00 00 00 00 00 00 00 00  |................|
00001010  00 10 00 00 00 00 00 00  43 eb 22 00 43 26 a0 24  |........C.".C&.$|
00001020  01 00 00 00 e0 55 39 d1  e2 c1 ca d4 5e 5f 67 0e  |.....U9.....^_g.|
00001030  e4 11 47 bb 17 7e e7 9f  ff ff ff ff ff ff ff ff  |..G..~..........|
00001040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00002000  ca ac 02 00 00 10 00 00  00 00 00 00 00 00 00 00  |................|
00002010  00 20 00 00 00 00 00 00  43 eb 22 00 30 85 a2 24  |. ......C.".0..$|
00002020  01 00 01 00 ff e4 dd f3  e2 c1 ca d4 5e 5f 67 0e  |............^_g.|
00002030  e4 11 47 bb 17 7e e7 9f  ff ff ff ff ff ff ff ff  |..G..~..........|
00002040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00003000

So byte 0x1010 (4112) is all zeros already and "changing" it to zero is just writing the data that's already there (which by definition is not corruption!). Let's write all ones instead:

$ LC_ALL=C tr '\0' '\377' < /dev/zero | dd of=fio.tmp bs=1 seek=4112 conv=notrunc count=1
1+0 records in
1+0 records out
1 bytes transferred in 0.002092 secs (478 bytes/sec)
$ hexdump -C fio.tmp 
00000000  c3 bf 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  ca ac 02 00 00 10 00 00  00 00 00 00 00 00 00 00  |................|
00001010  ff 10 00 00 00 00 00 00  43 eb 22 00 43 26 a0 24  |........C.".C&.$|
00001020  01 00 00 00 e0 55 39 d1  e2 c1 ca d4 5e 5f 67 0e  |.....U9.....^_g.|
00001030  e4 11 47 bb 17 7e e7 9f  ff ff ff ff ff ff ff ff  |..G..~..........|
00001040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00002000  ca ac 02 00 00 10 00 00  00 00 00 00 00 00 00 00  |................|
00002010  00 20 00 00 00 00 00 00  43 eb 22 00 30 85 a2 24  |. ......C.".0..$|
00002020  01 00 01 00 ff e4 dd f3  e2 c1 ca d4 5e 5f 67 0e  |............^_g.|
00002030  e4 11 47 bb 17 7e e7 9f  ff ff ff ff ff ff ff ff  |..G..~..........|
00002040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00003000

OK let's verify:

$ ./fio --minimal --name=read --rw=read --filename=fio.tmp --bs=4k --size=8k --offset=0x1000 --verify=md5 --verify_pattern=0xff
verify: bad header offset 4351, wanted 4096 at file fio.tmp offset 4096, length 4096 (requested block: offset=4096, length=4096)
fio: pid=16526, err=92/file:io_u.c:2031, func=io_u_sync_complete, error=Illegal byte sequence
3;fio-3.16-65-gfd1ed-dirty;read;0;92;4;4000;1000;1;0;0;0.000000;0.000000;51;51;51.870000;0.000000;1.000000%=51;5.000000%=51;10.000000%=51;20.000000%=51;30.000000%=51;40.000000%=51;50.000000%=51;60.000000%=51;70.000000%=51;80.000000%=51;90.000000%=51;95.000000%=51;99.000000%=51;99.500000%=51;99.900000%=51;99.950000%=51;99.990000%=51;0%=0;0%=0;0%=0;58;58;58.375000;0.000000;0;0;0.000000%;0.000000;0.000000;0;0;0;0;0;0;0.000000;0.000000;0;0;0.000000;0.000000;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=0;95.000000%=0;99.000000%=0;99.500000%=0;99.900000%=0;99.950000%=0;99.990000%=0;0%=0;0%=0;0%=0;0;0;0.000000;0.000000;0;0;0.000000%;0.000000;0.000000;0.000000%;0.000000%;0;0;47;100.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.00%;0.00%;0.00%;0.00%;0.00%;100.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%

So I get the bad header message as expected and can't reproduce the error you described. Can you offer any more details?

donny372 commented 4 years ago

OK. I verified again, fio-3.16 doesn't work for me. I think the difference here is I'm writing to an unmounted drive and you are writing to a file.

axboe commented 4 years ago

Really shouldn't make any difference at all to fio, they are both flies as far as fio is concerned.

sitsofe commented 4 years ago

@donny372 I agree with Jens - I'm not going to get different results just because it's a file in a filesystem rather than a file representing a device. Just to reiterate: we still need some extra information from you:

Can you make the problem happen on the first I/O and (ideally) using relatively little I/O?
Can you show us the block contents before and after corruption?
How exactly are you corrupting the data?
Does repeating the steps I gave but changing the filename demonstrate an issue for you?
What is your disk (SATA SSD/iSCSI disk/NVMe etc)? /dev/hdm is an unusual device node these days...
What OS/kernel are you running?

donny372 commented 4 years ago

Not sure why, but I got different result in my way fio-3.15(and before versions) vs fio-3.16. I also followed your steps and got error message with fio-3.16.

donny372 commented 4 years ago

I found something interesting: The tool I'm using to corrupt data doesn't change the content before read, and fio-3.15 still can capture the error and after the read, the content updated.

/home/fio-3.15/fio_testdev --minimal --name=write --rw=write --filename=/dev/hdm --bs=4k --size=2m --offset=0x0000 --verify=md5 3;fio-3.15;write;0;0;2048;40156;10039;51;0;0;0.000000;0.000000;0;25526;57.821004;1128.933822;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=1;95.000000%=1;99.000000%=317;99.500000%=391;99.900000%=25559;99.950000%=25559;99.990000%=25559;0%=0;0%=0;0%=0;0;25526;57.846238;1128.933642;0;0;0.000000%;0.000000;0.000000;2048;227555;56888;9;0;0;0.000000;0.000000;1;130;4.265854;13.981036;1.000000%=1;5.000000%=1;10.000000%=1;20.000000%=2;30.000000%=2;40.000000%=2;50.000000%=2;60.000000%=2;70.000000%=2;80.000000%=2;90.000000%=3;95.000000%=4;99.000000%=93;99.500000%=123;99.900000%=130;99.950000%=130;99.990000%=130;0%=0;0%=0;0%=0;13;142;16.666918;14.122332;0;0;0.000000%;0.000000;0.000000;25.423729%;1.694915%;18;0;52;100.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.0%;56.45%;39.65%;1.46%;0.20%;0.20%;0.39%;0.78%;0.68%;0.10%;0.00%;0.00%;0.00%;0.00%;0.00%;0.10%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;sdg;0;0;0;0;0;0;0;0.00% hexdump -C /dev/hdm -n 20 00000000 ca ac 02 00 00 10 00 00 3c 54 09 c5 01 cc a4 00 |ʬ......<T.�.̤.| 00000010 00 00 00 00 |....| 00000014 ./tool-danger datacmd writesector 0x0000 /dev/hdm hexdump -C /dev/hdm -n 20 The data doesn't change before read 00000000 ca ac 02 00 00 10 00 00 3c 54 09 c5 01 cc a4 00 |ʬ......<T.�.̤.| 00000010 00 00 00 00 |....| 00000014 /home/fio-3.15/fio_testdev --minimal --name=read --rw=read --filename=/dev/hdm --bs=4k --size=2m --offset=0x0000 --verify=md5 --verify_pattern=0xff verify: bad magic header 0, wanted acca at file /dev/hdm offset 0, length 4096 (requested block: offset=0, length=4096) fio: pid=586091, err=84/file:third_party/fio/io_u.c:2029, func=io_u_sync_complete, error=Invalid or incomplete multibyte or wide character 3;fio-3.15;read;0;84;4;137;34;29;0;0;0.000000;0.000000;27834;27834;27834.800000;0.000000;1.000000%=27918;5.000000%=27918;10.000000%=27918;20.000000%=27918;30.000000%=27918;40.000000%=27918;50.000000%=27918;60.000000%=27918;70.000000%=27918;80.000000%=27918;90.000000%=27918;95.000000%=27918;99.000000%=27918;99.500000%=27918;99.900000%=27918;99.950000%=27918;99.990000%=27918;0%=0;0%=0;0%=0;27837;27837;27837.187000;0.000000;0;0;0.000000%;0.000000;0.000000;0;0;0;0;0;0;0.000000;0.000000;0;0;0.000000;0.000000;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=0;95.000000%=0;99.000000%=0;99.500000%=0;99.900000%=0;99.950000%=0;99.990000%=0;0%=0;0%=0;0%=0;0;0;0.000000;0.000000;0;0;0.000000%;0.000000;0.000000;0.000000%;0.000000%;1;0;85;100.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;100.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;sdg;0;0;0;0;0;0;0;0.00% hexdump -C /dev/hdm -n 20 The data changed after read 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

00000010

donny372 commented 4 years ago

Thanks @sitsofe and @axboe! I think it is something wrong with tool, closing this bug.

sitsofe commented 4 years ago

@donny372 something to note is that you're doing buffered I/O. If you are corrupting an LBA on the disk behind the kernel's back and if it (the kernel) has cached that LBA, it won't know it needs to reread it and will serve you data out of the cache. You may want to use direct=1 with fio to force bypass of the cache (the need to go direct would also apply to checking the region via a tool like dd too).

axboe / fio

fio 3.16 doesn't emit error message for corrupted data/header #872