axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

fio failed with verify: bad header ,error=Illegal byte sequence - nrfiles=62 files and verify=md5 #1538

Open mosheasis opened 1 year ago

mosheasis commented 1 year ago


Description of the bug: verification fails with verify: bad header, error=Illegal byte sequence when running with nrfiles=62 and verify=md5; with nr_files=1 this issue does not occur.

Environment: Red Hat Enterprise Linux 8.7

fio version: fio-3.33

Reproduction steps

{ "fio version" : "fio-3.33", "timestamp" : 1677777812, "time" : "Thu Mar 2 17:23:32 2023", "global options" : { "max_latency" : "20s", "nrfiles" : "62", "runtime" : "30", "numjobs" : "4", "ioengine" : "libaio", "direct" : "1", "verify" : "md5", "verify_dump" : "1", "directory" : "/mnt/volumez/attvol0", "size" : "250M", "time_based" : "1", "group_reporting" : "1", "lat_percentiles" : "1" }, "client_stats" : [ { "jobname" : "seq_0_100_128k", "groupid" : 0, "error" : 84, "job options" : { "rwmixread" : "0", "bs" : "128k", "iodepth" : "1", "rw" : "readwrite" },

horshack-dpreview commented 1 year ago

@mosheasis, Thanks for the submission. I reproduced the issue you found and have a root cause. To work around the issue, please use experimental_verify=1 in your config; this uses an alternate verification mechanism instead of repeating the I/Os.
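
For illustration, a sketch of the reporter's job with the workaround applied; the option values are taken from the JSON in the report (trimmed for brevity), and the only new line is experimental_verify=1:

[global]
ioengine=libaio
direct=1
nrfiles=62
numjobs=4
size=250M
runtime=30
time_based=1
verify=md5
verify_dump=1
directory=/mnt/volumez/attvol0
; workaround: switch to the alternate verification mechanism
experimental_verify=1

[seq_0_100_128k]
rw=readwrite
rwmixread=0
bs=128k
iodepth=1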

I simplified the failing config to the following:

[test]
nrfiles=2
numjobs=1
iodepth=1
ioengine=libaio
verify=md5
directory=/tmp/fio
size=4k
bs=512
rw=readwrite

Which yields the following error: verify: bad header rand_seed 9852480210356360750, wanted 46386204153304124 at file /tmp/fio/test.0.0 offset 0, length 512 (requested block: offset=0, length=512)

If we capture the I/Os via write_iolog=/tmp/iolog.txt we get the following:

fio version 3 iolog
53 /tmp/fio/test.0.0 add
63 /tmp/fio/test.0.1 add
327 /tmp/fio/test.0.0 open
339 /tmp/fio/test.0.0 read 0 512
393 /tmp/fio/test.0.1 open
398 /tmp/fio/test.0.1 read 0 512
414 /tmp/fio/test.0.0 read 512 512
442 /tmp/fio/test.0.1 write 0 512
462 /tmp/fio/test.0.0 read 1024 512
476 /tmp/fio/test.0.1 read 512 512
496 /tmp/fio/test.0.0 write 0 512
518 /tmp/fio/test.0.1 write 512 512
537 /tmp/fio/test.0.0 close
543 /tmp/fio/test.0.1 close
563 /tmp/fio/test.0.0 open
568 /tmp/fio/test.0.0 read 0 512
8507 /tmp/fio/test.0.0 close
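
For reference, the same job can also be run from the command line with the iolog capture included (the options simply mirror the job file above):

./fio --name=test --nrfiles=2 --numjobs=1 --iodepth=1 --ioengine=libaio --verify=md5 --directory=/tmp/fio --size=4k --bs=512 --rw=readwrite --write_iolog=/tmp/iolog.txt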

During the workload phase the first write goes to the second file (test.0.1), but during the verify phase the first read/verify goes to the first file (test.0.0). The logic resets the random verify seed after the workload phase so that the verify phase will regenerate the sequence of expected seeds matching the data that was written. However, verify assumes all the data in the workload was written, so it starts with the first file, expecting its random seed to match the first random seed generated during the workload; yet the first file written during the workload was the second file. The size bound is 4K, and since this is a rw=readwrite workload that 4K covers both reads and writes, the full data won't be written to each file: 1) the 4K size is split across two files, and 2) it is split across reads and writes.
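
Schematically, using the offsets from the iolog above (S1 and S2 are just labels for the first two seeds generated):

workload: test.0.1 write offset 0 -> block header stores S1
          test.0.0 write offset 0 -> block header stores S2
verify:   test.0.0 read  offset 0 -> seeds are regenerated in generation order, so fio
          expects S1, but the block on disk carries S2
          -> "verify: bad header rand_seed ..., wanted ..."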

The fix might be to disable the random seed verification for this case, which would weaken the integrity checking logic. There are lots of paths that run into issues like this, which is probably why a newer verify mechanism was implemented. Checking the git history, that mechanism was added all the way back in 2013. Perhaps it's time to consider making it the default.

vincentkfu commented 1 year ago

I think the root cause is that the order in which the write IOs are replayed for verification differs from how they were originally committed. With this sequential workload we have IOs saved in an RB tree sorted by file and offset. So during the verify phase nodes are removed from the tree in order of file and offset which is not how they were originally written. This may be the root cause of many verification failures.

One way to resolve this is to expand the cases where we simply save the IOs to a list instead of using the RB tree. The patch below allows both the original job file and @horshack-dpreview's simpler one to complete without issue.

diff --git a/iolog.c b/iolog.c
index cc2cbc65..ff1065a9 100644
--- a/iolog.c
+++ b/iolog.c
@@ -291,7 +291,7 @@ void log_io_piece(struct thread_data *td, struct io_u *io_u)
         * to check for duplicate blocks and drop the old one, which we rely on
         * the rb insert/lookup for handling.
         */
-       if (file_randommap(td, ipo->file)) {
+       if (file_randommap(td, ipo->file) || !td_random(td)) {
                INIT_FLIST_HEAD(&ipo->list);
                flist_add_tail(&ipo->list, &td->io_hist_list);
                ipo->flags |= IP_F_ONLIST;

However, I haven't tested the above to see if it induces any regressions.

Based on studying the code and some simple experiments I think traditional verification works for some cases, but other cases (e.g., sequential workloads with multiple files, random workloads with no random map) may be broken as reported in this issue.

vincentkfu commented 1 year ago

I spent a little more time on this and now believe that fio is failing to handle the rw=readwrite case correctly. The job file works for rw=randwrite even though the blocks are not verified in the order they were written. Something is going wrong when we verify after completing a rw=readwrite workload.

A workload with rw=randwrite where verify succeeds even though verify is not carried out in the same order as the original writes:

root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --size=4k --bs=512 --rw=randwrite --verify=md5 --nrfiles=2 --write_iolog=iolog --norandommap
test: (g=0): rw=randwrite, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=psync, iodepth=1
fio-3.34-10-g2fa0
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=32718: Mon Mar 27 18:24:58 2023
  read: IOPS=2000, BW=1000KiB/s (1024kB/s)(3072B/3msec)
    clat (nsec): min=3450, max=22604, avg=9343.50, stdev=8528.33
     lat (nsec): min=4653, max=23940, avg=10793.17, stdev=8451.69
    clat percentiles (nsec):
     |  1.00th=[ 3440],  5.00th=[ 3440], 10.00th=[ 3440], 20.00th=[ 3504],
     | 30.00th=[ 3504], 40.00th=[ 4192], 50.00th=[ 4192], 60.00th=[ 4576],
     | 70.00th=[17792], 80.00th=[17792], 90.00th=[22656], 95.00th=[22656],
     | 99.00th=[22656], 99.50th=[22656], 99.90th=[22656], 99.95th=[22656],
     | 99.99th=[22656]
  write: IOPS=8000, BW=4000KiB/s (4096kB/s)(4096B/1msec); 0 zone resets
    clat (nsec): min=4540, max=47230, avg=11495.88, stdev=14835.57
     lat (nsec): min=9360, max=80670, avg=23726.75, stdev=24866.84
    clat percentiles (nsec):
     |  1.00th=[ 4512],  5.00th=[ 4512], 10.00th=[ 4512], 20.00th=[ 4704],
     | 30.00th=[ 4768], 40.00th=[ 4832], 50.00th=[ 4832], 60.00th=[ 5216],
     | 70.00th=[ 5984], 80.00th=[14656], 90.00th=[47360], 95.00th=[47360],
     | 99.00th=[47360], 99.50th=[47360], 99.90th=[47360], 99.95th=[47360],
     | 99.99th=[47360]
  lat (usec)   : 4=14.29%, 10=57.14%, 20=14.29%, 50=14.29%
  cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=6,8,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1000KiB/s (1024kB/s), 1000KiB/s-1000KiB/s (1024kB/s-1024kB/s), io=3072B (3072B), run=3-3msec
  WRITE: bw=4000KiB/s (4096kB/s), 4000KiB/s-4000KiB/s (4096kB/s-4096kB/s), io=4096B (4096B), run=1-1msec

Disk stats (read/write):
  sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
root@localhost:~/fio-dev/fio-canonical# cat iolog
fio version 3 iolog
41 test.0.0 add
44 test.0.1 add
341 test.0.0 open
378 test.0.0 write 0 512
457 test.0.1 open
479 test.0.1 write 1024 512
508 test.0.0 write 1536 512
532 test.0.1 write 512 512
544 test.0.0 write 512 512
554 test.0.1 write 1536 512
565 test.0.0 write 512 512
576 test.0.1 write 512 512
591 test.0.0 close
600 test.0.1 close
1359 test.0.0 open
1365 test.0.0 read 0 512
1410 test.0.0 read 512 512
1420 test.0.0 read 1536 512
2840 test.0.1 open
2845 test.0.1 read 512 512
2872 test.0.1 read 1024 512
2881 test.0.1 read 1536 512
15841 test.0.0 close
15847 test.0.1 close

The same options with rw=readwrite produce a failure.
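
For reference, that failing run is the command above with only the rw value changed:

./fio --name=test --size=4k --bs=512 --rw=readwrite --verify=md5 --nrfiles=2 --write_iolog=iolog --norandommap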