axboe / fio

Flexible I/O Tester

Fio verify failing when running with verify_backlog #792

Open Vladyyy opened 5 years ago

Vladyyy commented 5 years ago

Hi,

When running fio with the config below on a raw block device, verification fails.

[global]
direct=1
ioengine=libaio
verify=crc32c
verify_fatal=1
verify_dump=1
rw=randrw
iodepth=16
time_based=1
verify_backlog=10
bs=256k
runtime=${RUNTIME}

[/dev/datalog]

Looking at the debug log, it seems there is a race between the write and the read that verifies the data: the read is submitted before the write completes.

io 12112 fill: io_u 0x953f00: off=0x30c0000,len=0x40000,ddir=1,file=/dev/xvdbd
io 12112 prep: io_u 0x953f00: off=0x30c0000,len=0x40000,ddir=1,file=/dev/xvdbd
io 12112 queue: io_u 0x953f00: off=0x30c0000,len=0x40000,ddir=1,file=/dev/xvdbd
….
io 12112 prep: io_u 0x954240: off=0x30c0000,len=0x40000,ddir=0,file=/dev/xvdbd
io 12112 queue: io_u 0x954240: off=0x30c0000,len=0x40000,ddir=0,file=/dev/xvdbd
….
io 12112 complete: io_u 0x954240: off=0x30c0000,len=0x40000,ddir=0,file=/dev/xvdbd

crc32c: verify failed at file /dev/xvdbd offset 51118080, length 262144 (requested block: offset=51118080, length=262144)
/dev/xvdbd: (groupid=0, jobs=1): err=84 (file:io_u.c:2068, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character): pid=12112: Fri Jun 14 11:27:37 2019
?io 12112 complete: io_u 0x951ec0: off=0x2b80000,len=0x40000,ddir=0,file=/dev/xvdbd

I'm wondering if the above behavior is expected. Running with serialize_overlap solves the issue; however, my understanding is that verify_backlog should work without specifying serialize_overlap.

sitsofe commented 5 years ago

@Vladyyy this comes down to what happens when you run a verifying rw=(rw|randrw) job.

verify=crc32c
[...]
rw=randrw
[...]
time_based=1
[...]
runtime=${RUNTIME}

The "problem" arises when you create a verify job that does reads (rather than only writes). Doing so actually tells fio to try and verify the data that is already on the disk. This is somewhat reasonable when it's a pure read job but becomes complicated when it's a read and write job (leading to the questions like "what should get verified and when?", "What happens when looping occurs?" etc.). For now, my recommendation is: don't create rw=rw or rw=randrw jobs that also verifying. Instead, create a rw=write / rw=randwrite with verify_backlog=[num] and do_verify=0 and if you must have some extra random reads (beyond what the verifying does) in the mix do so via another job that isn't doing verification.

This is likely a duplicate of issue #746.

sitsofe commented 5 years ago

@Vladyyy now I'm wondering whether my previous analysis is correct, having re-read what you wrote:

Looking at the debug log, it seems there is a race between the write and the read that verifies the data: the read is submitted before the write completes.

It would be interesting to know what submitted the read - was it a "regular" read of data fio knew it had written previously (due to rw=randrw, verify=crc32c etc) or was it a verify_backlog-initiated read? I'm not sure this is easy to find out though.

I'm wondering if the above behavior is expected.

Expected? Maybe. Desirable? Unlikely! My guess is that this is happening at the point that fio is forced to loop (random) write I/O. Say you've got an iodepth of 10 and there are 5 reads in flight from the old pass; if you're doing random I/O there's a non-zero chance that one of your new writes could be to a region still being read. I suppose if your I/O is sequential then it's impossible for this to happen unless the region you're doing writes over is small...

I suppose you could drain all old pass I/Os BEFORE allowing the new pass to start when doing verification that could loop... I don't know how happy people will be about doing it by default (I suppose you could put in a --no-drain option for those who prefer the old way) but it would be less costly than checking overlaps on all I/O all the time. We would also have to be aware that this only gets us out of trouble in specific cases (verified looping I/O with uniformly distributed blocks) because if a user sets norandommap then all bets are off and you pretty much have to check for overlaps on all in-flight I/O all the time.

@axboe any thoughts on this one?

Vladyyy commented 5 years ago

@sitsofe

It would be interesting to know what submitted the read - was it a "regular" read of data fio knew it had written previously (due to rw=randrw, verify=crc32c etc) or was it a verify_backlog-initiated read? I'm not sure this is easy to find out though.

I'm able to reproduce this with a randwrite workload, so it's the verify_backlog parameter.

[global]
direct=1
ioengine=libaio
verify=crc32c
verify_fatal=1
verify_dump=1
rw=randwrite
iodepth=32
bs=256k
time_based=1
verify_backlog=10
runtime=96000

[/dev/datalog]

My guess is that this is happening at the point that fio is forced to loop (random) write I/O

I think your intuition is correct. If I set the size parameter to a small value (to force the loops to occur more frequently), it happens more often.
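
For example (illustrative only; the actual value used isn't stated in the thread), adding a small working-set size to the [global] section of the repro config above makes fio wrap around and re-write already-verified offsets much sooner:

size=64m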

Forgot to mention: I'm using fio version 3.14.

sitsofe commented 5 years ago

@Vladyyy OK, thanks for the update. For now (as you stated in your first comment) people will have to use serialize_overlap to avoid triggering this...
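
For reference, the workaround amounts to adding the following line to the [global] section of the configs above:

serialize_overlap=1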

bardavid commented 4 years ago

there are some issues between get_next_verify and log_io_piece, unless there is a deeper i/o synchronization i'm missing:

  1. when fio uses randommap then it will append each io_piece to a linked list, not caring for overwrites of unverified ios. get_next_verify could get an obsolete io_piece (offset already overwritten).
  2. when fio uses norandommap then overwrites take care to remove old ios using a rbtree, however they are not synchronizing with in-flight verification running on it.
  3. i think overall locking when adding/removing elements from either structure is missing. given that write and verify flows may run on different threads, this should be fixed

i'm not sure why fio optimizes the randommap case where it uses a linked list instead of having a unified data structure to index io_pieces; it seems to me that the overhead of the rbtree should be rather small since it contains only outstanding ios

i propose unifying the list and rbtree structures to a single rbtree and adding the proper synchronization of overwrites vs. verifications..

if you're okay with it i can start coding it..

@sitsofe what do you think?

sitsofe commented 4 years ago

@bardavid sorry for the slow reply.

i'm not sure why fio optimizes the randommap case where it uses a linked list instead of having a unified data structure to index io_pieces; it seems to me that the overhead of the rbtree should be rather small since it contains only outstanding ios

The only clue I found seems to be 8de8f047bd025f12d23cfc3fc1793434c6d8ff94, which suggests it was done for speed reasons. However, I wonder if a big motivator disappeared when f31feaa21642929b6d9d5396b73669372fda9a0a dropped verifysort.

The only other thing to be aware of is experimental_verify, which depends on I/Os being done and completed in the exact order they were generated.

@axboe I can't see a reason not to just always use the rbtree for verification - is maintaining it such a big overhead?

i propose unifying the list and rbtree structures to a single rbtree and adding the proper synchronization of overwrites vs. verifications..

This sounds like the right thing to do to me. Verification will become a touch slower but at least it will be correct. I suppose you only have to do the locking in cases where you know verification could race with writes (randrw verification, verify_async with a write job) - see the sketch below.
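
Purely as an illustration of the second case (not a config from this thread), a job sketch where verification is offloaded to separate threads via verify_async and can therefore race with the in-flight writes:

[global]
direct=1
ioengine=libaio
filename=/dev/datalog
bs=256k
iodepth=16

[async-verified-writes]
rw=randwrite
verify=crc32c
verify_backlog=10
verify_async=2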

bardavid commented 4 years ago

@sitsofe thanks for your response. do you think i should wait for Jens' response before i start coding the proposal?

sitsofe commented 4 years ago

@bardavid :-) I vote for you to start - you've got a good track record of working on necessary things...

sitsofe commented 4 years ago

@bardavid arguably would 2. be taken care of by serialize_overlap=1?

bardavid commented 4 years ago

@bardavid arguably would 2. be taken care of by serialize_overlap=1?

i think yes.

since serialize_overlap pushes back io submissions originating both from verify and write if they overlap with something in flight, it should make the overwrite and verify mutually exclusive.

overall, i think we should eliminate the serialize_overlap logic from io_u_submit() since it's O(n) over the in-flight ios and we're already aiming to do the same thing using the rbtree, which is O(log n), effectively achieving overlap serialization without the user requesting it. moreover, we miss some verification chances when we overwrite, since the rbtree overlap logic removes the old entry and that happens before the io_u_submit pushback... it is better, imo, to serialize the writes and read-verify both ios

on another matter,

i see that if serialize_overlap is used in conjunction with io_submit_mode=offload then serialization happens across different jobs. reading my comment again on missing locking on the rbtree/linked list, and looking at the code again, it seems that these lists are being accessed by a single thread controlling the job (whether submitting writes or verify-reads, access to either structure seems to be single threaded). so i think we are already covered there.

@sitsofe do you think the rbtree should serialize writes originating from all jobs as well (currently there's an rbtree per job, so unless i globalize the structure, it won't)?