Open · Vladyyy opened this issue 5 years ago
@Vladyyy this comes down to what happens when you run a rw=(rw|randrw) job.
```
verify=crc32c
[...]
rw=randrw
[...]
time_based=1
[...]
runtime=${RUNTIME}
```
The "problem" arises when you create a verify job that does reads (rather than only writes). Doing so actually tells fio to try and verify the data that is already on the disk. This is somewhat reasonable when it's a pure read job but becomes complicated when it's a read and write job (leading to the questions like "what should get verified and when?", "What happens when looping occurs?" etc.). For now, my recommendation is: don't create rw=rw or rw=randrw jobs that also verifying. Instead, create a rw=write / rw=randwrite with verify_backlog=[num] and do_verify=0 and if you must have some extra random reads (beyond what the verifying does) in the mix do so via another job that isn't doing verification.
This is likely a duplicate of issue #746 .
@Vladyyy I'm now wondering whether my previous analysis is correct, having re-read what you wrote:
When looking at the debug log it seems there is a race between the write and the read that verifies the data. The read is submitted before the write completes.
It would be interesting to know what submitted the read - was it a "regular" read that happened to be of data fio knew it had written previously (due to rw=randrw, verify=crc32c etc.) or was it a verify_backlog-initiated read? I'm not sure this is easy to find out though.
I'm wondering if the above behavior is expected.
Expected? Maybe. Desirable? Unlikely! My guess is that this is happening at the point that fio is forced to loop (random) write I/O. Say you've got an iodepth of 10 and there are 5 reads in flight from the old pass; if you're doing random I/O there's a non-zero chance that one of your new writes could be to a region still being read. I suppose if your I/O is sequential then it's impossible for this to happen unless the region you're doing writes over is small...
I suppose you could drain all old pass I/Os BEFORE allowing the new pass to start when doing verification that could loop... I don't know how happy people will be about doing it by default (I suppose you could put in a --no-drain option for those who prefer the old way) but it would be less costly than checking overlaps on all I/O all the time. We would also have to be aware that this only gets us out of trouble in specific cases (verified looping I/O with uniformly distributed blocks) because if a user sets norandommap then all bets are off and you pretty much have to check for overlaps on all in-flight I/O all the time.
@axboe any thoughts on this one?
@sitsofe
It would be interesting to know what submitted the read - was it a "regular" read that happened to be of data fio knew it had written previously (due to rw=randrw, verify=crc32c etc.) or was it a verify_backlog-initiated read? I'm not sure this is easy to find out though.
I'm able to reproduce this with a randwrite workload, so it's the verify_backlog parameter.
```
[global]
direct=1
ioengine=libaio
verify=crc32c
verify_fatal=1
verify_dump=1
rw=randwrite
iodepth=32
bs=256k
time_based=1
verify_backlog=10
runtime=96000

[/dev/datalog]
```
My guess is that this is happening at the point that fio is forced to loop (random) write I/O
I think your intuition is correct. If I set the size parameter to a small value (to force the loops to occur more frequently) it happens more often.
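For instance (a sketch; the value is illustrative, not the one actually used), adding a small size= to the [global] section of the job above makes a time_based run wrap around the region frequently:

```
size=64m   ; small region, so time_based forces fio to loop over it often
```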
Forgot to mention, I'm using fio version 3.14.
@Vladyyy OK thanks for the update. For now (as you stated in your first comment) people will have to use serialize_overlap to avoid triggering this...
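For reference, that workaround is just the reproduction config above with one extra line (only serialize_overlap=1 is new here):

```
[global]
direct=1
ioengine=libaio
verify=crc32c
verify_fatal=1
verify_dump=1
rw=randwrite
iodepth=32
bs=256k
time_based=1
verify_backlog=10
serialize_overlap=1   ; hold back any I/O that overlaps one in flight
runtime=96000

[/dev/datalog]
```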
There are some issues between get_next_verify and log_io_piece, unless there is a deeper I/O synchronization I'm missing:
get_next_verify could get an obsolete io_piece (one whose offset has already been overwritten). I'm not sure why fio optimizes the randommap case by using a linked list instead of having a unified data structure to index io_pieces; it seems to me that the overhead of the rbtree should be rather small since it contains only outstanding I/Os.
I propose unifying the list and rbtree structures into a single rbtree and adding proper synchronization of overwrites vs. verifications.
If you're okay with it I can start coding it.
@sitsofe what do you think?
@bardavid sorry for the slow reply.
I'm not sure why fio optimizes the randommap case by using a linked list instead of having a unified data structure to index io_pieces; it seems to me that the overhead of the rbtree should be rather small since it contains only outstanding I/Os.
The only clue I found seems to be 8de8f047bd025f12d23cfc3fc1793434c6d8ff94, which suggests it was done for speed reasons. However, I wonder if a big motivator disappeared when f31feaa21642929b6d9d5396b73669372fda9a0a dropped verifysort.
The only other thing to be aware of is experimental_verify, which depends on I/Os being done and completed in the exact order they were generated.
@axboe I can't see a reason not to just always use the rbtree for verification - is maintaining it such a big overhead?
I propose unifying the list and rbtree structures into a single rbtree and adding proper synchronization of overwrites vs. verifications.
This sounds like the right thing to do to me. Verification will become a touch slower but at least it will be correct. I suppose you only have to do the locking in cases where you know verification could race with writes (randrw verification, verify_async with a write job).
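For the second case, a minimal sketch of such a verify_async write job (the device path, thread count and other values are placeholders): verify_async=N moves the checksum verification work into N helper threads while the job thread keeps submitting writes, which is where the potential race lies:

```
[global]
ioengine=libaio
direct=1
filename=/dev/sdX    ; placeholder device

[write-with-async-verify]
rw=randwrite
bs=4k
iodepth=16
verify=crc32c
verify_async=4       ; offload verification to 4 async threads
verify_backlog=10    ; trigger verify reads while writes are still flowing
```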
@sitsofe thanks for your response. Do you think I should wait for Jens' response before I start coding the proposal?
@bardavid :-) I vote for you to start - you've got a good track record of working on necessary things...
@bardavid arguably would 2. be taken care of by serialize_overlap=1?
@bardavid arguably would 2. be taken care of by serialize_overlap=1?
I think yes. Since serialize_overlap pushes back I/O submissions originating from both verify and write if they overlap with something in flight, it should make the overwrite and the verify mutually exclusive.
Overall, I think we should eliminate the serialize_overlap logic from io_u_submit(), since it's O(n) on the in-flight I/Os and we're already aiming to do the same thing using the rbtree, which is O(log n), effectively achieving overlap serialization without the user requesting it. Moreover, we miss some verification chances when we overwrite, since the rbtree overlap logic removes the old entry and that happens before the io_u_submit pushback... it is better, IMO, to serialize the writes and read-verify both I/Os.
On another matter, I see that if serialize_overlap is used in conjunction with io_submit_mode=offload then serialization happens across different jobs.
Reading my comment again about missing locking on the rbtree/linked list, and looking at the code again, it seems that these structures are only accessed by the single thread controlling the job (whether it is submitting writes or verify-reads, access to either structure appears to be single-threaded), so I think we are already covered there.
@sitsofe do you think the rbtree should serialize writes originating from all jobs as well? (Currently there's an rbtree per job, so unless I globalize the structure, it won't.)
Hi,
When running the FIO config below on a raw block device, fio verification fails.
When looking at the debug log it seems there is a race between the write and the read that verifies the data. The read is submitted before the write completes.
I'm wondering if the above behavior is expected. Running with serialize_overlap solves the issue; however, my understanding is that verify_backlog should work without specifying serialize_overlap.