axboe / fio

Flexible I/O Tester
GNU General Public License v2.0

fio crashes with "failed to unlock overlap check mutex, err: 0:success" #1807

Open NickiXsight opened 2 months ago

NickiXsight commented 2 months ago


Description of the bug: When running a large fio workload with multiple jobs, verify enabled, and serialize_overlap=1, fio aborts at some point with the error in the title.

After looking into the code I see 4 problems:

  1. All mutex error paths report errno instead of the return code of the pthread_mutex_* call, which is wrong -- the pthread mutex functions return the error code and don't set errno, at least on debian x86_64. This would explain the misleading "err: 0:success" in the title.
  2. The same error message is reported from several functions, which makes it hard to identify which one really fired.
  3. After adding a macro wrapper that logs __func__, __FILE__ and __LINE__, I figured out that td_io_queue in ioengines.c emits the message. Indeed, it unlocks the lock BEFORE it finally enqueues the request, so it can still return FIO_Q_BUSY, and then io_workqueue_fn in rate-submit.c will submit the request again. But by then the lock is already released.
  4. The issue is actually not only the lock -- when FIO_Q_BUSY is returned, the io_u is cleared from its td, so another worker can allocate that LBA. So check_overlap should be executed again before td_io_queue is called again.

Environment: debian x86_64

fio version: fio-3.37-86-g7bc1

Reproduction steps

    [write-and-verify]
    rw=randwrite
    bs=4k
    direct=1
    ioengine=libaio
    iodepth=128
    verify=crc32c
    verify_backlog=100000
    verify_dump=1
    verify_fatal=1
    verify_async=4
    serialize_overlap=1
    io_submit_mode=offload
    blocksize_range=4k-8k
    runtime=6000
    size=512m
    numjobs=10
    filename=/dev/nvme0n8:/dev/nvme0n7:/dev/nvme0n6:/dev/nvme0n5:/dev/nvme0n4:/dev/nvme0n3:/dev/nvme0n2:/dev/nvme0n1

axboe commented 2 months ago

Since you already did the full analysis, care to send a fix for this?

NickiXsight commented 2 months ago

Items 1 and 2 are really simple; I can create a PR tomorrow. But 3 and 4 require more attention -- just fixing the lock-unlock scheme is not enough. I have to create a good test that proves that an overlap conflict may indeed happen during the requeue. And then, if I push check_overlap into the requeue loop, what is the performance impact of all this? I'll open a PR soon and we can discuss it there.
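
To make the idea of re-checking overlap inside the requeue loop concrete, here is a toy, single-threaded model. It is only a sketch under assumptions: `struct toy_io`, `submit_with_recheck`, `busy_then_ok` and the `TOY_*` status values are invented for illustration and do not correspond to fio's data structures or to the eventual fix. In a real multi-threaded setting the check and the enqueue would also have to happen under the same lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an I/O unit: a byte range plus an in-flight flag. */
struct toy_io {
	unsigned long long offset;
	unsigned long long len;
	bool in_flight;
};

enum toy_status { TOY_QUEUED, TOY_BUSY };
typedef enum toy_status (*toy_queue_fn)(struct toy_io *io, void *ctx);

/* Two half-open ranges [offset, offset + len) overlap if each one starts
 * before the other ends. */
static bool ranges_overlap(const struct toy_io *a, const struct toy_io *b)
{
	return a->offset < b->offset + b->len &&
	       b->offset < a->offset + a->len;
}

/* True if io overlaps any other entry that is currently in flight. */
static bool overlaps_in_flight(const struct toy_io *io,
			       const struct toy_io *all, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (&all[i] != io && all[i].in_flight &&
		    ranges_overlap(io, &all[i]))
			return true;
	return false;
}

/*
 * Retry loop that repeats the overlap check before every queue attempt,
 * including attempts that follow a BUSY result.
 * Returns the number of attempts used on success, -1 if an overlap was
 * found, and 0 if max_attempts was exhausted.
 */
static int submit_with_recheck(struct toy_io *io, struct toy_io *all,
			       size_t n, toy_queue_fn queue, void *ctx,
			       int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (overlaps_in_flight(io, all, n))
			return -1;
		if (queue(io, ctx) == TOY_QUEUED) {
			io->in_flight = true;
			return attempt;
		}
	}
	return 0;
}

/* Example backend: reports BUSY while the counter in ctx is positive. */
static enum toy_status busy_then_ok(struct toy_io *io, void *ctx)
{
	int *remaining_busy = ctx;

	(void)io;
	if (*remaining_busy > 0) {
		(*remaining_busy)--;
		return TOY_BUSY;
	}
	return TOY_QUEUED;
}
```

The extra cost in this model is one scan of the in-flight set per retry, which is the kind of overhead the performance question above is about.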