axboe / liburing

Library providing helpers for the Linux kernel io_uring support

Question: will io_uring block in this scenario in the same way as libaio? #1184

Open · travisdowns opened this issue 1 month ago

travisdowns commented 1 month ago

Currently, libaio-based IO may block in io_submit(2), even if RWF_NOWAIT is passed, when we hit the nr_requests limit on requests enqueued in the block layer. This is a death sentence for async thread-per-core architectures, where if the only thread blocks, all work stops.

Will io_uring block in a similar way in io_uring_enter? I.e., does it try to do the same type of IO setup that io_submit does, synchronously within the system call, relying on NOWAIT to ensure it doesn't block? Or does it always punt to some kernel thread to make those calls?

Ref: https://lore.kernel.org/linux-block/20190724022309.GZ7777@dread.disaster.area/
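For concreteness, here is a minimal sketch of the libaio-style submission being asked about, written against the raw io_submit(2) ABI (where the RWF_NOWAIT flag field is easy to see; libaio's struct iocb layout varies by version). The file name, block size, and alignment are illustrative and error handling is elided:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <linux/fs.h>        /* RWF_NOWAIT */
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    aio_context_t ctx = 0;
    syscall(SYS_io_setup, 128, &ctx);                 /* set up an aio context */

    int fd = open("testfile", O_WRONLY | O_DIRECT);   /* illustrative path */
    static char buf[4096] __attribute__((aligned(4096)));

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_buf = (unsigned long)buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;
    cb.aio_rw_flags = RWF_NOWAIT;                     /* ask the kernel not to block */

    struct iocb *cbs[1] = { &cb };
    /* Even with RWF_NOWAIT set, this call can sleep once the block layer's
     * nr_requests limit is exhausted; that is the blocking described above. */
    long ret = syscall(SYS_io_submit, ctx, 1, cbs);
    printf("io_submit returned %ld\n", ret);

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return 0;
}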

axboe commented 1 month ago

It will not block, io_uring is actually async, aio is just not.

travisdowns commented 1 month ago

It will not block, io_uring is actually async, aio is just not.

Much thanks for the speedy reply! What are the mechanics of that? Doesn't io_uring rely on the same NOWAIT-type infrastructure to ask the file system not to wait? Or is that already happening in an async context rather than inline with io_uring_enter?

Basically it looks to me like, on the io_uring_enter path, we try to submit the sqe "inline" (by inline I mean in the context of the system call, so if it blocks, the user application will see the syscall block), and then in io_write (to take one example of a submission function) we do an aio-style IOCB_NOWAIT write call (i.e., now sharing the same code path as if you were using aio and io_submit(2) instead), which is known to block on XFS (for example) since XFS does not pass IOCB_NOWAIT on to the block layer.
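For reference, a minimal liburing sketch of that submission path; io_uring_submit() below ends up in io_uring_enter(2), which is where the inline issue attempt happens. The file name, block size, and queue depth are illustrative and error handling is elided:

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);              /* illustrative queue depth */

    int fd = open("testfile", O_WRONLY | O_DIRECT); /* illustrative path */
    void *buf;
    posix_memalign(&buf, 4096, 4096);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, 4096, 0);

    /* io_uring_submit() calls io_uring_enter(2); the sqe is first issued
     * "inline" in the context of this system call, which is where the
     * stack trace further down shows the thread going to sleep. */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("write completed, res=%d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}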

travisdowns commented 1 month ago

I decided to try it, here's an example of the blocking I'm thinking about:

fio 3642736 [007] 359323.757309:       sched:sched_switch: prev_comm=fio prev_pid=3642736 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
        ffffffff8193f5d4 __traceiter_sched_switch+0x44 ([kernel.kallsyms])
        ffffffff8193f5d4 __traceiter_sched_switch+0x44 ([kernel.kallsyms])
        ffffffff82938d23 __schedule+0x363 ([kernel.kallsyms])
        ffffffff82939183 schedule+0x63 ([kernel.kallsyms])
        ffffffff829392c6 io_schedule+0x46 ([kernel.kallsyms])
        ffffffff81f8fd27 blk_mq_get_tag+0x117 ([kernel.kallsyms])
        ffffffff81f89bc0 __blk_mq_alloc_requests+0x200 ([kernel.kallsyms])
        ffffffff81f8c4fd blk_mq_submit_bio+0x1bd ([kernel.kallsyms])
        ffffffff81f797e3 __submit_bio+0xb3 ([kernel.kallsyms])
        ffffffff81f79f0c submit_bio_noacct_nocheck+0x13c ([kernel.kallsyms])
        ffffffff81f7a14c submit_bio_noacct+0x17c ([kernel.kallsyms])
        ffffffff81f7a66c submit_bio+0x6c ([kernel.kallsyms])
        ffffffff81d5e834 iomap_dio_submit_bio+0x84 ([kernel.kallsyms])
        ffffffff81d5eeb4 iomap_dio_bio_iter+0x2d4 ([kernel.kallsyms])
        ffffffff81d5f448 __iomap_dio_rw+0x398 ([kernel.kallsyms])
        ffffffff81d5fa21 iomap_dio_rw+0x11 ([kernel.kallsyms])
        ffffffffc0e3cf8f xfs_file_dio_write_aligned+0x9f ([kernel.kallsyms])
        ffffffffc0e3db43 xfs_file_write_iter+0x113 ([kernel.kallsyms])
        ffffffff81fed83e io_write+0x12e ([kernel.kallsyms])
        ffffffff81fd9ce5 io_issue_sqe+0x65 ([kernel.kallsyms])
        ffffffff81fda618 io_submit_sqes+0x128 ([kernel.kallsyms])
        ffffffff81fdabfb __do_sys_io_uring_enter+0x2fb ([kernel.kallsyms])
        ffffffff81fdadd2 __x64_sys_io_uring_enter+0x22 ([kernel.kallsyms])
        ffffffff81805673 x64_sys_call+0x1963 ([kernel.kallsyms])
        ffffffff82921ff5 do_syscall_64+0x55 ([kernel.kallsyms])
        ffffffff82a000eb entry_SYSCALL_64_after_hwframe+0x73 ([kernel.kallsyms])

This is on a fast SSD, so I set nr_requests down to a low value to trigger it, and use the following fio workload:

[file1]
name=fio-seq-write
time_based
rw=randwrite
bs=4K
direct=1
numjobs=1
runtime=1s
size=1GB
ioengine=io_uring
iodepth=500

io_uring_enter tries to do the "inline issue" (not sure of the correct term) of the sqe, but XFS does not pass NOWAIT through to the block layer, which promptly blocks while trying to allocate a request in the blk-mq layer, and the application thread blocks with it.
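As an aside, one way to experiment with the "always punt to a kernel thread" alternative from the original question is the IOSQE_ASYNC sqe flag, which asks io_uring to skip the inline non-blocking attempt and hand the sqe to an io-wq worker from the start, so any blocking happens off the application thread. A sketch with illustrative names (it trades the blocking for extra context switches, so it is not a general recommendation):

#include <sys/types.h>
#include <liburing.h>

/* Sketch: submit a write with IOSQE_ASYNC so the issue happens from an
 * io-wq worker instead of inline in io_uring_enter(2). */
static int submit_write_punted(struct io_uring *ring, int fd,
                               const void *buf, unsigned len, off_t off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -1;                                  /* submission queue full */
    io_uring_prep_write(sqe, fd, buf, len, off);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);       /* punt to io-wq from the start */
    return io_uring_submit(ring);
}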

axboe commented 1 month ago

RQF_NOWAIT is the mechanism. If something is blocking on other IO off the submission path, then that piece is buggy. In this case that looks like XFS is. I can take a look at XFS - what kernel version are you using?

axboe commented 1 month ago

XFS needs something like the below. Not really tested, may be daemons lurking... But it's certainly losing the NOWAIT flag, which is the bug here.

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..2bf24509be13 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -258,10 +258,13 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
        const struct iomap *iomap, bool use_fua)
 {
-   blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
+   blk_opf_t opflags = 0;
+
+   if (dio->iocb->ki_flags & IOCB_NOWAIT)
+       opflags |= REQ_NOWAIT;

    if (!(dio->flags & IOMAP_DIO_WRITE))
-       return REQ_OP_READ;
+       return REQ_OP_READ | opflags;

    opflags |= REQ_OP_WRITE;
    if (use_fua)

travisdowns commented 1 month ago

Thanks Jens, your response is much appreciated.

RQF_NOWAIT is the mechanism. If something is blocking on other IO off the submission path, then that piece is buggy. In this case that looks like XFS is. I can take a look at XFS - what kernel version are you using?

This is 6.5, but I believe the behavior is the same all the way back to 5.x and forward to the tip.

XFS needs something like the below. Not really tested, may be daemons lurking... But it's certainly losing the NOWAIT flag, which is the bug here.

Agreed, though fs/iomap/direct-io.c is generic helper code shared by various FS, right? So anyone using that would have the same bug.

This did come up before in this thread, where the XFS maintainer mentioned that this requests-exhausted blocking case isn't interesting for Seastar users (Seastar was the example given there), but as a Seastar user I can say that we are definitely interested in this case and run into it in practice. Dave is correct that avoiding locking etc. in the XFS layer is also very important, but wrong in that it's the primary/only concern. Any blocking is problematic, as the userspace application has no way to avoid it if NOWAIT doesn't work.

axboe commented 1 month ago

This is 6.5, but I believe the behavior is the same all the way back to 5.x and forward to the tip.

Could very well be.

Agreed, though fs/iomap/direct-io.c is generic helper code shared by various FS, right? So anyone using that would have the same bug.

Not that many for DIO writes. ext4 adopted it for dio recently, btrfs still has its own iirc. I may be wrong, didn't double check.

Dave isn't completely wrong, though xfs/iomap should do the right thing here. In practice it tends not to be a huge issue, as nvme devices generally have very deep per-cpu queues, so the risk of blocking is small. But it certainly can happen if the depth is more limited, or the device is driven too hard, and we should certainly get this blocking condition fixed. Your test case is 4k writes, which are easy enough. I'd be more nervous about longer looping where we've already done a bunch of writes and now we have one that blocks and returns -EAGAIN. Hence the devil is in the details here.
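For what it's worth, on the application side that case boils down to a completion loop along these lines, treating -EAGAIN as "retry" and a short completion as "resubmit the remainder". This is a sketch with illustrative names; which of the two actually surfaces depends on how the kernel resolves the partial-progress case described above:

#include <errno.h>
#include <sys/types.h>
#include <liburing.h>

/* Sketch: write a buffer, resubmitting after -EAGAIN or a short completion. */
static int write_all(struct io_uring *ring, int fd,
                     const char *buf, unsigned len, off_t off)
{
    while (len) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int res;

        if (!sqe)
            return -1;
        io_uring_prep_write(sqe, fd, buf, len, off);
        io_uring_submit(ring);

        if (io_uring_wait_cqe(ring, &cqe))
            return -1;
        res = cqe->res;
        io_uring_cqe_seen(ring, cqe);

        if (res == -EAGAIN)
            continue;           /* nothing written this time, try again */
        if (res < 0)
            return res;         /* hard error */
        buf += res;             /* short write: resubmit the remaining bytes */
        len -= res;
        off += res;
    }
    return 0;
}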

Just back from 2 weeks away, so I don't have a lot of bandwidth to test this, but remind me at the end of next week if I don't get back to you before then.