axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License
2.89k stars 407 forks source link

Operation with `IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS` does not complete on ring with `IORING_SETUP_DEFER_TASKRUN` #1235

Closed Cydox closed 2 months ago

Cydox commented 2 months ago

If I do two send operations on a socket followed by a close on the file descriptor linked by IOSQE_IO_LINK the close operation never happens. All SQEs have IOSQE_CQE_SKIP_SUCCESS set. The ring is created with IORING_SETUP_DEFER_TASKRUN.

Already did a git bisect on the kernel and this is introduced in commit 846072f16eed3b3fb4e59b677f3ed8afb8509b89 https://github.com/torvalds/linux/commit/846072f16eed3b3fb4e59b677f3ed8afb8509b89

Is it better to report this on the kernel mailing list, as it's not really an issue with liburing? I'm fine with using either the mailing list or github.

Interesting observations:

  1. If I only do one send followed by the close, it works.
  2. If I remove either IOSQE_IO_LINK or IOSQE_CQE_SKIP_SUCCESS on the close, it works.
  3. If I use IORING_SETUP_COOP_TASKRUN instead of defer, it works.
  4. If I add a third send operation that message get's sent (close still doesn't work).

Reproducer written in Zig (let me know if you want a reproducer in C):

const std = @import("std");

const linux = std.os.linux;
const posix = std.posix;
const sqe_t = std.os.linux.io_uring_sqe;
const cqe_t = std.os.linux.io_uring_cqe;

const event_type = enum {
    ACCEPT,
    READ,
    WRITE,
    CLOSE,
};

pub fn main() !void {
    const stdout = std.io.getStdOut();
    _ = stdout;
    std.debug.print("hello from normal syscall\n", .{});

    const peer = try std.net.Address.parseIp4("0.0.0.0", 9999);
    const sock = try std.posix.socket(posix.AF.INET, posix.SOCK.STREAM, 0);
    try std.posix.setsockopt(sock, posix.SOL.SOCKET, posix.SO.REUSEADDR, &std.mem.toBytes(@as(c_int, 1)));
    try std.posix.setsockopt(sock, 6, posix.TCP.NODELAY, &std.mem.toBytes(@as(c_int, 1)));

    try std.posix.bind(sock, &(peer.any), peer.getOsSockLen());
    defer std.posix.close(sock);
    try std.posix.listen(sock, 1);

    const ring_buf_size = @as(u64, 1) <<| 12;

    var io_uring = try std.os.linux.IoUring.init(
        @intCast(ring_buf_size),
        0 |
            linux.IORING_SETUP_DEFER_TASKRUN |
            linux.IORING_SETUP_SINGLE_ISSUER,
    );
    defer io_uring.deinit();

    _ = try io_uring.accept_multishot(@intFromEnum(event_type.ACCEPT), sock, null, null, 0);

    while (true) {
        _ = try io_uring.submit_and_wait(1);

        var cqes: [ring_buf_size]cqe_t = undefined;
        const n = try io_uring.copy_cqes(&cqes, 0);
        //std.debug.print("n: {}\n", .{n});

        for (cqes[0..n]) |c| {
            switch (@as(event_type, @enumFromInt(c.user_data))) {
                event_type.ACCEPT => {
                    var sqe = try io_uring.send(@intFromEnum(event_type.WRITE), c.res, "hello from server 1\n", 0);
                    sqe.flags |= linux.IOSQE_CQE_SKIP_SUCCESS;

                    sqe = try io_uring.send(@intFromEnum(event_type.WRITE), c.res, "hello from server 2\n", 0);
                    sqe.flags |= linux.IOSQE_IO_LINK;
                    sqe.flags |= linux.IOSQE_CQE_SKIP_SUCCESS;

                    sqe = try io_uring.close(@intFromEnum(event_type.CLOSE), c.res);
                    sqe.flags |= linux.IOSQE_IO_LINK;
                    sqe.flags |= linux.IOSQE_CQE_SKIP_SUCCESS;
                },
                event_type.READ => {},
                event_type.WRITE => {},
                event_type.CLOSE => {},
            }
        }
    }
}

Can be built using zig 0.13:

zig build-exe main.zig

To test simply connect using telnet:

$ telnet 127.0.0.1 9999

Expected behavior is to get two messages from the server and then be disconnected. Observed behavior is getting the two messages and the connection not being closed.

axboe commented 2 months ago

Funky - that last link isn't valid to set (as there's nothing after it, hence it can't link to anything), but it should just be ignored. Not sure why it isn't, will take a look.

Cydox commented 2 months ago

Ohh yes, I'm using IOSQE_IO_LINK the wrong way around.

I'm mostly poking around the API trying to understand more about task_work and IORING_SETUP_DEFER_TASKRUN / IORING_SETUP_COOP_TASKRUN. Are there any more resources on this besides the man pages or looking at the kernel source?

axboe commented 2 months ago

Check the doc in the wiki, that has more details on DEFER_TASKRUN:

https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023

axboe commented 2 months ago

I suspect this should fix it:

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index f3570e81ecb4..75f0087183e5 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2579,9 +2579,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
         * If we got woken because of task_work being processed, run it
         * now rather than let the caller do another wait loop.
         */
-       io_run_task_work();
        if (!llist_empty(&ctx->work_llist))
            io_run_local_work(ctx, nr_wait);
+       io_run_task_work();

        /*
         * Non-local task_work will be run on exit to userspace, but

as we'll want to run task_work after having run our local task_work, in case the local task_work generates system wide task_work. Like the final put of a file, for example...

Cydox commented 2 months ago

It does indeed fix it.

axboe commented 2 months ago

Excellent, just committed it and sent it to the list. If you want a Reported-by: tag added to the commit, please reply in here with identity/name + email so I can add it. I already added the link to this bug report.

axboe commented 2 months ago

I also write a C reproducer which is standalone to test it. It's quite the odd construct to get there, I'm impressed ;-)

Cydox commented 2 months ago

Nice, really fast turn-around!

for the tag: Jan Hendrik Farr kernel@jfarr.cc

axboe commented 2 months ago

Tag added:

https://git.kernel.dk/cgit/linux/commit/?h=for-6.12/io_uring&id=23c021817fecc0759a2702a1f942463c15eda5fe

and the test case pushed out as well.

axboe commented 2 months ago

Nice, really fast turn-around!

We take bug reports seriously :-)

axboe commented 2 months ago

I'll close this one up. Fix will go into the 6.12-rc1 release, but also marked for stable backport. It should find its way into 6.6/6.10/6.11 stable as well, other kernels either not affected or are end-of-life already. Thanks for the report with a reproducer!