axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License
2.86k stars 402 forks source link

Unexpected EFAULT when doing sendmsg + cmsg and IOSQE_ASYNC #880

Closed majek closed 3 weeks ago

majek commented 1 year ago

Hi!

I'm playing with doing UDP networking and IOSQE_ASYNC. Here's a repro:

#include <errno.h>
#include <error.h>
#include <liburing.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

int fill_cmsg(struct msghdr *msgh, int gro_sz)
{
    struct cmsghdr *cmsg;

    int space = 0;
    /* To ensure CMSG_NXTHDR contorllen must be
     * large and the buffer must be zeroed. */
    memset(msgh->msg_control, 0, msgh->msg_controllen);

    cmsg = CMSG_FIRSTHDR(msgh);

    {
        uint16_t gro_sz_short = gro_sz;
        cmsg->cmsg_level = SOL_UDP;
        cmsg->cmsg_type = UDP_SEGMENT;
        cmsg->cmsg_len = CMSG_LEN(sizeof(gro_sz_short));
        memcpy(CMSG_DATA(cmsg), &gro_sz_short, sizeof(gro_sz_short));
        cmsg = CMSG_NXTHDR(msgh, cmsg);
        space += CMSG_SPACE(sizeof(gro_sz_short));
    }
    msgh->msg_controllen = space;
    return 1;
}

int main(int argc, char **argv)
{
    int inflight = 32;
    int sq_async = 7;
    if (argc > 1) {
        inflight = atoi(argv[1]);
    }
    if (argc > 2) {
        sq_async = atoi(argv[2]);
    }
    printf("inflight=%d sq_async=%d\n", inflight, sq_async);

    int sd = socket(AF_INET, SOCK_DGRAM, 0);
    if (sd < 0) {
        error(-1, errno, "socket(SOCK_DGRAM)");
    }

    struct sockaddr_in listen = {.sin_family = AF_INET};

    int r = bind(sd, (struct sockaddr *)&listen, sizeof(listen));
    if (r != 0) {
        error(-1, errno, "bind()");
    }

    socklen_t sz = sizeof(listen);
    r = getsockname(sd, (struct sockaddr *)&listen, &sz);
    if (r != 0) {
        error(-1, errno, "getsockname()");
    }

    int sd2 = socket(AF_INET, SOCK_DGRAM, 0);
    if (sd2 < 0) {
        error(-1, errno, "socket(SOCK_DGRAM)");
    }

    r = connect(sd2, (struct sockaddr *)&listen, sizeof(listen));
    if (r != 0) {
        error(-1, errno, "connect()");
    }

    struct io_uring ring;
    r = io_uring_queue_init(inflight * 2, &ring, 0);
    if (r != 0) {
        error(-1, errno, "io_uring_queue_init");
    }

    unsigned values[2] = {0, sq_async};
    io_uring_register_iowq_max_workers(&ring, values);

    char ctrl[256];
    char data[65500] = "hello world\n";
    struct iovec iovec = {
        .iov_base = data,
        .iov_len = sizeof(data),
    };
    struct msghdr msghdr = {
        .msg_name = &listen,
        .msg_namelen = sizeof(listen),
        .msg_iov = &iovec,
        .msg_iovlen = 1,
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
    };
    fill_cmsg(&msghdr, 1400);

    int i;
    for (i = 0; i < inflight; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        if (sqe == NULL) {
            error(-1, errno, "io_uring_get_sqe");
        }
        io_uring_prep_sendmsg(sqe, sd2, &msghdr, 0);
        sqe->flags |= IOSQE_ASYNC;
        io_uring_sqe_set_data64(sqe, i);
    }

    while (1) {
        struct io_uring_cqe *cqe = NULL;
        int r = io_uring_submit_and_wait_timeout(&ring, &cqe, 1, NULL, NULL);
        if (r < 0) {
            error(-1, errno, "io_uring_wait_cqe");
        }

        if (cqe->res <= 0) {
            error(-1, -cqe->res, "io_uring_sendmsg");
        }
        int i = (uint64_t)cqe->user_data;
        io_uring_cqe_seen(&ring, cqe);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        if (sqe == NULL) {
            error(-1, errno, "io_uring_get_sqe");
        }
        io_uring_prep_sendmsg(sqe, sd2, &msghdr, 0);
        sqe->flags |= IOSQE_ASYNC;
        io_uring_sqe_set_data64(sqe, i);
    }
}

Run:

$ gcc -Wall -Wextra efault.c -I ./liburing/src/include/ ./liburing/src/liburing.a && ./a.out 32 7
inflight=32 sq_async=7
./a.out: io_uring_sendmsg: Bad address

What we do is set 32 sendmsg SQE's that fly towards a connected socket all the time - the socket is overfilled and packets dropped, and cap the iowq_max_workers at 7.

I think we should hear EAGAIN. This seem to be confirmed with

$ sudo ~/bin/bpftrace -e 'kretfunc:udp_sendmsg / (int32)retval < 0/ { printf("%p %d %d\n", args->sk, args->len, retval); }' 
Attaching 1 probe...
0xffff8c970a06ec00 65500 -11

-11 = -EAGAIN

However, the cqe->res contains -14 which is EFAULT.

I don't expect EFAULT here. AFAICT the sendmsg arguments are valid.

The problem only occurs when:

You might need to massage the two parameters of the test program (inflight and iowq settings), depending on your machine. I suspect that the EAGAIN from udp_sendmsg happens on full send buffer, so we need to fill the buffer faster than it's drained/dropped by network subsystem.

Kernel 6.4-rc5 liburing master branch commit d39530a3f4b52ff2fab653eb3b2314e0aedc2d92

axboe commented 1 year ago

That is definitely a bug. I took a look at it, and came up with:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-6.4&id=f7e2fa2b65127e162f6509bb73debf44e6ccc10b

Can you try and pull that patch (or branch) into your 6.4-rc kernel?

Also, in terms of attribution, I like to add a Reported-by tag to the commit so it gives credit to the person that found it. For that, I'd need your real name and email. Let me know if you want that added and provide those details. Totally up to you, I've linked this issue as well.

majek commented 1 year ago
Reported-by: Marek Majkowski <marek@cloudflare.com>
Tested-by: Marek Majkowski <marek@cloudflare.com>
majek commented 1 year ago

I've run the patch and I can see no -EFAULT anymore. I'll attempt to get a deeper understanding if it works soon. However, one thing that is immediately visible, is that in past, before the patch the ss -au showed sending socket (transmitting) with varying Snd buffer, sometimes small, sometimes large. The issue was triggered AFAIU when the snd buffer was reached (in hindsight, maybe testing TCP would be better, however, does TCP do cmsg?).

With the patch the SND buffer of the connected socket is always zero.

Just saying. I'm not fully understanding it yet.