axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License
2.86k stars 402 forks source link

Unexpected io_uring_submit() return value after io_uring_wait_cqe_timeout()? #411

Closed EmilyPetersonM closed 3 weeks ago

EmilyPetersonM commented 3 years ago

Hello! We’re running liburing master (as of e98b974502d) on Linux 5.4.17 and are running into some odd and possibly unexpected behavior. We noticed while using io_uring_wait_cqe_timeout() that occasionally a later call to io_uring_submit() would return a count of submitted SQE’s one greater than the number we expected because we thought the return value should equal the number of SQE’s previously acquired with io_uring_get_sqe() that have not yet been submitted.

After looking deeper into the liburing code, it seems that this is happens when an expired timeout event from a timeout SQE created internally by liburing (with user_data == LIBURING_UDATA_TIMEOUT) gets handled in __io_uring_peek_cqe(), which ultimately causes the system call to io_uring_enter() to be skipped in _io_uring_get_cqe(), even when there are SQE’s that need to be submitted. This means that when calling io_uring_wait_cqe_timeout(), it is not guaranteed that the timeout SQE created by liburing will be properly submitted until the user makes a later call to io_uring_submit() when they would see the extra submission reflected in the return value.

Below is a short reproduction:

#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
        struct io_uring ring;
        int ret;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct __kernel_timespec one_nano_timeout = {
                .tv_sec = 0L,
                .tv_nsec = 1L
        };
        struct __kernel_timespec one_second_timeout = {
                .tv_sec = 1L,
                .tv_nsec = 0L
        };
        long long int user_data = 42;

        /* Create io_uring */
        ret = io_uring_queue_init(8, &ring, 0);
        if (ret < 0) {
                fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                return 1;
        }

        /* Submit no-op request that the kernel can immediately fulfill */
        sqe = io_uring_get_sqe(&ring);
        if (!sqe) {
                fprintf(stderr, "io_uring_get_sqe returned null SQE");
                return 1;
        }
        io_uring_prep_nop(sqe);
        io_uring_sqe_set_data(sqe, (void *)user_data);
        ret = io_uring_submit(&ring);
        if (ret != 1) {
                fprintf(stderr, "io_uring_submit unexpected ret: %d\n", ret);
                return 1;
        }

        /*
         * Wait with a short, one nanosecond timeout, to ensure that the
         * timeout expires prior to a second wait_cqe_timeout call below
         */
        ret = io_uring_wait_cqe_timeout(&ring, &cqe, &one_nano_timeout);
        if (ret < 0) {
                fprintf(stderr, "io_uring_wait_cqe_timeout unexpected error: "
                                "%s\n", strerror(-ret));
                return 1;
        }
        printf("Nop completion received with user_data: %lu and res: %d\n",
                        (long unsigned int)(cqe->user_data), cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        /*
         * Wait with a one second timeout, which will return immediately
         * with the expiration of the previous, one nanosecond timeout
         */
        ret = io_uring_wait_cqe_timeout(&ring, &cqe, &one_second_timeout);
        if (ret < 0) {
                printf("io_uring_wait_cqe_timeout 2: %s\n", strerror(-ret));
        }

        /*
         * Call submit, expecting a return value of zero because no call to
         * io_uring_get_sqe has been made to allocate a new SQE entry since
         * the no-op request. Instead, 1 is returned, due to the presence
         * of the one second timeout SQE created by liburing.
         */
        ret = io_uring_submit(&ring);
        printf("Final submit ret: %d\n", ret);

        /* Tear down io_uring */
        io_uring_queue_exit(&ring);

        return 0;
}
banxianxie commented 3 years ago

kernel version: Linux xie-PC 5.14.0-rc4-amd64-desktop #1 SMP Wed Aug 4 14:50:57 CST 2021 x86_64 GNU/Linux

run your reproduction code: ./issue
Nop completion received with user_data: 42 and res: 0
io_uring_wait_cqe_timeout 2: Timer expired
Final submit ret: 0 (return value is zero )

EmilyPetersonM commented 3 years ago

I don't expect this scenario to arise for Linux versions 5.11 and later because with IORING_ENTER_EXT_ARG set, liburing can pass a pointer to the timespec struct (with the timeout duration) directly into the io_uring_enter syscall to achieve a timeout. Before 5.11, to do the same thing, I believe liburing creates its own timeout operations and consumes the completion events, which can lead to this case.