axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Big SQE usage for IO passthrough #1271

Closed. hansenidden18 closed this 3 weeks ago

hansenidden18 commented 1 month ago

Dear Developers,

To note, I am running Linux 6.8.0 and liburing 2.5. I am trying to use big SQEs to take advantage of the 128-byte SQE and use sqe->cmd for IO passthrough. But after I initialize my ring with IORING_SETUP_SQE128 | IORING_SETUP_CQE32 and print the SQE size and the sqe->cmd size, the results are still 64 and 0, which means I cannot use sqe->cmd. Can you help? Thank you in advance!

int iouring_init(unsigned entries, struct io_uring *ring)
{
    struct io_uring_params params = { };

    params.flags = IORING_SETUP_SQE128;
    params.flags |= IORING_SETUP_CQE32;
    int ret = io_uring_queue_init_params(entries, ring, &params);
    if (ret < 0) {
        perror("io_uring_queue_init_params failed");
        return ret;
    }

    if (!(params.flags & IORING_SETUP_SQE128)) {
        fprintf(stderr, "Warning: Big SQE (128 bytes) not being used\n");
    } else {
        printf("Big SQEs (128 bytes) are in use\n");
    }

    return 0;
}

My SQE initialization is like this:

int iouring_enqueue(struct io_uring *ring, int fd,
                ComputeCmd *cmd) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        fprintf(stderr, "io_uring_get_sqe failed\n");
        return -1;
    }
    printf("SQE Size: %zu\n", sizeof(struct io_uring_sqe));

    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = fd;
    sqe->cmd_op = NVME_URING_CMD_IO;

    printf("SQE Size: %lu\n", sizeof(*sqe));
    memcpy(sqe->cmd, cmd, sizeof(ComputeCmd));

    printf("SQE cmd field size: %zu\n", sizeof(sqe->cmd));
    sqe->user_data = 1;
    return 0;
}

And the output will be

[screenshot of the output: the SQE size prints as 64 and the sqe->cmd size as 0]

And here I tried it for read/write as well, filling sqe->cmd the same way as the test inside the repo.

int iouring_enqueue(struct io_uring *ring, int fd,
        ComputeCmd *cmd, int is_write) {
    struct nvme_uring_cmd *sqe_cmd;
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        fprintf(stderr, "io_uring_get_sqe failed\n");
        return -1;
    }
    printf("SQE Size: %zu\n", sizeof(struct io_uring_sqe));
    printf("Uring CMD Size: %zu\n", sizeof(struct nvme_uring_cmd));
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = fd;
    sqe->cmd_op = NVME_URING_CMD_IO;

    printf("SQE Size: %lu\n", sizeof(*sqe));
    sqe_cmd = (struct nvme_uring_cmd *)sqe->cmd;
    memset(sqe_cmd, 0, sizeof(struct nvme_uring_cmd));

    sqe_cmd->opcode = cmd->opcode;
    sqe_cmd->nsid = 1;
    sqe_cmd->cdw10 = cmd->slba & 0xffffffff;
    sqe_cmd->cdw11 = cmd->slba >> 32;
    sqe_cmd->cdw12 = cmd->nlb;
    sqe_cmd->addr = cmd->addr;
    sqe_cmd->data_len = cmd->data_len;

    printf("SQE cmd field size: %zu\n", sizeof(sqe->cmd));
    sqe->user_data = (__u64)(uintptr_t)cmd;
    return 0;
}

the result is

[screenshot of the output]
axboe commented 1 month ago

The io_uring_sqe struct is fixed; how you set up the ring doesn't change the definition of it. That would not be possible. Same goes for the cqe.

But if you set up with IORING_SETUP_SQE128, the kernel will increment the SQ ring head by 2 for every consumed sqe. This means that you have a full io_uring_sqe worth of space BEHIND the sqe that you retrieve. Ditto for IORING_SETUP_CQE32: a single cqe will take up two slots in the CQ ring, making it twice the size of the io_uring_cqe struct itself.
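
A minimal sketch of what that means in practice (not code from this thread; prep_passthru is a hypothetical helper, assuming a ring created with IORING_SETUP_SQE128 | IORING_SETUP_CQE32 and the NVMe passthrough definitions from <linux/nvme_ioctl.h>): sizeof() still reports the fixed 64-byte struct, but each slot in the SQ ring is 128 bytes wide, so the sqe->cmd area offers about 80 usable bytes and a struct nvme_uring_cmd fits there.

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

/* Sketch only: queue one NVMe passthrough command in a big-SQE ring. */
static int prep_passthru(struct io_uring *ring, int fd,
                         const struct nvme_uring_cmd *cmd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -EAGAIN;

    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = fd;
    sqe->cmd_op = NVME_URING_CMD_IO;
    sqe->user_data = (__u64)(uintptr_t)cmd;

    /* sizeof(*sqe) is still 64 and sizeof(sqe->cmd) is 0 (flexible array),
     * but the ring slot behind this sqe provides the extra room, so copying
     * the full nvme_uring_cmd here is fine. */
    memcpy(sqe->cmd, cmd, sizeof(*cmd));
    return 0;
}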

hansenidden18 commented 1 month ago

Thank you for the kind response. I have a follow-up question. When I use io_uring, which is asynchronous, and compare it with ioctl, the results do not differ much. Is it because I only issue 1 ioctl and compare it with 1 io_uring passthrough?

axboe commented 1 month ago

If you don't build up some parallelism on the device side with io_uring, then you're just trading an odd ioctl for a submit+wait operation - the device side will be the same. So yes, to reap any benefits here (outside of using features like fixed/registered buffers), you'll also want to be driving more than a single IO at a time.
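
As a rough illustration of driving more than one IO at a time (not from this thread; it reuses the hypothetical prep_passthru helper sketched above), the idea is to queue a batch of passthrough commands, submit them with a single syscall, and only then reap the completions:

#include <stdio.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

/* From the earlier sketch. */
static int prep_passthru(struct io_uring *ring, int fd,
                         const struct nvme_uring_cmd *cmd);

/* Sketch only: keep several passthrough commands in flight at once. */
static int run_batch(struct io_uring *ring, int fd,
                     struct nvme_uring_cmd *cmds, unsigned nr)
{
    struct io_uring_cqe *cqe;
    unsigned i, queued = 0;
    int ret;

    /* queue as many commands as the SQ ring has room for */
    for (i = 0; i < nr; i++) {
        if (prep_passthru(ring, fd, &cmds[i]))
            break;
        queued++;
    }

    /* one io_uring_enter submits everything and waits for the completions */
    ret = io_uring_submit_and_wait(ring, queued);
    if (ret < 0)
        return ret;

    for (i = 0; i < queued; i++) {
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        if (cqe->res < 0)
            fprintf(stderr, "cmd %u failed: %d\n", i, cqe->res);
        io_uring_cqe_seen(ring, cqe);
    }
    return 0;
}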

hansenidden18 commented 1 month ago

But one of the slides from Linux Plumbers shows that a single job can outperform the ioctl itself. I tested it with QD 1-128, but the results I got were not as good as in the presentation. Do you know why? Thank you.

[screenshot of the presentation slide]
axboe commented 1 month ago

I'm not saying you need multiple jobs, I'm saying you need parallelism to get higher performance. If you do io_uring passthrough and use registered buffers, then even a single pending IO should be faster than the ioctl approach. Or if you have more pending, that would do it too, as there's no way to do that with the ioctl. And obviously a combination of both would be even better.

I can't say why your test doesn't perform; you're most likely missing one (or both) of those points.
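
For the registered-buffer side, a hedged sketch of what that combination might look like (register_io_buffer and prep_passthru_fixed are hypothetical names, not from this thread): io_uring_register_buffers() is standard liburing, while pairing IORING_URING_CMD_FIXED with sqe->buf_index for passthrough commands is my reading of the kernel uapi and worth verifying against your kernel and NVMe driver version.

#include <string.h>
#include <sys/uio.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

/* Sketch only: register one IO buffer up front so it stays pinned,
 * avoiding per-IO page pinning. */
static int register_io_buffer(struct io_uring *ring, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    return io_uring_register_buffers(ring, &iov, 1);
}

/* Sketch only: a passthrough sqe that references the registered buffer. */
static void prep_passthru_fixed(struct io_uring_sqe *sqe, int fd,
                                const struct nvme_uring_cmd *cmd)
{
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = fd;
    sqe->cmd_op = NVME_URING_CMD_IO;
    sqe->uring_cmd_flags = IORING_URING_CMD_FIXED; /* use a registered buffer */
    sqe->buf_index = 0;                            /* index from registration */
    memcpy(sqe->cmd, cmd, sizeof(*cmd));
}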