axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

io_uring_prep_recv return an error #897

Closed adedel83 closed 3 weeks ago

adedel83 commented 1 year ago

Hi, I am currently trying to use io_uring to read the content of a TCP socket's ingress queue, using io_uring_prep_recv from liburing. I try to pull the queue's content by asking for the maximum possible number of bytes. However, it does not work, and I was wondering whether there is a problem with my code, or whether io_uring simply does not allow this type of operation.
In addition, when I use IOSQE_IO_DRAIN, the app gets stuck in io_uring_submit_and_wait. Here is an extract of my code (kernel 6.3):

#define MAX_QUEUE_SIZE (65535 << 14)

char recv_queue[MAX_QUEUE_SIZE] = {0};
int repaireON = 1, 
    repaireOFF = 0, 
    idrecv = TCP_RECV_QUEUE, 
    amount = 0,
    ret;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;

if (setsockopt(sock_fd_client, SOL_TCP, TCP_REPAIR, &repaireON, sizeof(int)) < 0)
{
    perror("setsockopt repaire on");
    exit(1);
}
if (setsockopt(sock_fd_client, SOL_TCP, TCP_REPAIR_QUEUE, &idrecv, sizeof(int)) < 0)
{
    perror("setsockopt repaire queue");
    exit(1);
}
sqe = io_uring_get_sqe(&ring);
if(!sqe){
    fprintf(stderr, "Error retrieving SQE\n");
    exit(1);
}
io_uring_prep_recv( sqe, 
            sock_fd_client, 
            recv_queue,
            MAX_QUEUE_SIZE, 
            MSG_PEEK | MSG_DONTWAIT);
/* set user_data after the prep helper, so it cannot be clobbered */
sqe->user_data = amount++;
io_uring_submit_and_wait(&ring, amount);
do{
    ret = io_uring_wait_cqe(&ring, &cqe);
    if(ret < 0){
        /* cqe is not valid here; liburing returns -errno directly */
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        exit(1);
    }
    if(cqe->res < 0){
        printf("CQE %llu Failed: %d\n",
               (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
        exit(1);
    }
    io_uring_cqe_seen(&ring, cqe);
    amount--;
}while(amount > 0);
if (setsockopt(sock_fd_client, SOL_TCP, TCP_REPAIR, &repaireOFF, sizeof(int)) < 0)
{
    perror("setsockopt repaire off");
    exit(1);
}
axboe commented 1 year ago

To save some time on our end, it would be great if you could supply a complete code example that can be run to demonstrate the problem.

adedel83 commented 1 year ago

Hi,

Thank you for your answer. I made a repo that reproduces the error:

When I run it, the program blocks at io_uring_submit_and_wait (file src/E.c, line 118). Does that help?

Thank you in advance.

axboe commented 1 year ago

This does look like a bug. You set DRAIN, which means that io-wq will handle the request. If it gets -EAGAIN, which it does on that first peek, then it'll arm poll and wait for data. Even if MSG_DONTWAIT is set. Without drain I suspect it'd work. In general, I would not recommend using DRAIN if it can be avoided, as it won't be as efficient. And it's generally not something I'd expect to see in networking code. Can you expand a bit on why you are using DRAIN and what it's doing for you?

That aside, it's of course a bug and it should get fixed. Can you see if the below kernel patch works for you?

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7505de2428e0..e2d5c432195d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1948,6 +1948,8 @@ void io_wq_submit_work(struct io_wq_work *work)
        ret = io_issue_sqe(req, issue_flags);
        if (ret != -EAGAIN)
            break;
+       if (req->flags & REQ_F_NOWAIT)
+           break;
        /*
         * We can get EAGAIN for iopolled IO even though we're
         * forcing a sync submission from here, since we can't
diff --git a/io_uring/net.c b/io_uring/net.c
index eb1f51ddcb23..fcbde4fce1c8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -801,7 +801,7 @@ int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
    }

    flags = sr->msg_flags;
-   if (force_nonblock)
+   if (force_nonblock || req->flags & REQ_F_NOWAIT)
        flags |= MSG_DONTWAIT;

    kmsg->msg.msg_get_inq = 1;
@@ -907,7 +907,7 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
    msg.msg_flags = 0;

    flags = sr->msg_flags;
-   if (force_nonblock)
+   if (force_nonblock || req->flags & REQ_F_NOWAIT)
        flags |= MSG_DONTWAIT;
    if (flags & MSG_WAITALL)
        min_ret = iov_iter_count(&msg.msg_iter);
axboe commented 1 year ago

Ah, I think that patch can become even simpler, as we are inheriting the MSG_DONTWAIT just fine. This should do it:

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7505de2428e0..a102c687e86f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1948,6 +1948,14 @@ void io_wq_submit_work(struct io_wq_work *work)
        ret = io_issue_sqe(req, issue_flags);
        if (ret != -EAGAIN)
            break;
+
+       /*
+        * If REQ_F_NOWAIT is set, then don't wait or retry with
+        * poll. -EAGAIN is final for that case.
+        */
+       if (req->flags & REQ_F_NOWAIT)
+           break;
+
        /*
         * We can get EAGAIN for iopolled IO even though we're
         * forcing a sync submission from here, since we can't
adedel83 commented 1 year ago

Hi,

Thank you for the patch! It works really well. You're right: if I remove the DRAIN flag, it does work. However, I think I have to use it. I'm currently working on an implementation of TCP checkpoint/restore with io_uring. I'm using a custom patch (https://lore.kernel.org/io-uring/GV1P193MB200533CC9A694C4066F4807CEA6F9@GV1P193MB2005.EURP193.PROD.OUTLOOK.COM/) that adds setsockopt/getsockopt functionality to io_uring. The goal is to make TCP checkpointing faster by using io_uring instead of plain syscalls. Right now it's slower than the syscall implementation, and I think it might be because of the DRAIN flag. In this implementation I need to perform some operations in a defined order (e.g. put the ingress queue into repair mode before receiving its content). That is why I use DRAIN. If you know another way to order the operations without the DRAIN flag, I would gladly take it.

Thanks a lot !

axboe commented 1 year ago

It depends on what you need to order against. Some requests you know will complete inline, and hence issue can never get reordered for them. Set/getsockopt requests are like that: they will ALWAYS complete directly, without being reordered. Unless you set things like ASYNC in the sqe flags for them, then all bets are off.

So if you do:

sqe1 = get_sqe(); prep_setsockopt(sqe1);

sqe2 = get_sqe(); prep_recv(sqe2);

io_uring_submit();

then sqe1 will ALWAYS issue before sqe2, and since you know that sqe1 will always complete inline, you even know that sqe1 will get issued and completed before sqe2 is even started.

So I suspect you don't need drain at all, really just depends on what requests you are issuing.

adedel83 commented 1 year ago

Hi,

After testing it without the DRAIN flag, the commands are not reordered and it is much faster. Thanks a lot for the patch and the explanation of how io_uring works!

adedel83 commented 1 year ago

Hi again,

Sorry for reopening this issue (I was not sure if I should have).

I was wondering how you know which requests will complete inline and which will not. My concern being: will an inline request always stay an inline request, or might this change in the future?

I was also wondering whether io_uring could detect that a request will complete inline when the DRAIN or LINK flag is used, and execute it directly instead of handing it to an io-wq worker, which slows the process down.

Thanks in advance,

isilence commented 1 year ago

I was wondering how you know which requests will complete inline and which will not. My concern being: will an inline request always stay an inline request, or might this change in the future?

My take would be that the kernel reserves the right to make any request asynchronous. Guessing whether a request will be executed inline is problematic with future io_uring improvements, and may break your program on future kernels.

If there is enough interest, should we have some out-of-hot-path probing? E.g. make IORING_REGISTER_PROBE return whether an opcode supports async execution or not, or even something that would also take the file type and such into account.

I was also wondering whether io_uring could detect that a request will complete inline when the DRAIN or

Please, never ever use IOSQE_IO_DRAIN. It's a huge eyesore for the kernel, and as hot paths get optimised, draining will only get slower and slower.

LINK flag is used, and execute it directly instead of handing it to an io-wq worker, which slows the process down.

Linked requests don't go to io-wq by default; they use a faster path. Though there are some optimisations to be had there, if that's interesting.