axboe / liburing

Library providing helpers for the Linux kernel io_uring support

Completion of Multishot Accept returns invalid fd without error. #1062

Closed: mjasny closed this issue 5 months ago

mjasny commented 5 months ago

Hi,

While writing an io_uring server, I came across weird behavior for completions of multishot_accept. The CQE of the second multishot accept returns a positive number for cqe->res; however, this file descriptor is invalid. (The fd is always 195 on a few machines that I tested).

This only occurs if defer taskrun is enabled and the first accepted client continuously pushes data to its socket. Then, when a second client connects after some delay, the server gets a wrong fd in the CQE of the multishot accept. When I alter the setup, e.g. switching to singleshot accept, or running multishot accept with coop taskrun or sqpoll instead, all new fds of connecting clients are valid. Also, if the second client connects immediately without a delay, so there is no load on the server yet, the fd is valid.

So, the high-level experiment consists of the following steps (a minimal configuration sketch follows the list):

1. Start a server with multishot accept and defer taskrun enabled.
2. Connect a first client, which continuously pushes data to its socket.
3. Connect a second client after a short delay.
4. Observe that the CQE for the second accept carries a positive but invalid fd (always 195 on the machines I tested).
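For illustration, the failing configuration might be set up with liburing roughly like this (a minimal sketch, not the gist's actual code; ACCEPT_TAG is a hypothetical user_data tag and error handling is trimmed):

#include <liburing.h>

#define ACCEPT_TAG 1 /* hypothetical user_data tag for the accept request */

/* Sketch of the failing configuration: a ring with deferred task_work
 * and one multishot accept armed on an already-listening socket. */
static int arm_server(struct io_uring *ring, int listen_fd)
{
	struct io_uring_params p = { 0 };
	struct io_uring_sqe *sqe;

	/* DEFER_TASKRUN requires SINGLE_ISSUER */
	p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
	if (io_uring_queue_init_params(256, ring, &p) < 0)
		return -1;

	/* one SQE keeps producing accept CQEs until canceled/terminated */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
	io_uring_sqe_set_data64(sqe, ACCEPT_TAG);
	return io_uring_submit(ring);
}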

Here is a small test program that narrows down the problem I observed. It spawns a server thread and two client threads, which connect via localhost. The code can be compiled with or without optimizations.

Affected kernels: 6.8.0-rc3, 6.7.4, 6.6.9, 6.6.1
Working kernel: 6.2.0

Link to code: https://gist.github.com/mjasny/c387bd3536d5b572c676dba6cb1c8062
Link against liburing master and run with ./fd_bug -d 1 -s 0 -m 1.

Server: mshot_accept=1, sqpoll=0, defer_tw=1
Server ready
Client 1 connected
New client: 0/6 139690254536576
Clients started
Client 2 connected
*** fd: 195 error: Bad file descriptor ***
New client: 1/195 139690254536624
multishot accept cancelled res=195
cqe ud: 139690254536624 res: -9 Bad file descriptor

The -9 Bad file descriptor error (last line) in the test comes from the recv that was issued on the bad file descriptor 195 (note the same user_data, 139690254536624).
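For reference, the pattern the test presumably follows on each accept CQE is roughly this (a sketch under that assumption, not the gist's actual code; client_ctx stands in for whatever per-client state the test tracks):

/* Sketch: cqe->res of an accept CQE is the new client fd (or a negative
 * errno). If the kernel hands back a bogus fd, nothing fails here; the
 * error only surfaces later, when the recv on that fd completes with
 * -EBADF carrying the same user_data. */
static void on_accept_cqe(struct io_uring *ring, struct io_uring_cqe *cqe,
			  void *client_ctx, char *buf, size_t len)
{
	struct io_uring_sqe *sqe;
	int client_fd = cqe->res;

	if (client_fd < 0)
		return; /* accept itself failed */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv(sqe, client_fd, buf, len, 0);
	/* pointer-valued user_data, like the large numbers in the log */
	io_uring_sqe_set_data(sqe, client_ctx);
}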

Note: When I run the example under valgrind, valgrind segfaults internally and crashes.

axboe commented 5 months ago

Thanks for the report with a reproducer - I'll take a look.

axboe commented 5 months ago

Looks like you're hitting CQE overflow when posting the multishot accept CQE, and this is triggering a bug in how multishot accept deals with an overflow condition. I didn't look too closely at your application example, but that's a lot of CQEs being posted and not handled in a timely fashion. I suspect the task_work fairness changes that I did for 6.9 will help with that. I'll check, and link the fix for the overflow condition with accept, in a bit.
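For anyone debugging something similar: both conditions are observable from userspace. A hedged sketch (io_uring_cq_has_overflow() exists in recent liburing versions; the assumption here is that the application checks these inside its completion loop):

#include <stdio.h>
#include <liburing.h>

/* Sketch: spotting the two states described above. */
static void check_cq_state(struct io_uring *ring, struct io_uring_cqe *cqe)
{
	/* the CQ ring overflowed; completions were stashed kernel-side,
	 * a sign the consumer loop is falling behind */
	if (io_uring_cq_has_overflow(ring))
		fprintf(stderr, "CQ overflow: consumer too slow\n");

	/* a multishot CQE without F_MORE means the request terminated
	 * and will post no further completions; it must be re-armed */
	if (!(cqe->flags & IORING_CQE_F_MORE))
		fprintf(stderr, "multishot terminated: re-arm needed\n");
}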

axboe commented 5 months ago

After the fix:

axboe@m2max-kvm ~> ./iouring_fd_bug -d1 -s0 -m1
Server: mshot_accept=1, sqpoll=0, defer_tw=1
Server ready
Client 1 connected
accept: new fd 6
New client: 0/6 281472829230976
Clients started
Client 2 connected
accept: new fd 8
New client: 1/8 281472829231024
multishot accept cancelled res=8

Still hitting overflow (which the example doesn't handle; you'd need to re-arm at that point). Here's the overflow fix:

diff --git a/io_uring/net.c b/io_uring/net.c
index 43bc9a5f96f9..161622029147 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1372,7 +1372,7 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 			 * has already been done
 			 */
 			if (issue_flags & IO_URING_F_MULTISHOT)
-				ret = IOU_ISSUE_SKIP_COMPLETE;
+				return IOU_ISSUE_SKIP_COMPLETE;
 			return ret;
 		}
 		if (ret == -ERESTARTSYS)
@@ -1397,7 +1397,8 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 			ret, IORING_CQE_F_MORE))
 		goto retry;
 
-	return -ECANCELED;
+	io_req_set_res(req, ret, 0);
+	return IOU_STOP_MULTISHOT;
 }
 
 int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)

axboe commented 5 months ago

I suspect your completion loop is getting tons of CQEs in one go, which is nice, but may be too much to keep up with given the flooding senders. Maybe if you get rid of the io_uring_submit() in server_handle_cqe() and switch to io_uring_submit_and_wait() rather than io_uring_wait_cqes() for the !sqpoll case, it'd behave better. You really never want to hit the overflow condition. It'll work (barring this bug, obviously...), but it'll be slower than never hitting overflow.
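The suggested loop shape, roughly (a sketch of the suggestion; handle_cqe() is a hypothetical per-CQE handler standing in for server_handle_cqe()):

#include <liburing.h>

void handle_cqe(struct io_uring *ring, struct io_uring_cqe *cqe); /* hypothetical */

/* Sketch of the suggested !sqpoll loop: one syscall both submits
 * pending SQEs and waits for completions, then the whole CQ batch is
 * drained and released in one step. */
static void event_loop(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen;

	for (;;) {
		io_uring_submit_and_wait(ring, 1);

		seen = 0;
		io_uring_for_each_cqe(ring, head, cqe) {
			handle_cqe(ring, cqe);
			seen++;
		}
		io_uring_cq_advance(ring, seen);
	}
}

Any SQEs prepared inside handle_cqe() are picked up by the next io_uring_submit_and_wait() call, so the handler itself never needs to call io_uring_submit().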

axboe commented 5 months ago

Posted the fix:

https://lore.kernel.org/io-uring/d674f450-8090-4217-9ccc-41a4342e7ef8@kernel.dk/

mjasny commented 5 months ago

Thanks for the quick fix, it works as expected! I also see that the multishot accept is disarmed with its second CQE.

Out of curiosity, I tried to connect additional clients after the multishot accept was disarmed. It looks like their connect() calls returned, and thus the sockets were opened. The third client could then also push data into its socket. From my understanding this should not have happened, because the server only sends a SYN-ACK back if accept() was called. Maybe the multishot accept is somehow still active in the background? I'll take a more detailed look later, but just wanted to leave this thought here.

axboe commented 5 months ago

Let me take a look at that one as well. Once the multishot is canceled, the fds it already returned are of course still valid, but it should not accept any further connections; it should be canceled at that point.

axboe commented 5 months ago

Isn't this expected? If you don't even queue an accept and just have the listen done, you'd be able to connect and send data. So I don't think there's anything wrong there; it'd be the same in a non-io_uring setup where you do socket/bind/listen and then connect.

axboe commented 5 months ago

In other words, if you have N clients connecting after listen(2) but before the accept is running, you'll get N immediate accept completions when it is armed. Ditto for the multishot case that gets terminated: you arm and submit a new one, and then it'll pick up the pending connections.
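As a sketch of that recovery path, run inside the CQE handler (listen_fd and ACCEPT_TAG as in the earlier sketch):

/* Sketch: when a multishot accept CQE arrives without F_MORE, arm a
 * fresh one; connections that completed the handshake into the listen
 * backlog in the meantime are accepted immediately. */
if (!(cqe->flags & IORING_CQE_F_MORE)) {
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
	io_uring_sqe_set_data64(sqe, ACCEPT_TAG);
	io_uring_submit(ring);
}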

That's for the overflow case, which you really should try and avoid!

mjasny commented 5 months ago

Yes, you are right, it is working as it should. listen() with a backlog of N lets those N clients queue in the backlog; only clients beyond that time out trying to connect. Sorry for the confusion.
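The same effect is visible without io_uring at all, under the assumption of a plain blocking-sockets setup (port number is arbitrary, error handling omitted):

#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: with a backlog of 4, roughly up to 4 clients can complete the
 * TCP handshake (and send data, buffered by the kernel) before the
 * server ever calls accept(); further connects stall and time out. */
int main(void)
{
	int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(12345), /* hypothetical port */
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};

	bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(listen_fd, 4);
	pause(); /* never accept(); clients still connect and send */
	return 0;
}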

axboe commented 5 months ago

Other fix is queued here:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-6.8&id=4f17abb6b395227fc1d5c4dbeccc53347d2c0d97

and it'll go into 6.8-rc5 and bubble back to stable. I'll close this one, thanks for the report. Nothing better than a bug report with a reproducer; it makes almost any bug trivially fixable in no time.