axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Fairness of recv_multishot between multiple connections #1043

Closed · mjasny closed this issue 6 months ago

mjasny commented 7 months ago

Hi,

I've written a simple io_uring-based TCP server that just receives data from multiple clients using recv_multishot and a buf_ring. With 3 clients, the server handles them fairly: the bandwidth is split roughly equally between them. However, when more clients connect, the server accepts them successfully and registers a multishot recv, but from then on it receives no data from them. The new clients appear to stall with an open connection. Over time this does not recover, and the new clients never get a share of the available bandwidth; they stay at 0 bytes/sec. The only messages they exchange with the server machine are TCP ZeroWindow and Keep-Alive packets (see the attached pcap file). My expectation would be that the available bandwidth is redistributed equally across all connected clients, as was the case for the first 3 clients.

I'm using liburing from master and Linux kernel 6.1. On 3 AWS m5dn.2 instances with 100Gbit networking, the problem already occurs with the 4th connecting client. With the settings from the attached source code, the bandwidth tops out at around 2GiB/s and is neither message- nor bandwidth-bound.

Clients are simulated with: cat /dev/zero | pv | nc 10.0.1.71 4444
Server: iouring_multishot_recv.c

I can provide additional details if needed.

Attachments: iouring_multishot_recv.c.txt, tcp_stall.pcap.txt (remove the .txt ending)

axboe commented 7 months ago

Interesting, we've tested hundreds of thousands of connections and haven't seen anything like that. I'm waiting on my test box to become available later today, then I'll give your reproducer a spin. Just out of curiosity, does it happen if you don't use SQPOLL, e.g. if you set up the ring with IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN instead?

mjasny commented 7 months ago

In case you have a single machine: I could not trigger this behavior using only the loopback interface on my machine, it just happens when I use 100G+ links between separate physical machines.

Without SQPOLL and with io_uring_wait_cqe in the main loop, I get "No buffer space available" (ENOBUFS) when the 5th client connects, even though the buffers are pushed straight back into the buf_ring. A quick hack using a separate buf_ring per connection also resulted in "No buffer space available".

Tomorrow morning I'll dig deeper and try to reproduce the behavior from above with SQPOLL disabled.

axboe commented 7 months ago

I am using two machines. My initial suspicion is that our internal multishot retry just keeps firing as data floods in, which I bet can cause an imbalance. I'll know once I get the other test box back up.

Your example doesn't really work without setups that run task_work heavily and regularly, as you never wait for a CQE. It just assumes that CQEs will be posted, and hence it busy-loops searching for new ones. Not hard to fix, just saying it won't work as-is because of this busy looping.

I'll come back with more details later today.

axboe commented 7 months ago

OK, the test box is up, so I ran some testing, and I think it is indeed because of the internal multishot retries. I have a patch that works for me. My setup is pretty basic, just a 10Gbit link, and I run 8 clients from another box. Before the patch I saw discrepancies between the per-client throughput; it seems fine now. Running right now, I'm getting exactly 140MiB/sec from each client.

You're on 6.1; are you able to build a kernel for testing? I can prep the patch for 6.1, or you can jump to something newer. Let me know what is easiest for you!

mjasny commented 7 months ago

I verified that I can build and boot the newest Linux kernel from the master branch; testing against that version would be fine for me. Can I use the patches directly from the mailing list, or are intermediate patches necessary? https://lore.kernel.org/io-uring/20240129203025.3214152-1-axboe@kernel.dk/T/#t

mjasny commented 7 months ago

Thank you for your help. I tested your patch from the list above. The throughput is now shared evenly between the clients, even when they are located on separate physical machines.

However, when I start a 5th client (id: 9), as shown in the screenshots below, the server does not receive anything from it until I terminate one of the other clients. The packet trace of the stalling client is the same as in my initial report; this part unfortunately did not improve. I'm still using the same code for the io_uring server.

I would be happy to further debug this issue with your help :)

Screenshot: 4 clients running, each getting an equal 500MiB/s; the 5th client gets nothing.

Screenshot: after disconnecting one client, the 5th client starts sending data.

axboe commented 7 months ago

I'm digging into this; just a heads-up that I'm still working to fully understand what's going on here. I spent most of today on it, and I'm a little wiser, but I don't understand the underlying problem just yet. I'll be back!

axboe commented 7 months ago

Just for kicks, can you try and pull in:

git://git.kernel.dk/linux for-6.9/io_uring

into your current branch, and see if that improves the situation for you?

mjasny commented 7 months ago

I pulled the most recent commit (ab2162895e46ad9dd656257fbdde67ee2f8df3e7) of the for-6.9/io_uring branch. Unfortunately that did not improve the situation, I see exactly the same behavior.

Also, I can provide you with access to my AWS testing instances if that would be helpful.

axboe commented 7 months ago

This is still using SQPOLL when setting up the ring, right?

> Also, I can provide you with access to my AWS testing instances if that would be helpful.

That might be useful indeed. I've been running on local test boxes and I do see an imbalance if the client is flooded with bandwidth, but if we're packet limited instead then it seems balanced. That's going up to 64 clients, haven't done more than that.

mjasny commented 7 months ago

Yes, IORING_SETUP_SQPOLL and IORING_SETUP_SINGLE_ISSUER are set.

I've sent you a private mail to exchange SSH-keys and IP addresses.

axboe commented 7 months ago

Just a heads-up that I didn't receive any emails. You can just use axboe@kernel.dk - dunno which one you used?

mjasny commented 7 months ago

That's weird; yes, I used axboe@kernel.dk and it went out. I've now resent the message from another mail account.

axboe commented 7 months ago

Got that one, thanks!