axboe / liburing

Library providing helpers for the Linux kernel io_uring support

Buffer size and `io_uring` ring queue depth performance boost #903

Closed: rvineet02 closed this issue 3 weeks ago

rvineet02 commented 1 year ago

Hi, I'm hoping to take advantage of io_uring to improve throughput for a network I/O application. I am running the applications on Ubuntu, kernel version 5.15.0-60-generic.

A gist with my client and server implementations can be found here.

When running with 128 threads, I am able to saturate the network and maximize throughput. However, I would like to reach that throughput with fewer threads, so that I can run the application on less beefy machines.

To achieve this, I reasoned that increasing the buffer size and/or the entries value in io_uring_queue_init should increase the number of bytes sent across the network. But this is not the case: the throughput stays the same whether I vary the buffer size or the ring's queue depth (doubling each up to 64k).

I was wondering if there is some issue in my client/server io_uring implementation.

Please let me know if you would like the profiling output from perf.
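
For reference, a minimal sketch of the kind of setup being varied in these experiments; this is illustrative only (the gist differs in detail), with QUEUE_DEPTH and BUFFER_SIZE standing in for the values being swept:

    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>

    #define QUEUE_DEPTH 2048   // "entries" passed to io_uring_queue_init
    #define BUFFER_SIZE 16384  // per-request buffer size

    // Initialize a ring with the requested queue depth; returns 0 on success,
    // a negative errno value on failure.
    static int setup_ring(struct io_uring *ring)
    {
            int ret = io_uring_queue_init(QUEUE_DEPTH, ring, 0);
            if (ret < 0) {
                    fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
                    return ret;
            }
            return 0;
    }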

ammarfaizi2 commented 1 year ago

In order to achieve this, I reasoned that increasing the buffer size and/or the entries value in io_uring_queue_init should increase the number of bytes being sent across the network.

I think I spotted a mistake in the recv part:

    #define BUFFER_SIZE 16384

    // create a buffer
    char buffer[BUFFER_SIZE];
    struct iovec iov = {
            .iov_base = buffer,
            .iov_len = sizeof(buffer)};

    // prepare the readv operation
    io_uring_prep_recv(sqe, client_sock, &iov, 1, 0);

No matter how big your BUFFER_SIZE is, it will only read 1 byte from the socket. recv() is not readv(). The fourth argument here is the number of bytes you're willing to read from the socket, not the number of iovecs in the array.

Your comment indicates that you want to use readv(), but the code actually uses recv().

If you use io_uring_prep_recv(), it should look like this instead (no struct iovec):

    #define BUFFER_SIZE 16384

    // create a buffer
    char buffer[BUFFER_SIZE];

    // prepare the recv operation
    io_uring_prep_recv(sqe, client_sock, buffer, BUFFER_SIZE, 0);

Also note: you don't have to use readv() for reading from the socket; just use recv(). recv() performs better for socket operations in io_uring because it's specialized for sockets. The same goes for send() vs. writev().
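
As a minimal sketch of what that recv path looks like end to end with the standard one-shot liburing flow (client_sock, BUFFER_SIZE, and the error handling here are illustrative, not taken from the gist):

    #include <errno.h>
    #include <liburing.h>

    #define BUFFER_SIZE 16384

    // Submit one recv and wait for its completion; returns the number of
    // bytes received (cqe->res) or a negative errno value.
    static int recv_once(struct io_uring *ring, int client_sock, char *buffer)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;
            int ret;

            if (!sqe)
                    return -EBUSY;  // SQ ring full; submit pending entries and retry

            // The buffer pointer and length go directly into the SQE; no struct iovec.
            io_uring_prep_recv(sqe, client_sock, buffer, BUFFER_SIZE, 0);

            ret = io_uring_submit(ring);
            if (ret < 0)
                    return ret;

            ret = io_uring_wait_cqe(ring, &cqe);
            if (ret < 0)
                    return ret;

            ret = cqe->res;  // bytes received, or -errno if the recv failed
            io_uring_cqe_seen(ring, cqe);
            return ret;
    }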

rvineet02 commented 1 year ago

Thanks for the catch 👍

EDIT: would it make sense to do the same for the client as well? Something like this:

io_uring_prep_writev(sqe, sock, &iov, some_val, 0);

versus what I have currently:

io_uring_prep_writev(sqe, sock, &iov, 1, 0);

In effect, does increasing the number of vecs improve performance?

rvineet02 commented 1 year ago

sorry for the close/re-open - clicked it by accident

ammarfaizi2 commented 1 year ago

Thanks for the catch +1

EDIT: would it make sense to do the same for the client as well? Something like this:

io_uring_prep_writev(sqe, sock, &iov, BUFFER_SIZE, 0);

versus what I have currently:

io_uring_prep_writev(sqe, sock, &iov, 1, 0);

What you currently have with writev() is OK, but I would suggest using send(), so it would look like this:

    io_uring_prep_send(sqe, sock, buffer, BUFFER_SIZE, 0);

You can remove your struct iovec that way.
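
A hedged sketch of what the client-side send path could look like with short-send handling; sock, buffer, and len are placeholders, and cqe->res reports how many bytes were actually sent, so anything short gets resubmitted:

    #include <errno.h>
    #include <liburing.h>
    #include <stddef.h>

    // Send the whole buffer, resubmitting on short sends; returns 0 on
    // success or a negative errno value.
    static int send_all(struct io_uring *ring, int sock, const char *buffer, size_t len)
    {
            size_t sent = 0;

            while (sent < len) {
                    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                    struct io_uring_cqe *cqe;
                    int ret;

                    if (!sqe)
                            return -EBUSY;  // SQ ring full

                    // Buffer and remaining length directly; no struct iovec.
                    io_uring_prep_send(sqe, sock, buffer + sent, len - sent, 0);

                    ret = io_uring_submit(ring);
                    if (ret < 0)
                            return ret;

                    ret = io_uring_wait_cqe(ring, &cqe);
                    if (ret < 0)
                            return ret;

                    ret = cqe->res;  // bytes sent, or -errno
                    io_uring_cqe_seen(ring, cqe);
                    if (ret < 0)
                            return ret;
                    sent += (size_t)ret;
            }
            return 0;
    }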

rvineet02 commented 1 year ago

When I modify the server to use the BUFFER_SIZE parameter instead, I get a segfault.

I updated the code in the gist. I am getting a segfault when attempting to wait on a completion event.

I'm getting a null-pointer dereference at this line:

        ret = io_uring_wait_cqe(&ring, &cqe);

The same happens when I try to use peek as well.
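
For what it's worth, a hedged debugging sketch (the names just mirror the snippet above): checking the return value of io_uring_wait_cqe() before touching the CQE, and checking cqe->res afterwards, helps tell a failed wait apart from memory corruption elsewhere:

    // Assumes struct io_uring ring was initialized successfully (check the
    // return value of io_uring_queue_init as well).
    struct io_uring_cqe *cqe = NULL;

    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
            // cqe is not valid here and must not be dereferenced
            fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
            return 1;
    }
    if (cqe->res < 0)
            fprintf(stderr, "request failed: %s\n", strerror(-cqe->res));
    io_uring_cqe_seen(&ring, cqe);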

axboe commented 1 year ago

You probably have a bad install with a mix of distro liburing packages and headers and ones you installed from source yourself. Clean that out and stick with one version where the headers and library match.

axboe commented 1 year ago

From a quick glance at your gist, you're also still using an iov with recv. It takes the buffer and length, not an iovec. This is probably your crash as well, as you're going to be corrupting your stack when the receive happens.

rvineet02 commented 1 year ago

Yup, that was the issue, my bad. But even after making these changes on both the client and server, I'm still seeing roughly the same throughput when increasing the buffer size.

At this point, could it be the network that is the bottleneck?

alviroiskandar commented 1 year ago

At this point, could it be the network that is the bottleneck?

Your information severely lacks detail. It is evident that you failed to adhere to the given advice regarding the usage of send() and recv(). Surprisingly, you continue to utilize writev() despite the guidance provided. Additionally, your expectation remains unclear.

I insist that you promptly present us with concrete numerical data for the purpose of comparison. And also, provide a functional test code that actually works, once you have made the necessary fixes.

rvineet02 commented 1 year ago

Hi, sorry about the lack of details. I have updated the gist.

I am running all experiments with the following defaults: 1 thread, 2048 ring depth.

Running on Chameleon Cloud, I am able to see an increase in throughput when increasing the buffer size:

$ ./src/client2 -t 1 -q 2048 -b 1024
Total requests sent: 9435589
MB Sent: 9214
Throughput in MB/s: 614

$ ./src/client2 -t 1 -q 2048 -b 2048
Total requests sent: 6137781
MB Sent: 11986
Throughput in MB/s: 799

$ ./src/client2 -t 1 -q 2048 -b 4096
Total requests sent: 3456695
MB Sent: 13497
Throughput in MB/s: 899

$ ./src/client2 -t 1 -q 2048 -b 8192
Total requests sent: 1752411
MB Sent: 13678
Throughput in MB/s: 911

$ ./src/client2 -t 1 -q 2048 -b 16384
Total requests sent: 1008269
MB Sent: 15715
Throughput in MB/s: 1047

$ ./src/client2 -t 1 -q 2048 -b 32768
Total requests sent: 490733
MB Sent: 15256
Throughput in MB/s: 1017

$ ./src/client2 -t 1 -q 2048 -b 65536
Total requests sent: 244174
MB Sent: 15174
Throughput in MB/s: 1047

Running the same experiment on AWS, I get:


$ ./src/client2 -t 1 -q 2048 -b 1024
Total requests sent: 17545260
MB Sent: 17127
Throughput in MB/s: 570

$ ./src/client2 -t 1 -q 2048 -b 2048
Total requests sent: 8778865
MB Sent: 17128
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 4096
Total requests sent: 4389254
MB Sent: 17128
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 8192
Total requests sent: 2199115
MB Sent: 17128
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 16384
Total requests sent: 1102643
MB Sent: 17129
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 32768
Total requests sent: 551337
MB Sent: 17130
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 65536
Total requests sent: 275687
MB Sent: 17131
Throughput in MB/s: 571

What could be the reason that even the baseline throughput is much lower on AWS, and why does the buffer size not affect throughput in this case?

Using iperf3, the measured network bitrates are ~4.69 Gbps on AWS and ~1.81 Gbps on Chameleon Cloud.

ryanseipp commented 1 year ago

Given the iperf numbers, it appears you've roughly saturated the network in the AWS case, using just one thread instead of 128. 571 MB/s = 4.56 Gbps.