Great work. This test needs proper discussion; I think you should post it on HN.
What is HN?
Did this ever get posted there? I also agree someone should post it (ideally @beef9999 if they want)
I can also post it, but I don't want that to come off as "Sure, I'll take all those upvotes for your hard work." since it's like two seconds to submit a post.
@GavinRay97 I don't have an account on that forum. It's OK if you post it for me. But please wait until this weekend so I can make some modifications to the performance data and upload the full test code as well.
I could post it for you, but since I don't want to take credit, you should do it, @beef9999. It's easy to register there (user and pass only, no email needed), and it is arguably the best community of tech people. That forum is backed by the world's top startup accelerator, Y Combinator, and a lot of tech people from Google, FAANG, unicorn startups, and other big companies are there. io_uring is a big interest there too.
Not sure why it's so interesting to post on HN, honestly most of the commentary there is vitriol and not very useful. What are we trying to accomplish?
For the performance side, try and set IORING_SETUP_DEFER_TASKRUN when the ring is created. That has shown nice results for this kind of workload recently.
Here's one from this week: https://lore.kernel.org/io-uring/949fdb8e-bd12-03dc-05c6-c972f26ec0ec@samba.org/
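For anyone wanting to try that suggestion, a minimal sketch of creating the ring with that flag via liburing might look like the following (an illustration, not code from this repo; IORING_SETUP_DEFER_TASKRUN needs a 6.1+ kernel and must be combined with IORING_SETUP_SINGLE_ISSUER):

#include <liburing.h>

// Hypothetical setup helper: completions/task work are only processed when the
// submitting thread waits on the ring, which batches work nicely for a
// single-threaded, run-to-completion event loop.
int setup_ring(io_uring* ring, unsigned entries) {
    io_uring_params params{};
    params.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
    return io_uring_queue_init_params(entries, ring, &params);
}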
Not sure why it's so interesting to post on HN, honestly most of the commentary there is vitriol and not very useful. What are we trying to accomplish?
(Personally) I like to share/evangelize stuff that I think is interesting and deserves attention, or that other people might find interesting.
They seem to be pretty keen on performance stuff and io_uring in general, though there's a (rightfully so) certain rigor expected if you're going to post benchmarks.
Even if a particular topic doesn't trend well or some people post negative comments, it's nice for the folks browsing who are interested in that thing and otherwise wouldn't have known about it, IMO.
Sometimes I find posts where I have a highly positive opinion of the thing/think it's neat and nobody else does. Oh well, their loss.
That's my $0.02 at least
I'm just not a fan; most of the commentary (to me) is from folks looking to seem smart without knowing a lot of the details. In many ways, not that different from Reddit. Not useful IMHO, from the cases I've seen. Admittedly, I haven't spent a lot of time on the site; this is just my experience from the couple of times when I have.
Not sure why it's so interesting to post on HN, honestly most of the commentary there is vitriol and not very useful. What are we trying to accomplish?
(Personally) I like to share/evangelize stuff that I think is interesting and deserves attention, or that other people might find interesting.
They seem to be pretty keen on performance stuff and io_uring in general, though there's a (rightfully so) certain rigor expected if you're going to post benchmarks.
Even if a particular topic doesn't trend well or some people post negative comments, it's nice for the folks browsing who are interested in that thing and otherwise wouldn't have known about it, IMO.
Sometimes I find posts where I have a highly positive opinion of the thing/think it's neat and nobody else does. Oh well, their loss.
That's my $0.02 at least
Yeah, same reason. That community is quite interested in io_uring, sharing and discussing @axboe's tweets almost every week, and that's how I found out about io_uring too. Reddit used to be good and intellectual; now it is quite the opposite, with no real discussion going on there.
@GavinRay97 I have simplified the tests and rephrased some explanations. Please help post it if it is convenient for you.
@beef9999 I have posted it as "A performance review of io_uring vs. epoll for standard/streamed socket traffic" 👍
Hopefully some people find it interesting
This is interesting, thank you for this. I wrote an epoll echo server which multiplexes multiple clients over each thread. The idea is that each core can scale the number of clients it serves. I want to add io_uring; maybe I can learn it from this repository.
I wonder how the performance looks when multiple cores are used.
It's kind of similar to libuv; I use IO threads to handle the IO. It's incomplete, but a proof of concept.
https://github.com/samsquire/epoll-server
It is based on a multi-producer multi-consumer ring buffer by Alexander Krizhanovsky.
https://www.linuxjournal.com/content/lock-free-multi-produce...
I also wrote a userspace 1:M:N lightweight thread scheduler which should be integrated with the epoll server. This is an alternative to coroutines: I multiplex multiple lightweight threads on a kernel thread and switch between them quickly. The scheduler thread preempts hot for and while loops by setting the looping variable to the limit, so preemption occurs when the code finishes its current iteration. This is why I call it userspace preemption (a rough sketch follows the link below).
https://github.com/samsquire/preemptible-thread
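To make the preemption idea above concrete, here is a tiny illustration with plain std::thread and a shared loop bound; it is only a sketch of the mechanism described, not code from the linked repositories:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<long> idx{0};                         // hot loop's index, visible to the scheduler
constexpr long LIMIT = 2'000'000'000L;

void worker() {
    long done = 0;
    // The loop re-checks the shared index every iteration, so it always finishes
    // the current iteration before it can be stopped.
    for (; idx.load(std::memory_order_relaxed) < LIMIT; idx++)
        done++;
    printf("preempted after %ld iterations\n", done);
}

int main() {
    std::thread t(worker);
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    idx.store(LIMIT);                             // "preempt": jump the index to the limit
    t.join();
}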
One idea I have for even higher performance is to split sending and receiving into their own threads and multiplex sending and receiving across threads. This means you can scale sending and receiving.
Tried to compile this as I'm pretty convinced something is amiss with the single thread performance, but it fails for me:
Consolidate compiler generated dependencies of target photon_obj
[ 1%] Building CXX object CMakeFiles/photon_obj.dir/io/signal.cpp.o
/home/axboe/git/PhotonLibOS/io/signal.cpp:259:9: error: use of undeclared identifier 'pthread_atfork'
pthread_atfork(nullptr, nullptr, &fork_hook_child);
^
1 error generated.
make[2]: *** [CMakeFiles/photon_obj.dir/build.make:440: CMakeFiles/photon_obj.dir/io/signal.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:104: CMakeFiles/photon_obj.dir/all] Error 2
make: *** [Makefile:111: all] Error 2
I'm on Debian testing. Outside of that, I failed to find examples of how to run it. Maybe I'm just blind, but hints would be appreciated.
OK, got it going, and the examples built. signal.cpp is missing a pthread.h include.
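For anyone hitting the same error, the fix is presumably just the missing header near the top of io/signal.cpp (exact placement is a guess):

#include <pthread.h>   // declares pthread_atfork()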
Juxtaposing the performance variance between epoll and io_uring for 512 + 1 client in test 1 vs. the equivalent performance in test 2 with that usleep... my intuition is that all the test 2 data are poisoned.
I agree, it all looks very odd to me.
Got it built and running, but there are no docs on how to run with the various backends on either the client or server side. The interval thing doesn't seem to work either, it always keeps running without dumping stats until the client is killed/interrupted.
Will be happy to take a look at the perf differences, but I don't want to spend ages figuring out how to run this thing. Please provide examples, I can't find any.
The interval thing doesn't seem to work either, it always keeps running without dumping stats until the client is killed/interrupted.
It appears to be just an NGINX-like static server, which defaults to an epoll backend:
[user@MSI PhotonLibOS]$ ./build/output/server_perf
2022/11/07 05:54:32|INFO |th=0000000000B76050|/home/user/projects/PhotonLibOS/io/epoll.cpp:289|new_epoll_engine:Init event engine: epoll
2022/11/07 05:54:33|INFO |th=00007FBC8FFCEB00|/home/user/projects/PhotonLibOS/net/http/test/server_perf.cpp:44|show_qps_loop:qps: 0
2022/11/07 05:54:34|INFO |th=00007FBC8FFCEB00|/home/user/projects/PhotonLibOS/net/http/test/server_perf.cpp:44|show_qps_loop:qps: 0
2022/11/07 05:54:35|INFO |th=00007FBC8FFCEB00|/home/user/projects/PhotonLibOS/net/http/test/server_perf.cpp:44|show_qps_loop:qps: 0
I think you're meant to use something like k6/wrk2 to send an HTTP load test to the URL it's running at, which seems to be http://localhost:19876 by default. I thought it would have generated some load/throughput by itself.
It seems you are meant to run the client-perf.cpp binary alongside the server-perf.cpp one, and it will generate the HTTP requests.
I see in the docs that you can switch the epoll engine out for io_uring, but I don't seem to be able to do that.
What I've done was:
- Configure the build with -DENABLE_URING
- In server-perf.cpp's main():
// I think this tries to initialize some global event engine?
int ret = photon::init(photon::INIT_EVENT_IOURING, photon::INIT_IO_LIBAIO);
if (ret != 0) {
    LOG_ERRNO_RETURN(0, -1, "photon init failed");
}
// Replaced this with io_uring specific method
auto tcpserv = net::new_iouring_tcp_server();
// Specified `io_uring` engine for FS
auto fs = fs::new_localfs_adaptor(".", photon::fs::ioengine_iouring);
This still logs as using the epoll engine though 🙁
I also had to modify a few things to get it to build:
- #include <pthread.h> needed in one of the headers
- include_directories(${GTEST_INCLUDE_DIR} ${GMOCK_INCLUDE_DIR} ${GFLAGS_INCLUDE_DIR}): needed DIRS instead of DIR for me

Yes, C++ programs are very sensitive to the environment and platform specifics… We only tested compilation on CentOS and Ubuntu before, and didn't have the pthread header problem.
I’ll add some instructions about how to run the program with appropriate parameters.
Hi everyone, I have updated this issue and added the How to reproduce instructions.
About test 2, I deleted this line: "In order to ease the server's pressure (for it only enabled one core), I added a 10 μs sleep in the client's send/recv loop." It's not a must-do; I had just added it in my own code.
juxtaposing the performance variance between epoll and io_uring for 512 + 1 client in test 1... vs equivalent performance in test 2 with that usleep... my intuition is all the test 2 data are poisoned.
That's because the stress is high in streaming mode; a single client can almost fully occupy the server CPU (one core). So I came up with this method to reduce server stress.
Yes, C++ programs are very sensitive to the environment and platform specifics… We only tested compilation on CentOS and Ubuntu before, and didn't have the pthread header problem.
If it's any help, I am running on Fedora 37, compiling with Clang 15, and the GCC 12 toolchain (/usr/include/c++/12/).
@GavinRay97 Are you able to reproduce my data for test 1?
With the actual instructions, I gave it a test spin. From a quick look, you're doing a lot more on the io_uring side than you are on the epoll side. I made the following two-minute tweaks:
and got a 50% increase from that alone. I'm sure there's a lot more that could be done, but I'm pretty skeptical that this is an apples-to-apples epoll vs io_uring test case as it is. Other notes:
Another note - lots of receives will have cflags == 0x04 == IORING_CQE_F_SOCK_NONEMPTY, meaning that the socket still had more data after this receive? Is this really a ping-pong test, or is it just blasting data in both directions?
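For readers not familiar with that flag, this is roughly where it shows up when draining receive completions with liburing (a generic sketch, not the engine's actual code):

#include <liburing.h>
#include <cstdio>

// Assumed helper: inspect one completion from an already-initialized ring.
void reap_one(io_uring* ring) {
    io_uring_cqe* cqe = nullptr;
    if (io_uring_wait_cqe(ring, &cqe) < 0)
        return;
    if (cqe->flags & IORING_CQE_F_SOCK_NONEMPTY)
        printf("socket still had unread data after this recv\n");  // streaming-like traffic
    io_uring_cqe_seen(ring, cqe);
}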
We're also spending a ton of time in __vdso_gettimeofday() when run with io_uring, and I see nothing if using epoll. This is about ~10% of the time spent! It's coming off resume_threads().
I'm not going to spend more time on this, there are vast differences between what is being run here and I think some debugging and checking+optimizing of the io_uring side would go a long way toward improving the single thread / single connection disparity.
@axboe Thanks for your time. I'll try to answer some of your questions:
Why read on a socket? Because the Linux manual says read is identical to recv (with no flags) for sockets. I didn't know io_uring treats them differently.
How are buffers managed? Is it the same on epoll vs io_uring? They are the same: both allocated on the stack. I didn't register them with io_uring, nor with epoll either.
What are the linked timeouts doing?
They replace io_uring_submit_and_wait_timeout, because the fix for this bug (https://github.com/axboe/liburing/issues/531), which I reported before, was only merged into the latest kernel, and I hadn't had a chance to upgrade my kernel yet. io_uring_submit_and_wait_timeout was invoked in the coroutine scheduling; for now I wrote this code to replace it:
__kernel_timespec ts = get_timeout_ts();   // next wake-up time computed by the scheduler
io_uring_prep_timeout(sqe, &ts, 1, 0);     // timeout SQE: completes after 1 CQE or when ts elapses
io_uring_submit_and_wait(ring, 1);         // submit everything and wait for at least one CQE
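For reference, the call being worked around would look roughly like this (standard liburing signature; get_timeout_ts() is the same assumed helper as above):

__kernel_timespec ts = get_timeout_ts();
io_uring_cqe* cqe = nullptr;
io_uring_submit_and_wait_timeout(ring, &cqe, 1, &ts, nullptr);  // submit and wait for 1 CQE or the timeout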
Why is there a performance disparity between these two approaches?
I'd like to say something about why this test exists at all. Unlike traditional usage, if you need to pipeline the socket IO (or, technically speaking, do concurrent read/write), then an event engine is a necessary technology. You can hardly find a mature async event engine driven by io_uring in the open source world nowadays, except ours. And I think that's why people haven't hit the streaming client performance issue before.
I believe our old epoll event engine has been optimized quite well, otherwise it wouldn't be able to surpass other IO engines in performance. According to our tests, in streaming mode boost::asio only gets about 50% of our throughput. What I mean is that the upper limit is high.
Another interesting thing to mention: if I use non-blocking fd + io_uring poll + psync read/write, the performance rises to epoll's level as well. That means my io_uring event engine is proven to be capable.
Anyway, I'll keep on optimizing the io_uring code based on your notes. Thank you.
Updated on Nov. 8: I upgraded my kernel to 6.0.7 and switched back to io_uring_submit_and_wait_timeout. The timer is slow, indeed. But there is still a huge gap from 660K to epoll's 1200K. I don't think any trivial optimization would cover this gap.
Background: io_uring vs epoll
Nowadays there are many issues and projects focused on io_uring network performance, and the competitor is always epoll.
However, most of the tests are merely demos and lack verification in a production scenario. So I started to integrate io_uring sockets into our C++ coroutine library and did a full evaluation. By the way, all the coroutines run in a single OS thread, which fits the io_uring event model quite well.
Network workloads
In my opinion, there are basically two types of network workloads. Although they are generated by two different kinds of clients, a typical echo server can handle both.

Ping-Pong mode client
This is what most echo clients look like. The client continuously sends and receives requests in a loop (see the sketch below).
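As an illustration, a ping-pong client loop could look like the following (plain blocking sockets for simplicity; not the benchmark's coroutine code):

#include <sys/socket.h>
#include <cstring>

// Hypothetical client loop over an already-connected socket fd:
// each round sends one request and waits for its echo before continuing.
void ping_pong_loop(int fd) {
    char buf[512];
    memset(buf, 'a', sizeof(buf));
    for (;;) {
        if (send(fd, buf, sizeof(buf), 0) <= 0) break;   // request out
        if (recv(fd, buf, sizeof(buf), 0) <= 0) break;   // echo back, then next round
    }
}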
Streaming mode client
Streaming clients are not rare. They mean multiple channels are multiplexed over a single connection, as in RPC and HTTP 2.0. Usually there are not many clients, but the throughput can be high. One way to simulate streaming workloads is to let the send coroutine and the recv coroutine run their loops separately (a sketch follows after this paragraph).
This pattern might be a little bit extreme, but it has good simplicity. In a real scenario, multiple coroutines do ping-pong send/recv in their own loops. Because the execution contexts of the coroutines keep switching, if you observe the traffic on either side of the full-duplex socket you will see the channel filled with packets. So that scenario is basically the same as the sketch below.
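A minimal sketch of that pattern, using two plain threads instead of coroutines so it stays self-contained (not the benchmark's actual code):

#include <sys/socket.h>
#include <cstring>
#include <thread>

// Hypothetical streaming client over an already-connected socket fd:
// the send loop and the recv loop run independently, so the connection
// carries traffic in both directions at the same time.
void streaming_client(int fd) {
    std::thread sender([fd] {
        char buf[512];
        memset(buf, 'a', sizeof(buf));
        while (send(fd, buf, sizeof(buf), 0) > 0) {}     // keep pushing requests
    });
    std::thread receiver([fd] {
        char buf[512];
        while (recv(fd, buf, sizeof(buf), 0) > 0) {}     // keep draining responses
    });
    sender.join();
    receiver.join();
}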
Implementations
non-blocking fd + epoll_wait + psync read/write
Quick conclusion
There are two ways to narrow the performance gap, and an alternative to bypass the problem:
non-blocking fd + io_uring poll + psync read/write
Note that this article will NOT discuss the Ping-Pong mode, because io_uring can always surpass epoll in that situation. I just want to raise the question of why io_uring is sometimes slower in the Streaming mode.
Environment
Two VMs in a cloud environment: Intel Xeon 8369B 2.70GHz, 96 cores, 128GB RAM, 40Gb network bandwidth. CentOS 8, kernel 6.0.7-1.el8. IORING_FEAT_FAST_POLL is enabled by default.

Test 1, Echo server performance (streaming client, single connection)
Note that I only set up one client, and there is only one connection within it. The QPS is shown in the terminal. The throughput is observed with iftop.

Conclusions:
Test 2, Echo server performance (streaming client, multiple connections)
Note that I set up multiple client processes this time, one connection per client as before.
(outdated data)

Conclusions
Test 3, io_uring IO vs psync IO (with memory backend, and IO depth = 1)
In this test I just want to verify the idea that when the IO backend is in memory, the psync stack is more efficient than the io_uring stack.
I'm not providing source code here, but you can create a normal file under /dev/shm/ (tmpfs) and use io_uring to write it (with 1 concurrency). Don't do reads, because I'm not sure whether the page cache would affect performance. Eventually you will find psync is 3~4 times faster than io_uring (a sketch of this comparison follows below).
The result is easy to understand. When your data is all in memory, the psync IO stack is almost like doing memcpy. And with only 1 concurrency (IO depth = 1), the io_uring async event system certainly cannot leverage its full power.
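A rough sketch of the kind of comparison described above (my own illustration under the stated setup, assuming liburing is installed; block size and round count are arbitrary):

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <cstring>

static constexpr size_t BLOCK = 4096;
static constexpr int ROUNDS = 100000;

int main() {
    int fd = open("/dev/shm/uring_vs_psync.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    static char buf[BLOCK];
    memset(buf, 'x', sizeof(buf));

    // psync path: each write completes synchronously, essentially a memcpy into tmpfs.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < ROUNDS; i++)
        pwrite(fd, buf, BLOCK, 0);
    auto t1 = std::chrono::steady_clock::now();

    // io_uring path: strictly one SQE in flight at a time (IO depth = 1).
    io_uring ring;
    io_uring_queue_init(8, &ring, 0);
    for (int i = 0; i < ROUNDS; i++) {
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, BLOCK, 0);
        io_uring_submit(&ring);
        io_uring_cqe* cqe = nullptr;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }
    auto t2 = std::chrono::steady_clock::now();
    io_uring_queue_exit(&ring);
    close(fd);

    auto us = [](auto a, auto b) {
        return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    printf("psync:    %lld us\nio_uring: %lld us\n", us(t0, t1), us(t1, t2));
    return 0;
}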
The network buffer is similar to this situation, and for a specific fd/connection the IO depth is always 1. So perhaps when there is still free network buffer to write to, or there is still data to read, we should consider using the psync stack.
Final Conclusions
How to solve this problem?
From a user's perspective, my idea for solving this io_uring performance issue looks like the following:
The new_io_uring_read means that the kernel would still execute a FAST_POLL for this non-blocking fd, and return a cqe after the next read finishes. Because most of the time the network buffer is readable, this would leverage psync efficiency while utilizing the io_uring FAST_POLL read at the same time.
But unfortunately no kernel provides this behavior so far. I'll ask some kernel folks for help and re-test later.
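For what it's worth, the "non-blocking fd + io_uring poll + psync read/write" alternative from the Quick conclusion is a userspace approximation of this idea. A blocking sketch of it (my illustration only; a coroutine engine would yield instead of blocking on the wait) could look like this:

#include <liburing.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <poll.h>
#include <cerrno>

// Hypothetical hybrid read over a non-blocking socket fd, given an initialized ring:
// fast path is a plain recv (psync-like); only when it would block do we arm an
// io_uring poll and retry after the fd becomes readable.
ssize_t hybrid_read(io_uring* ring, int fd, void* buf, size_t len) {
    ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
        return n;                                     // data was already there (or a real error)

    io_uring_sqe* sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;
    io_uring_prep_poll_add(sqe, fd, POLLIN);          // wait for readability via io_uring
    io_uring_submit(ring);

    io_uring_cqe* cqe = nullptr;
    if (io_uring_wait_cqe(ring, &cqe) < 0)
        return -1;
    io_uring_cqe_seen(ring, cqe);

    return recv(fd, buf, len, MSG_DONTWAIT);          // readable now, read synchronously
}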
Appendix
Architecture of the coroutine-based net server
Both the io_uring server and the epoll server have a frontend and a backend. The frontend is responsible for submitting async IO (and starting the polling), then the current coroutine goes to sleep. The backend runs an event engine and wakes up the sleeping coroutine when the IO has finished.
io_uring server
epoll server
How to reproduce
Test code
The full test code is here. You are welcome to run it in your own environment.
Build
Run epoll server
Run epoll client
Run io_uring server
You will need to modify some code to switch to the io_uring server.
- Change photon::INIT_EVENT_EPOLL to photon::INIT_EVENT_IOURING
  https://github.com/alibaba/PhotonLibOS/blob/f858f0a8d7e507c4d3667f0cc7da023600f46e8f/examples/perf/net-perf.cpp#L245
- Instead of new_tcp_socket_server, use the new_iouring_tcp_server in the next line
  https://github.com/alibaba/PhotonLibOS/blob/f858f0a8d7e507c4d3667f0cc7da023600f46e8f/examples/perf/net-perf.cpp#L177-L178
Run io_uring client
You will need to modify some code to switch to io_uring client. Of course, you may still use epoll client to test againt io_uring server, in order to reduce variables.
photon::INIT_EVENT_EPOLL
tophoton::INIT_EVENT_IOURING
https://github.com/alibaba/PhotonLibOS/blob/f858f0a8d7e507c4d3667f0cc7da023600f46e8f/examples/perf/net-perf.cpp#L245
new_tcp_socket_client
tonew_iouring_tcp_client
https://github.com/alibaba/PhotonLibOS/blob/e07ce42648864528f0724b6c339d17317a4003c9/examples/perf/net-perf.cpp#L119
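Putting the two lists together, the switch amounts to something like the following pseudo-patch (identifiers taken from the links above; treat it as a sketch, not an exact diff):

// In main(), select the io_uring event engine instead of epoll
// (other arguments left as in the existing call):
//   photon::init(photon::INIT_EVENT_EPOLL,   ...);   // before
//   photon::init(photon::INIT_EVENT_IOURING, ...);   // after
//
// Server side: swap the socket server constructor
//   auto server = net::new_tcp_socket_server();      // before
//   auto server = net::new_iouring_tcp_server();     // after
//
// Client side: swap the socket client constructor
//   auto client = net::new_tcp_socket_client();      // before
//   auto client = net::new_iouring_tcp_client();     // after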
How to set up multiple clients
I just wrote a batch script to run them in the background.