frevib / io_uring-echo-server

io_uring echo server
MIT License
365 stars 55 forks

Wild results, cannot reproduce #8

Open ghost opened 4 years ago

ghost commented 4 years ago

I have tested your epoll and io_uring examples and I get 250k req/sec with your epoll example but only 220k with io_uring. I also get 250k with my own epoll implementation, which confirms that both of us are using epoll efficiently.

I'm running on Linux 5.7 Clear Linux - do you have any hints on how I can reproduce your results?

ghost commented 4 years ago

When I strace your example I get a lot of these:

io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61

Isn't it supposed to poll without any syscalls to maximize the performance advantage?

ghost commented 4 years ago

I absolutely cannot reproduce your results. I've tested on a completely different machine, a Fedora Rawhide system with Linux 5.8 and default Spectre mitigations. This machine confirms my findings from above: io_uring (this example, at least) is not faster than epoll.

I get consistently worse results with io_uring, 227k vs. 235k and the like. I have no idea how you got the 99% performance increase with io_uring that your reddit post claims; I don't even get 45%. I only get worse results, with or without Spectre mitigations. I really cannot reproduce the findings you claim.

frevib commented 4 years ago

Hi Alex,

The 99% increase was measured with an early version of io_uring and a buggy version of liburing. Besides, the performance test tool used had some bugs as well: https://github.com/frevib/io_uring-echo-server/pull/2#discussion_r376728709

So these early results were quite off.

If you use this latest echo server and have support for IORING_FEAT_FAST_POLL, you should in most cases get a performance increase: https://twitter.com/hielkedv/status/1234135064323280897?s=21

This was at the time using Ubuntu + Linux 5.6 + Jens' IORING_FEAT_FAST_POLL branch. The current state of io_uring and liburing could have changed the performance. However, IORING_FEAT_FAST_POLL saves a syscall, so in theory it should be faster.

ghost commented 4 years ago

The test I ran was reporting support for IORING_FEAT_FAST_POLL, and yes, I agree that in theory it should be faster since there are far fewer syscalls. But in practice, it's not. It even goes the wrong way: the more clients I add, the more the advantage shifts to epoll, which is the opposite of what one would expect, since epoll's syscall count grows with the number of clients (epoll_wait, (recv, send) * N).

But I don't care what the theory says. This looks similar to when everyone and their mother was telling me writev was so much faster because it was scatter/gather, yet in my testing it was slower than just copying up to 4 kB into an append buffer and writing that copy off with the old send syscall.

@axboe Do you have any benchmarks of your own in this regard?

So these early results were quite off.

If you use this latest echo server and have support for IORING_FEAT_FAST_POLL, you should in most cases get a peformance increase

I have run the latest version many times on many different machines and kernels and it does not perform better than epoll.

frevib commented 4 years ago

This echo server also uses IORING_OP_PROVIDE_BUFFERS, which causes a performance drop: https://twitter.com/hielkedv/status/1255492941960949760?s=21

You could try running it without automatic buffer selection, using your own buffer management implementation or no buffer management at all.

ghost commented 4 years ago

I don't have the time to tinker with your code right now. I was hoping to simply confirm or deny your findings using your provided code (as per anything scientific). So far I cannot confirm your findings, and you seem to hint that this isn't really possible right now.

I would like to see an example that does in fact prove the efficiency of io_uring over epoll.

frevib commented 4 years ago

The benchmarks provided are without buffer selection: https://github.com/frevib/io_uring-echo-server/blob/master/benchmarks/benchmarks.md

ghost commented 4 years ago

So what commit should I use? You didn't tag any release so I can only check out master.

Mathnerd314 commented 4 years ago

My guess is you'd use this branch: https://github.com/frevib/io_uring-echo-server/tree/io-uring-feat-fast-poll, like he linked in the benchmark description.

frevib commented 4 years ago

This could very well be the branch; it's indeed without buffer selection but with fast poll. I haven't had the time to look for the exact commit hash.

frevib commented 4 years ago

Maybe try this one: https://github.com/frevib/io_uring-echo-server/tree/b003989ecb6343b5815999527310251601531acc

This commit is right before buffer selection was implemented.

Qix- commented 4 years ago

I'm with @alexhultman on this. I simply cannot reproduce anything close to the wild, game-changing claims others have made.

I allocated a 4 vCPU machine on a Google Cloud instance. Seeing as the benchmarks were run in a VM with a macOS host, I would imagine this should not skew the benchmark so much that epoll comes out just as fast.

Note: I did test this locally on a Windows host running VMWare player with an Ubuntu guest, but I did not isolate the CPUs so I didn't consider them here. However, I got the exact same results - epoll and uring are neck and neck, all things considered.

I upgraded to mainline 5.8.12-050812-generic, set isolcpus=0,1 and told systemd to affine to 2 and 3. Rebuilt GRUB, rebooted, and ran the benchmarks using taskset -c 0|1 <cmd> <args...>.

After a few runs I got very mixed results, but they were all within the same range. Sometimes uring achieved more throughput, sometimes epoll, but never with a spread of more than about 2k requests per second.

I tried both the io-uring-feat-fast-poll branch and the commit linked in https://github.com/frevib/io_uring-echo-server/issues/8#issuecomment-699671955. Both segfault for me on startup.

Tests were run with a variety of parameters to the echo server benchmark tool. All of the spreads were roughly the same: +/- 2k req/sec, with epoll and uring alternating as the winner over a 60-second period.

I'm beginning to think this is another case of "too good to be true". Just my take.


Further, the linked tweet above quotes the same benchmark wiki, claiming a 68% increase. Sorry, but that's misleading and simply wrong.

frevib commented 4 years ago

To understand these performance gains that the echo server is claiming, maybe some extra context is needed.

Of course, an echo server is not a real use case, so have a look at some of the "real" implementations like Netty or Node.js.

ghost commented 4 years ago

If you use 1 connection and 128 bytes the performance increase is minimal.

We know. It's obvious. That's why we tested with 1k connections, and epoll was faster (it got a bigger lead the more clients there were, the opposite of what is claimed).

have a look at some of the "real" implementations like Netty or NodeJS.

Node.js is the most nondeterministic, unfair, and imprecise benchmark of a kernel feature. You cannot possibly mean to use a highly bloated JavaScript environment, with nondeterministic garbage collection and JIT compilation, to reliably benchmark a kernel feature.

jlengrand commented 4 years ago

Then maybe have a look at Netty, if it suits you better as OP suggested? https://github.com/netty/netty/issues/10622#issuecomment-701241587

frevib commented 4 years ago

@alexhultman when I have some time left I will try to find the right commit that does not use IORING_OP_PROVIDE_BUFFERS. I'm quite out of the io_uring scene at the moment, so for now please take this software as-is. I'm also pretty sure more people have created examples without IORING_OP_PROVIDE_BUFFERS; it really does perform better than epoll ☺️

ghost commented 4 years ago

I will try and find the right commit that does not use IORING_OP_PROVIDE_BUFFERS

Great, thanks!

Then maybe have a look at Netty, if it suits you better as OP suggested?

What are you even talking about? I am OP and I have no intention whatsoever in any Java or JavaScript wrapper. This is a kernel feature in C, not a wrapper in some nondeterministic garbage collected virtual machine.

vincentfree commented 4 years ago

Netty uses C via JNI, and it has implementations for both epoll and now io_uring, so it is a good reference for seeing the difference when io_uring is used by a library like this.

ghost commented 4 years ago

Every application written in the history of mankind uses C on Linux; it is the main gateway to the kernel. By that logic you could argue Ruby on Rails makes a good Linux kernel benchmark because, under the hood, it too is C.

Of course, anyone with more than 2 brain cells can see that any such benchmark would be massively tainted by the fact that you are also driving this whole mountain of bloat, making the benchmark as a whole useless. Only a minimal C client directly using the syscalls / liburing makes sense here.

All of Java is built on JNI at some level (see the logic above, it has to be), and JNI has a demonstrated 4x FFI overhead, because you are executing in a virtual environment, much like an operating system inside an operating system. So you essentially have a measuring stick that taints every result by "a shit ton".

It's like measuring the size of an atom using your thumb and a squinted eye.

jlengrand commented 4 years ago

Dude, you made your point. We understand it, and of course you are right. A lifetime ago I worked with embedded systems too. We could keep diving and say that all applications in the history of mankind compile to assembly, and we would be no further along in the discussion...

I'm sorry to be stupid, close minded or whatever you wanna call it, but if systems / libraries / languages as popular as Netty and Node are taking serious interest in io_uring, maybe there is an actual reason. And if most of those report benefits, maybe there is also a reason.

Now, can you please stop the toxicity for more than a second and put the same amount of energy into making a PR that showcases what you describe? @frevib has put quite a bit of energy into making that repo, trying to be as clear as possible. If you think it's not clear, we'd all benefit from your improvements. And I'm serious, we would.

We are all educated folks here, can we behave as such and simply try to improve the platform?

1Jo1 commented 4 years ago

You should try IOSQE_ASYNC: the request is then executed directly in the worker pool. It gave 124% better performance in Netty (using non-blocking sockets) :)

ghost commented 4 years ago

but if systems / libraries / languages as popular as Netty and Node are taking serious interest in io_uring, maybe there is an actual reason. And if most of those report benefits, maybe there is also a reason.

The Node.js sphere is not driven by logic, it is driven by hype and nothing else. I know this intimately from experience. "io_uring" is hype right now. That's why they added it. Node.js will not be affected one single bit by it, because they have way, way, way more serious problems in other places in their stack. That is why their existing epoll path is executing at less than 10% of what epoll is capable of. So Node.js is the worst possible example of a reliable kernel benchmark.

I'm going to rerun the benchmarks when @frevib posts the commit.

jlengrand commented 4 years ago

The Node.js sphere is not driven by logic, it is driven by hype and nothing else. I know this intimately from experience. "io_uring" is hype right now. That's why they added it. Node.js will not be affected one single bit by it, because they have way, way, way more serious problems in other places in their stack. That is why their existing epoll path is executing at less than 10% of what epoll is capable of. So Node.js is the worst possible example of a reliable kernel benchmark.

Mostly agree. I also agree the benchmark should run as close to the kernel as possible.

@frevib Let me know how I can help to create something more reproducible. I know how busy you are at work lately.

eecheng87 commented 1 year ago

Hi, all

I still can't reproduce the benchmarking result. Any update about this?

Thanks!

Mathnerd314 commented 1 year ago

As I understand @frevib's responses, this project is dead/abandoned. You can look at other benchmarks:

These results confirm the original observation, that io_uring can offer significant speedups in some cases.

Benchmarks come with the usual caveats: they cover highly specific cases that most likely do not match your intended use. So it is better to actually try switching backends in your application than to waste time resurrecting a dead project.

frevib commented 1 year ago

This project is indeed quite stale. But if there are any bugfixes or improvements I can merge them.

The benchmarking results should be reproducible if you use the exact same specs: Linux kernel version, liburing version, benchmark tools, etc. There have been so many changes since then that the benchmark results are most likely different now.

What @Mathnerd314 states seems about right: "io_uring can offer significant speedups in some cases". Some projects benefit, others not at all. Have a look at Netty, for instance; they see a nice perf increase.