crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.42k stars 1.62k forks source link

Linux's IO_Uring interface (2x IO performance!) #10740

Open lbguilherme opened 3 years ago

lbguilherme commented 3 years ago

Since Linux 5.1 a new interface for asynchronous IO was released, making it possible to do IO with zero (or very few) system calls. It works by keeping a lock-free IO submission queue and a lock-free IO completion queue in a memory shared between the kernel and the application (thus reducing context switches). The performance improvements are impressive.

I began experimenting with replacing LibEvent with IoUring in Crystal, and the results were quite good. I'm currently benchmarking with the sample HTTP Server (without preview_mt):

require "http/server"

{% if flag?(:preview_iouring) %}
  require "./stdlib_patch"
{% end %}

server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world!"
end

address = server.bind_tcp 8080
puts "Listening on http://#{address}"
server.listen

With LibEvent2 (Crystal 1.0.0 unchanged)

$ wrk -t12 -c400 -d30s http://127.0.0.1:8080
Running 30s test @ http://127.0.0.1:8080
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.91ms    2.27ms 207.65ms   71.96%
    Req/Sec     6.77k   749.15    11.23k    79.17%
  2426542 requests in 30.05s, 233.73MB read
Requests/sec:  80758.47
Transfer/sec:      7.78MB

With io_uring (Crystal 1.0.0, with -Dpreview_iouring)

$ wrk -t12 -c400 -d30s http://127.0.0.1:8080
Running 30s test @ http://127.0.0.1:8080
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   653.79us  357.50us  37.98ms   98.11%
    Req/Sec    26.79k    17.49k  185.83k    58.06%
  5574025 requests in 30.10s, 536.90MB read
Requests/sec: 185189.62
Transfer/sec:     17.84MB

This is more than double the performance! And it can probably be tuned even further.

But...

io_uring only got support for sockets with Linux 5.6 (May 2020) and this version definitely isn't used everywhere yet. Even worse: WSL2 used a 5.6 kernel with io_uring disabled (for no reason at all).

That said I see three paths going forward:

  1. io_uring is added to the standard library and is enabled behind a -Dpreview_iouring flag. That way only those who can afford it will enable it. The resulting binary wouldn't need to link to libevent2 at all.
  2. io_uring is added to the standard library and the runtime would detect Kernel support for it at initialization, falling back to libevent2 as needed.
  3. io_uring is released as a Shard that does monkey-patching into internal bits of the Crystal module. Its versions would be tied to specific Crystal versions since that module is not protected by SemVer.

Here I would like to understand if options 1 or 2 are desirable (if so, I'll work on a PR).

asterite commented 3 years ago

Wow, that performance gain is amazing!

I personally think option 2 is the best one because it just works: no need to pass flags or choose anything. But I'm not sure how can we detect it's supported (but that's a burden on us and it's better than a burden on the user. )

Great job!

j8r commented 3 years ago

Yes option 2 is definitely the best. It is possible to detect at runtime and/or compile time by reading /proc/kallsyms. More information at https://unix.stackexchange.com/questions/596276/how-to-tell-if-a-linux-machine-supports-io-uring

jhass commented 3 years ago

Yes this looks very promising, for eventually getting rid of the libevent2 dependency if nothing else.

However I'd like to hear some more opinions from core how they feel of potentially having to maintain two implementations for quite a while.

If everybody is fine with it I'd vote for runtime detection as the default with compile time flags for force uring or force libevent2

bcardiff commented 3 years ago

I think option 2 sounds great. But I think it would be good to have for a while a flag to select between auto | libevent | iouring as a way to know exactly what mechanism is used.

If a program that supports iouring can be used as is in a system without iouring, then runtime check of this flag would be enough and my preferred way.

j8r commented 3 years ago

I think option 2 sounds great. But I think it would be good to have for a while a flag to select between auto | libevent | iouring as a way to know exactly what mechanism is used.

More an environment variable then, rather than a compile time flag?

bcardiff commented 3 years ago

Yes @j8r. When I said "a flag" I was not thinking of a compile time flag. I should have said env var to be more clear.

jhass commented 3 years ago

I think I would prefer a compile time flag or both. We have a tendency to distribute by source so recompiling is usually easy and compile time flag allows to build binaries that don't depend on libevent2.

beta-ziliani commented 3 years ago

This is awesome <3!

I second the idea of being able to choose the interface. With respect to

However I'd like to hear some more opinions from core how they feel of potentially having to maintain two implementations for quite a while.

I understand and share the hesitation, but this is just too good to let it pass. I like to keep compatibility with older releases, so I don't think libevent2 should go in any foreseeable time. Therefore, we should try to keep the two options for some while.

@lbguilherme we welcome enthusiastically your PR, and if you think you can help us with this until it gets stable enough, it would be even more greatly appreciated.

yxhuvud commented 3 years ago

Nice to see other people looking at io_uring!

I began experimenting with replacing LibEvent with IoUring in Crystal

What approach are you using? Are you keeping the poll based approach (ie emitting IORING_OP_POLL_ADD events) or are you working with direct read and writes? The former maps more closely to the current Crystal implementation and abstractions and may be simpler to get working, while the latter is theoretically a bit more efficient (and the resulting implementation can be made cleaner, IMO). The latter also might require a slightly higher kernel version to get going due to support for writing(reading to ttys and pipes was pretty late (5.9 perhaps?).

One part where I haven't seen an obvious way forward are DNS lookups. Do you have any idea of how to implement them in an asynch manner under io_uring? Currently Crystal uses some plugin specially made for libevent or something like that, right?

With LibEvent2 (Crystal 1.0.0 unchanged) Requests/sec: 80758.47

With io_uring (Crystal 1.0.0, with -Dpreview_iouring) Requests/sec: 185189.62

This broadly matches the results I got when I benchmarked normal crystal libevent vs raw io_uring (using uring bindings from https://github.com/yxhuvud/ior) with https://github.com/haraldh/rust_echo_bench to evaluate. I even got a slightly higher difference but that was probably due to http server overhead instead of using raw sockets.

This is more than double the performance! And it can probably be tuned even further.

However, other people that benchmarks tend to get 'only' 5%-50% improvement when comparing socket IO over epoll vs io_uring, so this has led me to believe that most of the improvement comes from not using libevent and the overheads from that, even if io_uring of course helps out too.

io_uring is released as a Shard that does monkey-patching into internal bits of the Crystal module. Its versions would be tied to specific Crystal versions since that module is not protected by SemVer.

One shard that is taking steps on this direction is my https://github.com/yxhuvud/nested_scheduler . Support for io_uring is there but not documented yet as I have yet to figure out some issues with yielding that doesn't fully behave as expected yet. I guess the part most relevant for this discussion is https://github.com/yxhuvud/nested_scheduler/blob/main/src/nested_scheduler/io_uring_context.cr as it shows how well the raw uring ops maps to Crystal operations. One thing to note here is that each underlying thread get a uring context instance of their own, meaning it will use separate io_uring file descriptors for each.

And yes, I know I will have to do major refactorings when the changes straight-shoota is making for Windows support is released :)

References for anyone wanting to play around: raw liburing bindings for crystal: https://github.com/yxhuvud/ior scheduler hack able to use io_uring: https://github.com/yxhuvud/nested_scheduler

beta-ziliani commented 3 years ago

Hey @yxhuvud and @lbguilherme , we're still receiving talks for the Crystal Conf 1.0 in case you want to talk about what you did with the IO_Uring interface ;-) ;-) http://man.as/crystal-cft

lbguilherme commented 3 years ago

Nice! I'll be going for a PR introducing it to the stdlib.

I think option 2 sounds great. But I think it would be good to have for a while a flag to select between auto | libevent | iouring as a way to know exactly what mechanism is used.

About flags, I don't see much value in a runtime flag given that we can detect support for io_uring at runtime. The only reason to disable it would be in case of bugs. So I propose the following:

Seems fine?

What approach are you using? Are you keeping the poll based approach (ie emitting IORING_OP_POLL_ADD events) or are you working with direct read and writes?

For my prototype, I'm doing system calls directly (without liburing) and I'm doing reads/writes with it instead of polling for maximum performance. About multi-threading I started with a single ring shared between threads, but it didn't perform very well. I'll try to optimize the synchronization, but I'll probably move to one ring per thread like you did.

I'm currently working with Linux 5.12, but I'll go down as low as possible trying to work around what I can. I think the minimum will be 5.6, but it could be 5.9 because of the TTY issue you mentioned.

Nice work you did there!

yxhuvud commented 3 years ago

@beta-ziliani I feel things are not yet at a state where I feel I have something to show in time for the conference. Both nested_scheduler itself (as a step towards #6468) and the uring part could be worth a talk eventually but right now neither of them are sufficiently developed to talk about.

I will get a lot of free time to work on it during the summer so hopefully I'll have something eventually, but setting a deadline seems hard at the moment as issues tend to be very hard to diagnose or estimate until I figure them out.

@lbguilherme

I'm doing system calls directly

Impressive. There are enough atomics and other complex things involved (especially in the initialization) that I don't trust myself doing that. I'm planning to figure out how to do an in directory compilation and linking of liburing for ior eventually, to avoid the need to install stuff as root, but I haven't had the time yet.

bcardiff commented 3 years ago

The benefit of a runtime flag is having the same binary to switch the underlying event implementation. This can be helpful for doing benchmarks easier but also to allow a deployment to switch how it is working. If there is a bug in the new io_uring it will be easier and handy to turn it off and compare against the current implementation.

Being opt-in and then opt-out is definitely the way to go. It can work as either compile-time or runtime.

paulocoghi commented 2 years ago

I'm looking forward to see https://github.com/crystal-lang/crystal/pull/10768 being merged!

Just my 2c, IMHO the direct syscall strategy followed by @lbguilherme seems positive, since Crystal will not depend on liburing. I understand that using liburing would be somewhat easier, but since the heavy-lifting and the hard part is already done (congrats again, Guilherme), now Crystal can benefit from it.

ghost commented 2 years ago

Merge - When? 😋

straight-shoota commented 1 year ago

In response to https://github.com/crystal-lang/crystal/pull/10768#issuecomment-1286932845:

I would very much like to merge ui_uring support. It introduces some additional complexity to make the event loop interface more flexible and especially being able to select the implementation at runtime. But this should be well worth it considering the performance gains it promises.

Also I figure it would be quite impractical to implement this in user code. Without the interface changes it would be impossible to integrate with the event loop. An external implementation would have to use its own dedicated API for that and you'd have to use that explicitly if you want io_uring.

straight-shoota commented 1 year ago

For an efficient review, it would be helpful to split the refactoring of the EventLoop interface from the io_uring-based implementation. The refactor just a small part of the changes in #10768, but it would at least reduce the complexity a bit by containing related changes. A benefit of that would be that we can probably merge that relatively quickly. And then focus on polishing and merging the io_uring implementation. @lbguilherme WDYT? Would you be able to do that, or do you prefer someone to help with extracting the refactor?

lbguilherme commented 1 year ago

I agree, those are easily splitable and will simplify the review. I'll do that.

paulocoghi commented 10 months ago

It has now been 3 and a half years since the release of kernel 5.6 and WSL2 is now using kernel 5.15 (see here).

Also, Guilherme did a great job on https://github.com/crystal-lang/crystal/pull/12656, leaving https://github.com/crystal-lang/crystal/pull/10768 ready for its final review near its completion.

This is huge and can bring a lot of new users to Crystal, taking advantage of its top class performance for web backends.

straight-shoota commented 10 months ago

AFAIU #10768 is not finished. CI is failing. So that'll need some polish still before being ready to merge.

paulocoghi commented 10 months ago

Sorry. You are right. I updated my previous comment.

ysbaddaden commented 9 months ago

According to the wikipedia article, there are regular security issues with io_uring. Searching Mitre, there has been 11 CVE in 2022, 14 in 2023, and already 1 for 2024. Google, for example, disabled io_uring on Android, ChromeOS and GKE Autopilot. Here is what they have to say about io_uring (June 2023):

While io_uring brings performance benefits, and promptly reacts to security issues with comprehensive security fixes (like backporting the 5.15 version to the 5.10 stable tree), it is a fairly new part of the kernel. As such, io_uring continues to be actively developed, but it is still affected by severe vulnerabilities and also provides strong exploitation primitives. For these reasons, we currently consider it safe only for use by trusted components. https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html

So, flags and environment variables to enable/disable io_uring and detecting whether it's really available (and allowed) seem like a good idea to have :smile:

paulocoghi commented 9 months ago

So, flags and environment variables to enable/disable io_uring and detecting whether it's really available (and allowed) seem like a good idea to have

Agree. This is a more prudent approach and, at the same time, does not block the completion of current work.

From a different point of view, it is positive that Crystal has support for io_uring working (even if under flags) because when it _(io_uring)_ becomes mature and safe, Crystal will be ready, which could be a competitive advantage.

crysbot commented 1 day ago

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/charting-the-route-to-multi-threading-support/7320/1