lbguilherme opened this issue 3 years ago
Wow, that performance gain is amazing!
I personally think option 2 is the best one because it just works: no need to pass flags or choose anything. But I'm not sure how we can detect that it's supported (that's a burden on us, but it's better than a burden on the user).
Great job!
Yes, option 2 is definitely the best. It is possible to detect at runtime and/or compile time by reading `/proc/kallsyms`. More information at https://unix.stackexchange.com/questions/596276/how-to-tell-if-a-linux-machine-supports-io-uring
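The `/proc/kallsyms` check from that link can be sketched in C. This is only an illustration, not code from the Crystal patch; the helper name `kallsyms_has_io_uring` is made up, and the path is a parameter so the scanning logic can be exercised against a fixture file:

```c
#include <stdio.h>
#include <string.h>

/* Scan a kallsyms-format file for the io_uring_setup symbol.
 * Returns 1 if found, 0 if not, -1 if the file can't be read.
 * `path` is normally "/proc/kallsyms". */
int kallsyms_has_io_uring(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[512];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        /* Each line looks like: "ffffffff812a3c10 T io_uring_setup" */
        if (strstr(line, " io_uring_setup")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Note that on hardened systems the symbol addresses may be zeroed out for unprivileged users, but the symbol names are still listed, so a name-based scan like this still works.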
Yes this looks very promising, for eventually getting rid of the libevent2 dependency if nothing else.
However, I'd like to hear some more opinions from core about how they feel about potentially having to maintain two implementations for quite a while.
If everybody is fine with it, I'd vote for runtime detection as the default, with compile-time flags to force io_uring or force libevent2.
I think option 2 sounds great. But I think it would be good to have, for a while, a flag to select between `auto | libevent | iouring` as a way to know exactly what mechanism is used.
If a program that supports iouring can be used as is in a system without iouring, then runtime check of this flag would be enough and my preferred way.
> I think option 2 sounds great. But I think it would be good to have, for a while, a flag to select between `auto | libevent | iouring` as a way to know exactly what mechanism is used.
More an environment variable then, rather than a compile-time flag?
Yes @j8r. When I said "a flag" I was not thinking of a compile time flag. I should have said env var to be more clear.
I think I would prefer a compile-time flag, or both. We have a tendency to distribute by source, so recompiling is usually easy, and a compile-time flag allows building binaries that don't depend on libevent2.
This is awesome <3!
I second the idea of being able to choose the interface. With respect to:

> However, I'd like to hear some more opinions from core about how they feel about potentially having to maintain two implementations for quite a while.
I understand and share the hesitation, but this is just too good to let it pass. I like to keep compatibility with older releases, so I don't think libevent2 should go in any foreseeable time. Therefore, we should try to keep the two options for some while.
@lbguilherme we enthusiastically welcome your PR, and if you think you can help us with this until it gets stable enough, it would be even more greatly appreciated.
Nice to see other people looking at io_uring!
> I began experimenting with replacing LibEvent with IoUring in Crystal
What approach are you using? Are you keeping the poll-based approach (i.e. emitting IORING_OP_POLL_ADD events), or are you working with direct reads and writes? The former maps more closely to the current Crystal implementation and abstractions and may be simpler to get working, while the latter is theoretically a bit more efficient (and the resulting implementation can be made cleaner, IMO). The latter also might require a slightly higher kernel version, since support for writing/reading to ttys and pipes arrived pretty late (5.9, perhaps?).
One part where I haven't seen an obvious way forward is DNS lookups. Do you have any idea of how to implement them in an async manner under io_uring? Currently Crystal uses some plugin specially made for libevent, or something like that, right?
> With LibEvent2 (Crystal 1.0.0 unchanged): Requests/sec: 80758.47
> With io_uring (Crystal 1.0.0, with `-Dpreview_iouring`): Requests/sec: 185189.62
This broadly matches the results I got when I benchmarked normal crystal libevent vs raw io_uring (using uring bindings from https://github.com/yxhuvud/ior) with https://github.com/haraldh/rust_echo_bench to evaluate. I even got a slightly higher difference but that was probably due to http server overhead instead of using raw sockets.
> This is more than double the performance! And it can probably be tuned even further.
However, other people's benchmarks tend to show 'only' a 5%-50% improvement when comparing socket IO over epoll vs io_uring, which has led me to believe that most of the improvement comes from not using libevent and its overheads, even if io_uring of course helps too.
> io_uring is released as a Shard that does monkey-patching into internal bits of the `Crystal` module. Its versions would be tied to specific Crystal versions since that module is not protected by SemVer.
One shard that is taking steps in this direction is my https://github.com/yxhuvud/nested_scheduler. Support for io_uring is there but not documented yet, as I still have to figure out some issues with yielding that doesn't fully behave as expected. I guess the part most relevant for this discussion is https://github.com/yxhuvud/nested_scheduler/blob/main/src/nested_scheduler/io_uring_context.cr, as it shows how well the raw uring ops map to Crystal operations. One thing to note here is that each underlying thread gets a uring context instance of its own, meaning it will use a separate io_uring file descriptor per thread.
And yes, I know I will have to do major refactorings when the changes straight-shoota is making for Windows support are released :)
References for anyone wanting to play around:
- raw liburing bindings for Crystal: https://github.com/yxhuvud/ior
- scheduler hack able to use io_uring: https://github.com/yxhuvud/nested_scheduler
Hey @yxhuvud and @lbguilherme , we're still receiving talks for the Crystal Conf 1.0 in case you want to talk about what you did with the IO_Uring interface ;-) ;-) http://man.as/crystal-cft
Nice! I'll be going for a PR introducing it to the stdlib.
> I think option 2 sounds great. But I think it would be good to have, for a while, a flag to select between `auto | libevent | iouring` as a way to know exactly what mechanism is used.
About flags, I don't see much value in a runtime flag given that we can detect support for io_uring at runtime. The only reason to disable it would be in case of bugs. So I propose the following:

- Short term: io_uring support is gated behind a `preview_iouring` flag. If enabled, the resulting binary would perform some runtime checks to decide between libevent2 and io_uring at runtime, picking the latter if possible. There should be no user-visible differences, so this is more an attempt to prevent issues.
- Mid term: io_uring becomes the default (the `preview_iouring` flag ceases to exist) and all programs perform runtime checks. Maybe we could introduce a `CRYSTAL_DISABLE_IO_URING=1` env to force detection to fail. It should be stable at this stage, so I'm not sure if it's needed.
- Long term: a flag to disable `libevent2`, relying only on io_uring. The objective is to not link to `libevent2` at all. This should never be the default (not for some years at least). Maybe `force_iouring` or `disable_libevent`. This is optional.

Seems fine?
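The runtime selection with an env-var kill switch could look roughly like the following C sketch. The names `select_backend` and `io_uring_probe_ok` are hypothetical (nothing here is taken from the actual Crystal patch); the probe simply tries to create a tiny ring and treats any failure as "unavailable":

```c
#include <linux/io_uring.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Runtime probe: try to create a tiny ring. Any failure
 * (old kernel -> ENOSYS, seccomp sandbox -> EPERM, ...)
 * means io_uring is not usable for this process. */
static int io_uring_probe_ok(void) {
    struct io_uring_params p;
    memset(&p, 0, sizeof p);
    int fd = (int)syscall(__NR_io_uring_setup, 1, &p);
    if (fd < 0)
        return 0;
    close(fd);
    return 1;
}

/* Pick the event-loop backend, honoring the proposed
 * CRYSTAL_DISABLE_IO_URING=1 override to force detection to fail. */
const char *select_backend(void) {
    const char *env = getenv("CRYSTAL_DISABLE_IO_URING");
    if (env && strcmp(env, "1") == 0)
        return "libevent2";
    return io_uring_probe_ok() ? "io_uring" : "libevent2";
}
```

Probing with a real (tiny) ring creation rather than just checking the kernel version has the advantage that it also catches environments where the syscall exists but is blocked, e.g. by a seccomp policy.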
> What approach are you using? Are you keeping the poll based approach (ie emitting IORING_OP_POLL_ADD events) or are you working with direct read and writes?
For my prototype, I'm doing system calls directly (without liburing), and I'm doing reads/writes with it instead of polling, for maximum performance. As for multi-threading, I started with a single ring shared between threads, but it didn't perform very well. I'll try to optimize the synchronization, but I'll probably move to one ring per thread like you did.
I'm currently working with Linux 5.12, but I'll go down as low as possible trying to work around what I can. I think the minimum will be 5.6, but it could be 5.9 because of the TTY issue you mentioned.
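A minimum-kernel check like the one implied here (5.6 for sockets, possibly 5.9 for ttys/pipes) can be done by parsing the release string from `uname(2)`. This is an illustrative sketch; the helper names are made up:

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Parse a "5.12.0-generic"-style release string into major/minor.
 * Returns 1 on success, 0 on parse failure. */
int parse_release(const char *release, int *major, int *minor) {
    return sscanf(release, "%d.%d", major, minor) == 2;
}

/* Does `release` name a kernel >= req_major.req_minor? */
int kernel_at_least(const char *release, int req_major, int req_minor) {
    int maj, min;
    if (!parse_release(release, &maj, &min))
        return 0;
    return maj > req_major || (maj == req_major && min >= req_minor);
}

/* Same check against the running kernel, via uname(2). */
int running_kernel_at_least(int req_major, int req_minor) {
    struct utsname u;
    if (uname(&u) != 0)
        return 0;
    return kernel_at_least(u.release, req_major, req_minor);
}
```

A version check alone can't tell whether a distribution disabled io_uring (as WSL2 did at one point), so in practice it would be combined with an actual runtime probe.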
Nice work you did there!
@beta-ziliani I feel things are not yet at a state where I feel I have something to show in time for the conference. Both nested_scheduler itself (as a step towards #6468) and the uring part could be worth a talk eventually but right now neither of them are sufficiently developed to talk about.
I will get a lot of free time to work on it during the summer so hopefully I'll have something eventually, but setting a deadline seems hard at the moment as issues tend to be very hard to diagnose or estimate until I figure them out.
@lbguilherme

> I'm doing system calls directly

Impressive. There are enough atomics and other complex things involved (especially in the initialization) that I don't trust myself doing that. I'm planning to figure out how to do an in-directory compilation and linking of liburing for `ior` eventually, to avoid the need to install stuff as root, but I haven't had the time yet.
The benefit of a runtime flag is having the same binary switch the underlying event implementation. This makes benchmarking easier, but also allows a deployment to change how it works. If there is a bug in the new io_uring implementation, it will be easy and handy to turn it off and compare against the current implementation.
Being opt-in and then opt-out is definitely the way to go. It can work as either compile-time or runtime.
I'm looking forward to see https://github.com/crystal-lang/crystal/pull/10768 being merged!
Just my 2c: IMHO the direct-syscall strategy followed by @lbguilherme seems positive, since Crystal will not depend on liburing. I understand that using liburing would be somewhat easier, but since the heavy lifting and the hard part are already done (congrats again, Guilherme), now Crystal can benefit from it.
Merge - When? 😋
In response to https://github.com/crystal-lang/crystal/pull/10768#issuecomment-1286932845:
I would very much like to merge io_uring support. It introduces some additional complexity to make the event loop interface more flexible, especially being able to select the implementation at runtime. But this should be well worth it considering the performance gains it promises.
Also I figure it would be quite impractical to implement this in user code. Without the interface changes it would be impossible to integrate with the event loop. An external implementation would have to use its own dedicated API for that and you'd have to use that explicitly if you want io_uring.
For an efficient review, it would be helpful to split the refactoring of the `EventLoop` interface from the `io_uring`-based implementation. The refactor is just a small part of the changes in #10768, but it would at least reduce the complexity a bit by containing related changes. A benefit of that is that we could probably merge it relatively quickly, and then focus on polishing and merging the `io_uring` implementation.
@lbguilherme WDYT? Would you be able to do that, or do you prefer someone to help with extracting the refactor?
I agree, those are easily splittable, and it will simplify the review. I'll do that.
It has now been 3 and a half years since the release of kernel 5.6 and WSL2 is now using kernel 5.15 (see here).
Also, Guilherme did a great job on https://github.com/crystal-lang/crystal/pull/12656, leaving https://github.com/crystal-lang/crystal/pull/10768 ~~ready for its final review~~ near its completion.
This is huge and can bring a lot of new users to Crystal, taking advantage of its top class performance for web backends.
AFAIU #10768 is not finished. CI is failing. So that'll need some polish still before being ready to merge.
Sorry. You are right. I updated my previous comment.
According to the Wikipedia article, there are regular security issues with io_uring. Searching MITRE, there were 11 CVEs in 2022, 14 in 2023, and already 1 for 2024. Google, for example, disabled io_uring on Android, ChromeOS, and GKE Autopilot. Here is what they have to say about io_uring (June 2023):

> While io_uring brings performance benefits, and promptly reacts to security issues with comprehensive security fixes (like backporting the 5.15 version to the 5.10 stable tree), it is a fairly new part of the kernel. As such, io_uring continues to be actively developed, but it is still affected by severe vulnerabilities and also provides strong exploitation primitives. For these reasons, we currently consider it safe only for use by trusted components.

https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html

So, flags and environment variables to enable/disable io_uring, and detecting whether it's really available (and allowed), seem like a good idea to have :smile:
> So, flags and environment variables to enable/disable io_uring and detecting whether it's really available (and allowed) seem like a good idea to have
Agree. This is a more prudent approach and, at the same time, does not block the completion of current work.
From a different point of view, it is positive that Crystal has working support for io_uring (even if behind flags), because when io_uring becomes mature and safe, Crystal will be ready, which could be a competitive advantage.
This issue has been mentioned on Crystal Forum. There might be relevant details there:
https://forum.crystal-lang.org/t/charting-the-route-to-multi-threading-support/7320/1
Since Linux 5.1 a new interface for asynchronous IO was released, making it possible to do IO with zero (or very few) system calls. It works by keeping a lock-free IO submission queue and a lock-free IO completion queue in a memory shared between the kernel and the application (thus reducing context switches). The performance improvements are impressive.
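The shared submission/completion queue mechanism described above can be illustrated with a minimal, dependency-free C sketch that submits a single no-op (`IORING_OP_NOP`) through the raw syscalls, without liburing. This is not the Crystal implementation, just a demonstration of the ring setup, and it degrades gracefully on kernels or sandboxes without io_uring:

```c
#define _GNU_SOURCE
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Submit one IORING_OP_NOP and wait for its completion.
 * Returns 1 if the NOP completed, 0 if io_uring is unavailable. */
int try_uring_nop(void) {
    struct io_uring_params p;
    memset(&p, 0, sizeof p);
    int fd = (int)syscall(__NR_io_uring_setup, 4, &p);
    if (fd < 0)
        return 0; /* ENOSYS on old kernels, EPERM in sandboxes, ... */

    /* Map the SQ ring, CQ ring, and SQE array shared with the kernel. */
    size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    char *sq = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
    char *cq = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
    struct io_uring_sqe *sqes =
        mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
             IORING_OFF_SQES);
    if (sq == MAP_FAILED || cq == MAP_FAILED || sqes == MAP_FAILED) {
        close(fd);
        return 0;
    }

    /* Queue one NOP in slot 0 of the submission queue. */
    memset(&sqes[0], 0, sizeof sqes[0]);
    sqes[0].opcode = IORING_OP_NOP;
    sqes[0].user_data = 42;

    unsigned *sq_tail = (unsigned *)(sq + p.sq_off.tail);
    unsigned *sq_mask = (unsigned *)(sq + p.sq_off.ring_mask);
    unsigned *sq_array = (unsigned *)(sq + p.sq_off.array);
    unsigned tail = *sq_tail;
    sq_array[tail & *sq_mask] = 0;
    /* Publish the new tail with release semantics so the kernel sees
     * the filled SQE before the tail update. */
    __atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);

    /* Tell the kernel about the submission and wait for one completion. */
    if (syscall(__NR_io_uring_enter, fd, 1, 1,
                IORING_ENTER_GETEVENTS, NULL, 0) < 0) {
        close(fd);
        return 0;
    }

    unsigned *cq_head = (unsigned *)(cq + p.cq_off.head);
    unsigned *cq_mask = (unsigned *)(cq + p.cq_off.ring_mask);
    struct io_uring_cqe *cqes = (struct io_uring_cqe *)(cq + p.cq_off.cqes);
    struct io_uring_cqe *cqe = &cqes[*cq_head & *cq_mask];
    int ok = (cqe->user_data == 42 && cqe->res == 0);
    __atomic_store_n(cq_head, *cq_head + 1, __ATOMIC_RELEASE);
    close(fd);
    return ok ? 1 : 0;
}
```

Once the rings are mapped, submitting further operations costs no syscall at all until `io_uring_enter` is needed to notify the kernel, which is where the "zero (or very few) system calls" claim comes from.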
I began experimenting with replacing LibEvent with IoUring in Crystal, and the results were quite good. I'm currently benchmarking with the sample HTTP Server (without `preview_mt`):

- With LibEvent2 (Crystal 1.0.0 unchanged): Requests/sec: 80758.47
- With io_uring (Crystal 1.0.0, with `-Dpreview_iouring`): Requests/sec: 185189.62

This is more than double the performance! And it can probably be tuned even further.
But...
io_uring only got support for sockets with Linux 5.6 (May 2020) and this version definitely isn't used everywhere yet. Even worse: WSL2 used a 5.6 kernel with io_uring disabled (for no reason at all).
That said, I see three paths going forward:

1. io_uring support is hidden behind a `-Dpreview_iouring` flag. That way only those who can afford it will enable it. The resulting binary wouldn't need to link to libevent2 at all.
2. io_uring support is always compiled in, and the runtime detects whether the kernel supports it, falling back to libevent2 when it doesn't.
3. io_uring is released as a Shard that does monkey-patching into internal bits of the `Crystal` module. Its versions would be tied to specific Crystal versions since that module is not protected by SemVer.

Here I would like to understand if options 1 or 2 are desirable (if so, I'll work on a PR).