lbguilherme opened 3 years ago
Are you suggesting to move IO operations directly into the event loop? Not sure that's a wise idea. I think they should stay confined to the respective IO implementations because this tends to be very application-specific. There are differences between IO operations on a socket vs. on a file, even if they technically use the same fd interface and functions.
For the win32 port I think I have this worked out quite well. The IOCP variant of the event loop forms the basis, and I introduced an `IO::Overlapped` module similar to `IO::Evented` which handles and abstracts the overlapped IO for the socket implementation, and integrates with the event loop.
There is some overlap with `IO::Evented`, though, and I believe some of the interfaces could be more refined.
I understand the integration of io_uring is going to be a bit more complex, because we essentially need to support it together with libevent in a single program. With libevent and iocp this is not an issue, because they're confined to POSIX and win32 platforms, respectively.
```crystal
# Runs the event loop.
def self.run_once : Nil
end
```
Eh, I would consider `run_once` to be just an implementation detail of libevent. What would be needed is a loop-specific implementation of `Scheduler#reschedule`. It really is not advisable to view the event loop separately from the scheduler - especially not in the case of io_uring, where the same syscall is used for both submitting and fetching events, and you can do both in the same call. (EDIT to clarify: there are certainly some parts of `reschedule` that still belong in `Scheduler`.)
> I think they should stay confined to the respective IO implementations because this tends to be very application-specific.
Problem is that they simultaneously also end up very event-loop specific. Always mixing everything into the different IO classes is not super great. It could be that the nicest structure available is to have a second object sitting between the IO classes and what we now call the event loop, and route all interactions with the event loop through that.
EDIT: BTW, the refactorings already done for sockets related to the Windows port are a good step in the right direction here; it is really nice to have everything collected in the same place, even if it is in a module and thus not as self-contained as it could be.
Yeah, the typical system-specific implementations are based on modules, because there has only ever been a single viable strategy for any target triple. But when we need to select between libevent and io_uring at runtime, that no longer works. We need a different strategy then.
I suppose it shouldn't be a big change to go from `system_*` methods in a module to a dedicated type that encapsulates all the IO operations and event loop interaction. The API wouldn't need to change much for that.
This isn't necessarily the only option, but it should be pretty doable.
For reference, this is the relevant event loop API in my IOCP port:
```crystal
module Crystal::EventLoop
  # Runs the event loop.
  def self.run_once : Nil
  end

  # Reinitializes the event loop after a fork.
  def self.after_fork : Nil
  end

  def self.enqueue(event : Event)
  end

  def self.dequeue(event : Event)
  end

  # Create a new resume event for a fiber.
  def self.create_resume_event(fiber : Fiber) : Crystal::Event
  end
end

struct Crystal::Event
  getter fiber
  getter wake_at

  def initialize(@fiber : Fiber, @wake_at = Time.monotonic)
  end

  # Frees the event
  def free : Nil
  end

  def add(time_span : Time::Span) : Nil
  end
end
```
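To give a sense of how these pieces fit together, here is a sketch (my illustration, not code from the port) of how a fiber-level sleep could sit on top of this API, assuming the existing `Crystal::Scheduler.reschedule` and `Fiber.current`:

```crystal
# Sketch only: glosses over event lifetime/reuse.
def sleep(duration : Time::Span) : Nil
  event = Crystal::EventLoop.create_resume_event(Fiber.current)
  event.add(duration)           # arm the event to fire after `duration`
  Crystal::Scheduler.reschedule # suspend; the loop resumes the fiber once the event fires
end
```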
I'm growing fond of the initial idea proposed by @lbguilherme where we treat the event loop as a black box.
This is close to how we treat the GC: allocate N bytes and return a pointer, I don't care how you do it. We can switch the implementation and everything's working as long as the interface is implemented, without leaking any internal details.
It would be great if we could do the same for non-blocking IO operations!
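For illustration only, a blackbox interface in the spirit of the GC analogy might look roughly like this; the class name and method set are hypothetical, not a concrete proposal:

```crystal
require "socket"

# Hypothetical sketch: IO objects call into whatever loop is active and never
# learn whether it is readiness-based (epoll/kqueue) or completion-based
# (IOCP/io_uring).
abstract class BlackBoxEventLoop
  # Reads into `slice`, suspending the calling fiber until data is available;
  # returns the number of bytes read.
  abstract def read(io : IO::FileDescriptor, slice : Bytes) : Int32

  # Writes `slice`, suspending the calling fiber until the write completes;
  # returns the number of bytes written.
  abstract def write(io : IO::FileDescriptor, slice : Bytes) : Int32

  # Accepts a connection on the listening socket, suspending until one arrives.
  abstract def accept(socket : ::Socket) : ::Socket::Handle?

  # Suspends the calling fiber for the given duration.
  abstract def sleep(duration : Time::Span) : Nil
end
```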
I also like your Event + EventLoop abstraction @straight-shoota. It's simple and concise, and having an actual `Event` object is likely a good idea!
The problem is that it's leaking implementation details from the EventLoop into the IO objects themselves, so depending on the target triple we either have `IO::Overlapped` or `IO::Evented`, each of which assumes a specific mechanism for the event loop, which may not hold: at a quick glance io_uring seems closer to kqueue and IOCP than to epoll, for example.
> It would be great if we could do the same for non-blocking IO operations!
One problem is that the set of non-blocking operations is not the same everywhere, meaning the interface would need to either be so pessimal that it may as well not exist, or contain almost everything under the sun. There are also some operations that are simply impossible or really hard to implement on certain platforms, like wait_readable or wait_writable on Windows. Meanwhile they sort of have to exist on Linux in some way, as they are the idiomatic way to do a lot of interaction with the operating system and close-to-operating-system level stuff.
Agreed on the `Event` being quite implementation-dependent. It is definitely possible to build an event loop that has no reason at all to keep track of that, because it is gotten for free by the general functionality of everything, but instead has to keep track of, for example, received events until they are processed (example, half-functioning together with the nested_scheduler shard: Fiber is monkey-patched to keep the received event around).
Would it be so bad to have N implementations? If there are repetitions, could they not be abstracted behind the event loop / non-blocking operations, instead of leaking their details?
Replying to myself: let's learn about kqueue and IOCP in addition to epoll :books:
> I'm growing fond of the initial idea proposed by @lbguilherme where we treat the event loop as a black box.
Yeah, this is probably the best way forward. It should lead to a pretty straightforward interface. It could grow quite big though, considering that we may need evented implementations for a lot of actions (as mentioned in https://github.com/crystal-lang/crystal/issues/10766#issuecomment-1906404421).
I'm not sure there are that many methods. I think most are listed in the issue description. The difference being that I'd pass objects (File, IO::FileDescriptor, Socket) instead of an `fd`, which would be Linux-specific.
I'm more concerned about some syscalls allowing nonblocking IO over regular disk files (IOCP, io_uring) while others don't (kqueue, epoll), and that we could have both io_uring and epoll in the same application, so the default File#blocking would depend on the current event loop object.
> The difference being that I'd pass objects (File, IO::FileDescriptor, Socket) instead of an `fd`, which would be Linux-specific.
Yes, that would probably be necessary for a generic API. It also allows specialising implementations based on the type of IO.
> I'm more concerned about some syscalls allowing nonblocking IO over regular disk files (IOCP, io_uring) while others don't (kqueue, epoll), and that we could have both io_uring and epoll in the same application, so the default File#blocking would depend on the current event loop object.
That could be quite nasty. It would probably be better to have a standardized default.
> I'm more concerned about some syscalls allowing nonblocking IO over regular disk files (IOCP, io_uring) while others don't (kqueue, epoll), and that we could have both io_uring and epoll in the same application, so the default File#blocking would depend on the current event loop object.
I don't know about Windows, but `File#blocking` is not an issue for disk IO on Linux, even under io_uring. This is because O_NONBLOCK literally has no effect on block-based IO. Blocking as a term in Linux is based on something being either readable or writable, and block-based IO is generally speaking always that. There may be some cases where it isn't (if the device is not mounted?), but as a rule it will be. It will be in a way that still takes a (comparatively speaking) really long time to execute, as execution will still involve requests hitting the actual device, but it will not respect any O_NONBLOCK flags.
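A quick way to see this on Linux (the path is just an example):

```crystal
# O_NONBLOCK on a regular file is accepted but ignored for the actual reads:
# the read below still performs the full disk access and returns data (or EOF),
# it never fails with EAGAIN the way a socket with no pending data would.
file = File.open("/etc/hostname")
file.blocking = false # sets O_NONBLOCK via fcntl under the hood
puts file.gets_to_end # still blocks the thread until the disk read completes
file.close
```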
O_NONBLOCK is actually an issue when working with sockets though, as any read or write will behave differently - and that includes reads through io_uring. To make io_uring work as an event loop there are two options: either you submit reads and writes directly, and the kernel handles everything around it and emits an event when the operation is done, OR the ring can be used very similarly to how epoll is used - you set up a poll waiting for readable/writable, and then you submit the operation only once the fd is ready. The former way is simpler to work with in many ways and has less overhead and fewer issues (no thundering herd issues, for example), but it does not work with O_NONBLOCK - if it is set, the read will just return EAGAIN and you'll have to fall back to the poll approach to handle it.
Handling asynchronous disk-based IO on only some systems may have a bunch of other issues coming with it though (what happens if the file is closed while a request is in flight? etc).
Even if we have 2 event-loops compiled, there should only be one running: we compile both io_uring and epoll because the former may not be available or be disabled (RHEL).
Now, it seems that O_NONBLOCK is dependent on the syscall used by the event loop. We need it for kqueue and epoll (wait readable/writable) but we don't want it for io_uring (wait for completion), so I guess that means `open` methods on the event loop? Windows would set OVERLAPPED (can only be set at creation), kqueue/epoll would set O_NONBLOCK (can be changed anytime), while io_uring won't.
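A hand-wavy sketch of what such per-loop `open` hooks could look like on Linux (class names made up; the Windows variant would pass FILE_FLAG_OVERLAPPED to CreateFile instead):

```crystal
class EpollEventLoop
  def open(path : String, flags : Int32, mode : Int32) : Int32
    fd = LibC.open(path, flags, mode)
    return fd if fd < 0
    # Readiness-based loop: set O_NONBLOCK so read/write return EAGAIN and the
    # fiber can wait for readiness instead of blocking the thread.
    LibC.fcntl(fd, LibC::F_SETFL, LibC.fcntl(fd, LibC::F_GETFL) | LibC::O_NONBLOCK)
    fd
  end
end

class IoUringEventLoop
  def open(path : String, flags : Int32, mode : Int32) : Int32
    # Completion-based loop: leave the fd blocking; the ring reports completion
    # of the whole operation, so O_NONBLOCK is deliberately not set.
    LibC.open(path, flags, mode)
  end
end
```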
And... we might want to revisit #14255 before we release 1.12: it might not be a good idea to expose `blocking` on File for the time being.
> Handling asynchronous disk-based IO on only some systems may have a bunch of other issues coming with it though (what happens if the file is closed while a request is in flight? etc).
I can't think of anything specific aside from a regular concurrency issue: the request will succeed or fail (when async, the resumed fiber shall raise the error).
I just noticed `Process#wait` is also on the event loop with IOCP.
Currently, the libevent-based loop manages a single event object per file descriptor and thread for read/write events.
Fibers are added to waiting lists (one for each event type). And when a fd is ready, the first waiting fiber gets scheduled.
According to Libevent documentation, it's fine to have multiple events per fd. And we're partially doing that when it's used from different threads. But I'm wondering why we're multiplexing per thread instead of having a separate event for each fiber. I think that would remove quite a bit of complexity around managing fiber queues as thread-local data. This topic was previously mentioned in https://github.com/crystal-lang/rfcs/pull/2#discussion_r1480522092
I suppose pushing more events to the event loop backend for it to manage might have some performance impact. But then this is only really relevant when you're using the same file descriptor from multiple fibers concurrently. I don't expect this to be very common as it poses synchronization issues. Concurrency can also be managed by having one fiber do the IO operations and have it communicate via channels with producers/consumers.
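Concretely, that pattern could look like this (a sketch; the endpoint and channel capacity are arbitrary):

```crystal
require "socket"

# One fiber owns the socket; producers never touch the fd, so only a single
# fiber is ever registered with the event loop for this descriptor.
writer = Channel(String).new(32)

spawn do
  socket = TCPSocket.new("localhost", 1234)
  while message = writer.receive?
    socket << message
  end
  socket.close
end

10.times do |i|
  spawn { writer.send("message #{i}\n") }
end
```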
> But I'm wondering why we're multiplexing per thread instead of having a separate event for each fiber.
This is to protect us from thundering herd issues. What happens if you have many events set up on a file descriptor and the fd is activated is that ALL WAITERS on that file descriptor are woken up. Meaning that if you have 10 fibers waiting on a fd and get an event, all of them are woken up, even if only one of them can process anything. Then all of them have to set up another poll on the fd, adding at least 18 more syscalls than necessary. For many loads this ends up as a significant overhead.
> I don't expect this to be very common as it poses synchronization issues
For `read` and `write`, yes - sometimes, depending on the sizes you work with. Not so for `accept`: an easy way to set up a server that provides backpressure (i.e. one that is resistant against traffic spikes) is to pre-spawn a bunch of worker fibers that then each do `accept` on a single socket in a loop.
> Concurrency can also be managed by having one fiber do the IO operations and have it communicate via channels with producers/consumers
Channels are slow and introduce a lot of overhead. Or at least that is the traditional answer. Perhaps hardware has improved enough to challenge old truths.
I'm really just wondering if this more complex setup is reasonable, when typically there will only be one fiber waiting per read or write event. Feels like we could avoid a lot of overhead without it.
And I suppose we could have more chances at working around this problem if we dump libevent and directly interact with the OS libraries, which should provide more options on this (at least epoll does with EPOLLONESHOT and EPOLLEXCLUSIVE).
`accept` is a good point, though. However, a reasonable workaround for this scenario could be for every listening fiber to bind its own socket with SO_REUSEPORT.
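A sketch of that workaround (port and fiber count are arbitrary):

```crystal
require "socket"

# Each acceptor fiber binds its own listening socket to the same port with
# SO_REUSEPORT; the kernel hands every incoming connection to exactly one of
# them, so no accept wakes more fibers than necessary.
8.times do
  spawn do
    server = TCPServer.new("0.0.0.0", 8080, reuse_port: true)
    loop do
      client = server.accept
      spawn do
        client.puts "hello"
        client.close
      end
    end
  end
end

sleep
```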
Btw. I found this blog post on the topic: https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12 (it even includes a video from the engineer guy)
If I read the epoll manpage correctly, a fd can only appear once in an epoll instance, which means the event loop is in charge of supporting multiple events for one fd. I thought kqueue also worked that way, but reading the manpage again, I'm not sure anymore.
At least for epoll the ball is mostly in libevent2's court, and its documentation doesn't state anything. Reading the code... isn't helping. My guess is that it wakes up everything, which would be another good reason to implement our own event loops :grin:
Note: epoll has a flag (EPOLLEXCLUSIVE) since Linux 4.5 (2016) to avoid thundering herds when there are multiple epoll instances waiting on the same fd, by only triggering one instance (of course libevent doesn't use it :sob:). Again, I thought kqueue also had this, but it looks like I was mistaken :shrug:
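For illustration, this is roughly what registering a listening socket with EPOLLEXCLUSIVE could look like if we talked to epoll directly (hand-rolled binding, not part of the stdlib; the port is arbitrary):

```crystal
require "socket"

lib LibEpoll
  @[Packed]
  struct EpollEvent
    events : UInt32
    data : UInt64 # epoll_data_t, simplified to its u64 member
  end

  fun epoll_create1(flags : Int32) : Int32
  fun epoll_ctl(epfd : Int32, op : Int32, fd : Int32, event : EpollEvent*) : Int32
end

EPOLLIN        = 0x001_u32
EPOLLEXCLUSIVE = 0x10000000_u32 # 1 << 28, Linux >= 4.5
EPOLL_CTL_ADD  = 1

server = TCPServer.new("0.0.0.0", 8080)
epfd = LibEpoll.epoll_create1(0)

# Only one of the epoll instances registered like this gets woken per event.
event = LibEpoll::EpollEvent.new(events: EPOLLIN | EPOLLEXCLUSIVE, data: server.fd.to_u64)
LibEpoll.epoll_ctl(epfd, EPOLL_CTL_ADD, server.fd, pointerof(event))
```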
It seems we cannot use the same handle in multiple completion ports with IOCP.
https://learn.microsoft.com/en-us/windows/win32/fileio/createiocompletionport
> A handle can be associated with only one I/O completion port, and after the association is made, the handle remains associated with that I/O completion port until it is closed.
If we wanted one port per thread, we'd need to dup the handle.
That doesn't mean we can't have separate event loops per thread. It just means that the IOCP implementation needs to use a single, global completion port in order to be able to use handles across threads.
UPDATE: At least that's my understanding of it. Can't say I'm 100% confident.
How does it work today with `preview_mt`, where we have one event loop per thread? Maybe the Windows port doesn't support `preview_mt`?
Yeah, I'm wondering about that as well. `preview_mt` generally works on Windows. But maybe nobody has tried using the same socket from multiple threads yet? 🤷
I have created an RFC draft in the dedicated repository: https://github.com/crystal-lang/rfcs/pull/7 Let's continue this discussion there.
LibEvent is all about detecting when a file/socket is ready to be read or written without blocking. A non-blocking read from a socket is thus a sequence of two steps: first wait for it to be readable, then read from it without blocking. If the system happens to return an `EWOULDBLOCK` error, just try waiting for it to be readable again. This is the reason `IO::Evented` exists. For example:

The problem is that not all async IO works like that. Windows' Overlapped IO works by submitting a "read request" to the system and then waiting for an event when it completes, in a single step. Linux's io_uring is very similar.

It is not viable to implement efficient async IO on Windows or on Linux's io_uring on top of the current `IO::Evented` interface. Please refer to these comments by ysbaddaden and RX14 as well: https://github.com/crystal-lang/crystal/pull/8651.

I propose changing `Crystal::EventLoop`'s interface to this:

All of these can be implemented for Unix either by blocking or by using `wait_readable`/`wait_writable` and calling into LibC when ready. This interface allows other async implementations to exist without changes to files outside `Crystal::System`.

I believe this makes the Windows implementation cleaner as well. What do you think @straight-shoota?