lbguilherme opened 3 years ago
Are you suggesting to move IO operations directly into the event loop? Not sure that's a wise idea. I think they should stay confined to the respective IO implementations because this tends to be very application-specific. There are differences between IO operations on a socket vs. on a file, even if they technically use the same fd interface and functions.
For the win32 port I think I have this worked out quite well. The IOCP variant of the event loop forms the basis, and I introduced an `IO::Overlapped` module similar to `IO::Evented` which handles and abstracts the overlapped IO for the socket implementation, and integrates with the event loop.
There is some overlap with `IO::Evented`, though, and I believe some of the interfaces could be more refined.
I understand the integration of io_uring is going to be a bit more complex, because we essentially need to support it together with libevent in a single program. With libevent and iocp this is not an issue, because they're confined to POSIX and win32 platforms, respectively.
```crystal
# Runs the event loop.
def self.run_once : Nil
end
```
Eh, I would consider `run_once` to be just an implementation detail of libevent. What would be needed is a loop-specific implementation of `Scheduler#reschedule`. It really is not advisable to view the event loop separately from the scheduler - especially not in the case of io_uring, where the same syscall is used for both submitting and fetching events, and you can do both in the same call. (EDIT to clarify: there are certainly some parts of `reschedule` that still belong in `Scheduler`.)
> I think they should stay confined to the respective IO implementations because this tends to be very application-specific.
Problem is that they simultaneously also end up very event-loop specific. Always mixing everything into the different IO classes is not super great. It could be that the nicest structure available is to have a second object sitting between the IO classes and what we now call the event loop, and route all interactions with the event loop through that.
EDIT: BTW, the refactorings already done for sockets related to the Windows port are a good step in the right direction here; it is really nice to have everything collected in the same place, even if it is in a module and thus not as self-contained as it could be.
Yeah, the typical system-specific implementations are based on modules, because there has only ever been a single viable strategy for any target triple. But when we need to select between libevent and io_uring at runtime, that no longer works. We need a different strategy then.
I suppose it shouldn't be a big change to go from `system_*` methods in a module to a dedicated type that encapsulates all the IO operations and event loop interaction. The API wouldn't need to change much for that.
This isn't necessarily the only option, but it should be pretty doable.
For reference, this is the relevant event loop API in my IOCP port:
```crystal
module Crystal::EventLoop
  # Runs the event loop.
  def self.run_once : Nil
  end

  # Reinitializes the event loop after a fork.
  def self.after_fork : Nil
  end

  def self.enqueue(event : Event)
  end

  def self.dequeue(event : Event)
  end

  # Create a new resume event for a fiber.
  def self.create_resume_event(fiber : Fiber) : Crystal::Event
  end
end

struct Crystal::Event
  getter fiber
  getter wake_at

  def initialize(@fiber : Fiber, @wake_at = Time.monotonic)
  end

  # Frees the event
  def free : Nil
  end

  def add(time_span : Time::Span) : Nil
  end
end
```
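To give a sense of how these pieces fit together, here is a sketch (my illustration, not code from the port) of how a fiber-level sleep could sit on top of this API, assuming the existing `Crystal::Scheduler.reschedule` and `Fiber.current`:

```crystal
# Sketch only: glosses over event lifetime/reuse.
def sleep(duration : Time::Span) : Nil
  event = Crystal::EventLoop.create_resume_event(Fiber.current)
  event.add(duration)           # arm the event to fire after `duration`
  Crystal::Scheduler.reschedule # suspend; the loop resumes the fiber once the event fires
end
```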
I'm growing fond of the initial idea proposed by @lbguilherme where we treat the event loop as a black box.
This is close to how we treat the GC: allocate N bytes and return a pointer, I don't care how you do it. We can switch the implementation and everything's working as long as the interface is implemented, without leaking any internal details.
It would be great if we could do the same for non-blocking IO operations!
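For illustration only, a blackbox interface in the spirit of the GC analogy might look roughly like this; the class name and method set are hypothetical, not a concrete proposal:

```crystal
require "socket"

# Hypothetical sketch: IO objects call into whatever loop is active and never
# learn whether it is readiness-based (epoll/kqueue) or completion-based
# (IOCP/io_uring).
abstract class BlackBoxEventLoop
  # Reads into `slice`, suspending the calling fiber until data is available;
  # returns the number of bytes read.
  abstract def read(io : IO::FileDescriptor, slice : Bytes) : Int32

  # Writes `slice`, suspending the calling fiber until the write completes;
  # returns the number of bytes written.
  abstract def write(io : IO::FileDescriptor, slice : Bytes) : Int32

  # Accepts a connection on the listening socket, suspending until one arrives.
  abstract def accept(socket : ::Socket) : ::Socket::Handle?

  # Suspends the calling fiber for the given duration.
  abstract def sleep(duration : Time::Span) : Nil
end
```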
I also like your Event + EventLoop abstraction @straight-shoota. It's simple and concise, and having an actual `Event` object is likely a good idea!
The problem is that it's leaking implementation details from the EventLoop into the IO objects themselves, so depending on the target triple we either have `IO::Overlapped` or `IO::Evented`, each of which assumes a specific mechanism for the event loop, which may not hold: at a quick glance io_uring seems closer to kqueue and IOCP than to epoll, for example.
> It would be great if we could do the same for non-blocking IO operations!
One problem is that the set of non-blocking operations is not the same everywhere, meaning the interface would need to either be so pessimal that it may as well not exist, or contain almost everything under the sun. There are also some operations that are simply impossible or really hard to implement on certain platforms, like wait_readable or wait_writable on Windows. Meanwhile they sort of have to exist on Linux in some way, as they are the idiomatic way to do a lot of interaction with the operating system and close-to-operating-system level stuff.
Agreed on the `Event` being quite implementation-dependent. It is definitely possible to build an event loop that has no reason at all to keep track of that, because it is gotten for free by the general functionality of everything, but instead has to keep track of, for example, received events until they are processed (example, half-functioning together with the nested_scheduler shard: Fiber is monkey-patched to keep the received event around).
Would it be so bad to have N implementations? If there are repetitions, could they not be abstracted behind the event loop / non-blocking operations, instead of leaking their details?
Replying to myself: let's learn about kqueue and IOCP in addition to epoll :books:
> I'm growing fond of the initial idea proposed by @lbguilherme where we treat the event loop as a black box.
Yeah, this is probably the best way forward. It should lead to a pretty straightforward interface. It could grow quite big though, considering that we may need evented implementations for a lot of actions (as mentioned in https://github.com/crystal-lang/crystal/issues/10766#issuecomment-1906404421).
I'm not sure there are that many methods. I think most are listed in the issue description. The difference being that I'd pass objects (File, IO::FileDescriptor, Socket) instead of an `fd`, which would be Linux-specific.
I'm more concerned about some syscalls allowing nonblocking IO over regular disk files (IOCP, io_uring) while others don't (kqueue, epoll), and that we could have both io_uring and epoll in the same application, so the default File#blocking would depend on the current event loop object.
> The difference being that I'd pass objects (File, IO::FileDescriptor, Socket) instead of an `fd`, which would be Linux-specific.
Yes, that would probably be necessary for a generic API. It also allows specialising implementations based on the type of IO.
> I'm more concerned about some syscalls allowing nonblocking IO over regular disk files (IOCP, io_uring) while others don't (kqueue, epoll), and that we could have both io_uring and epoll in the same application, so the default File#blocking would depend on the current event loop object.
That could be quite nasty. It would probably be better to have a standardized default.
> I'm more concerned about some syscalls allowing nonblocking IO over regular disk files (IOCP, io_uring) while others don't (kqueue, epoll), and that we could have both io_uring and epoll in the same application, so the default File#blocking would depend on the current event loop object.
I don't know about Windows, but `File#blocking` is not an issue for disk IO on Linux, even under io_uring. This is because O_NONBLOCK literally has no effect on block-based IO. Blocking as a term in Linux is based on something being either readable or writable, and block-based IO is generally speaking always that. There may be some cases where it isn't (if the device is not mounted?), but as a rule it will be. It will be in a way that still takes a (comparatively speaking) really long time to execute, as execution will still involve requests hitting the actual device, but it will not respect any O_NONBLOCK flags.
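A quick way to see this on Linux (the path is just an example):

```crystal
# O_NONBLOCK on a regular file is accepted but ignored for the actual reads:
# the read below still performs the full disk access and returns data (or EOF),
# it never fails with EAGAIN the way a socket with no pending data would.
file = File.open("/etc/hostname")
file.blocking = false # sets O_NONBLOCK via fcntl under the hood
puts file.gets_to_end # still blocks the thread until the disk read completes
file.close
```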
O_NONBLOCK is actually an issue when working with sockets though, as any read or write will behave differently - and that includes reads through io_uring. To make io_uring work as an event loop there are two options: either you submit reads and writes directly, and the kernel handles everything around it and emits an event when the operation is done, OR the ring can be used very similarly to how epoll is used - you set up a poll waiting for readable/writable, and then you submit the operation only once the fd is ready. The former way is simpler to work with in many ways and has less overhead and fewer issues (no thundering herd issues, for example), but it does not work with O_NONBLOCK - if it is set, the read will just return EAGAIN and you'll have to fall back to the poll approach to handle it.
Handling asynchronous disk-based IO on only some systems may have a bunch of other issues coming with it though (what happens if the file is closed while a request is in flight? etc).
Even if we have 2 event-loops compiled, there should only be one running: we compile both io_uring and epoll because the former may not be available or be disabled (RHEL).
Now, it seems that O_NONBLOCK is dependent on the syscall used by the event loop. We need it for kqueue and epoll (wait readable/writable) but we don't want it for io_uring (wait for completion), so I guess that means `open` methods on the event loop? Windows would set OVERLAPPED (can only be set at creation), kqueue/epoll would set O_NONBLOCK (can be changed anytime), while io_uring won't.
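A hand-wavy sketch of what such per-loop `open` hooks could look like on Linux (class names made up; the Windows variant would pass FILE_FLAG_OVERLAPPED to CreateFile instead):

```crystal
class EpollEventLoop
  def open(path : String, flags : Int32, mode : Int32) : Int32
    fd = LibC.open(path, flags, mode)
    return fd if fd < 0
    # Readiness-based loop: set O_NONBLOCK so read/write return EAGAIN and the
    # fiber can wait for readiness instead of blocking the thread.
    LibC.fcntl(fd, LibC::F_SETFL, LibC.fcntl(fd, LibC::F_GETFL) | LibC::O_NONBLOCK)
    fd
  end
end

class IoUringEventLoop
  def open(path : String, flags : Int32, mode : Int32) : Int32
    # Completion-based loop: leave the fd blocking; the ring reports completion
    # of the whole operation, so O_NONBLOCK is deliberately not set.
    LibC.open(path, flags, mode)
  end
end
```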
And... we might want to revisit #14255 before we release 1.12: it might not be a good idea to expose `blocking` on File for the time being.
> Handling asynchronous disk-based IO on only some systems may have a bunch of other issues coming with it though (what happens if the file is closed while a request is in flight? etc).
I can't think of anything specific aside from a regular concurrency issue: the request will succeed or fail (when async, the resumed fiber shall raise the error).
I just noticed `Process#wait` is also on the event loop with IOCP.
Currently, the libevent-based loop manages a single event object per file descriptor and thread for read/write events.
Fibers are added to waiting lists (one for each event type). And when a fd is ready, the first waiting fiber gets scheduled.
According to Libevent documentation, it's fine to have multiple events per fd. And we're partially doing that when it's used from different threads. But I'm wondering why we're multiplexing per thread instead of having a separate event for each fiber. I think that would remove quite a bit of complexity around managing fiber queues as thread-local data. This topic was previously mentioned in https://github.com/crystal-lang/rfcs/pull/2#discussion_r1480522092
I suppose pushing more events to the event loop backend for it to manage might have some performance impact. But then this is only really relevant when you're using the same file descriptor from multiple fibers concurrently. I don't expect this to be very common as it poses synchronization issues. Concurrency can also be managed by having one fiber do the IO operations and have it communicate via channels with producers/consumers.
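Concretely, that pattern could look like this (a sketch; the endpoint and channel capacity are arbitrary):

```crystal
require "socket"

# One fiber owns the socket; producers never touch the fd, so only a single
# fiber is ever registered with the event loop for this descriptor.
writer = Channel(String).new(32)

spawn do
  socket = TCPSocket.new("localhost", 1234)
  while message = writer.receive?
    socket << message
  end
  socket.close
end

10.times do |i|
  spawn { writer.send("message #{i}\n") }
end
```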
> But I'm wondering why we're multiplexing per thread instead of having a separate event for each fiber.
This is to protect us from thundering herd issues. What happens if you have many events set up on a file descriptor and the fd is activated is that ALL WAITERS on that file descriptor are woken up. Meaning that if you have 10 fibers waiting on a fd and get an event, all of them are woken up, even if only one of them can process anything. Then all of them have to set up another poll on the fd, adding at least 18 more syscalls than necessary. For many loads this ends up as a significant overhead.
> I don't expect this to be very common as it poses synchronization issues
For `read` and `write`, yes - sometimes, depending on the sizes you work with. Not so for `accept`: an easy way to set up a server that provides backpressure (i.e. one that is resistant against traffic spikes) is to pre-spawn a bunch of worker fibers that then each do `accept` on a single socket in a loop.
> Concurrency can also be managed by having one fiber do the IO operations and have it communicate via channels with producers/consumers
Channels are slow and introduce a lot of overhead. Or at least that is the traditional answer. Perhaps hardware has improved enough to challenge old truths.
I'm really just wondering if this more complex setup is reasonable, when typically there will only be one fiber waiting per read or write event. Feels like we could avoid a lot of overhead without it.
And I suppose we could have more chances at working around this problem if we dump libevent and directly interact with the OS libraries, which should provide more options on this (at least epoll does with EPOLLONESHOT and EPOLLEXCLUSIVE).
`accept` is a good point, though. However, a reasonable workaround for this scenario could be for every listening fiber to bind its own socket with SO_REUSEPORT.
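A sketch of that workaround (port and fiber count are arbitrary):

```crystal
require "socket"

# Each acceptor fiber binds its own listening socket to the same port with
# SO_REUSEPORT; the kernel hands every incoming connection to exactly one of
# them, so no accept wakes more fibers than necessary.
8.times do
  spawn do
    server = TCPServer.new("0.0.0.0", 8080, reuse_port: true)
    loop do
      client = server.accept
      spawn do
        client.puts "hello"
        client.close
      end
    end
  end
end

sleep
```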
Btw. I found this blog post on the topic: https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12 (it even includes a video from the engineer guy)
If I read the epoll manpage correctly, a fd can only appear once in an epoll instance, which means the event loop is in charge of supporting multiple events for one fd. I thought kqueue also worked that way, but reading the manpage again, I'm not sure anymore.
At least for epoll the ball is mostly in libevent2's court, and its documentation doesn't state anything. Reading the code... isn't helping. My guess is that it wakes up everything, which would be another good reason to implement our own event loops :grin:
Note: epoll has a flag (EPOLLEXCLUSIVE) since Linux 4.5 (2016) to avoid thundering herds when there are multiple epoll instances waiting on the same fd, by only triggering one instance (of course libevent doesn't use it :sob:). Again, I thought kqueue also had this, but it looks like I was mistaken :shrug:
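For illustration, this is roughly what registering a listening socket with EPOLLEXCLUSIVE could look like if we talked to epoll directly (hand-rolled binding, not part of the stdlib; the port is arbitrary):

```crystal
require "socket"

lib LibEpoll
  @[Packed]
  struct EpollEvent
    events : UInt32
    data : UInt64 # epoll_data_t, simplified to its u64 member
  end

  fun epoll_create1(flags : Int32) : Int32
  fun epoll_ctl(epfd : Int32, op : Int32, fd : Int32, event : EpollEvent*) : Int32
end

EPOLLIN        = 0x001_u32
EPOLLEXCLUSIVE = 0x10000000_u32 # 1 << 28, Linux >= 4.5
EPOLL_CTL_ADD  = 1

server = TCPServer.new("0.0.0.0", 8080)
epfd = LibEpoll.epoll_create1(0)

# Only one of the epoll instances registered like this gets woken per event.
event = LibEpoll::EpollEvent.new(events: EPOLLIN | EPOLLEXCLUSIVE, data: server.fd.to_u64)
LibEpoll.epoll_ctl(epfd, EPOLL_CTL_ADD, server.fd, pointerof(event))
```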
It seems we cannot use the same handle in multiple completion ports with IOCP.
https://learn.microsoft.com/en-us/windows/win32/fileio/createiocompletionport
> A handle can be associated with only one I/O completion port, and after the association is made, the handle remains associated with that I/O completion port until it is closed.
If we wanted one port per thread, we'd need to dup the handle.
That doesn't mean we can't have separate event loops per thread. It just means that the IOCP implementation needs to use a single, global completion port in order to be able to use handles across threads.
UPDATE: At least that's my understanding of it. Can't say I'm 100% confident.
How does it work today with `preview_mt`, where we have one event loop per thread? Maybe the Windows port doesn't support `preview_mt`?
Yeah, I'm wondering about that as well. `preview_mt` generally works on Windows. But maybe nobody has tried using the same socket from multiple threads yet? 🤷
I have created an RFC draft in the dedicated repository: https://github.com/crystal-lang/rfcs/pull/7 Let's continue this discussion there.
LibEvent is all about detecting when a file/socket is ready to be read or written without blocking. A non-blocking read from a socket is thus a sequence of two steps: first wait for it to be readable, then read from it without blocking. If the system happens to return an `EWOULDBLOCK` error, just try waiting for it to be readable again. This is the reason `IO::Evented` exists. For example:

The problem is that not all async IO works like that. Windows' Overlapped IO works by submitting a "read request" to the system and then waiting for an event when it completes, in a single step. Linux's io_uring is very similar.

It is not viable to implement efficient async IO on Windows or on Linux's io_uring on top of the current `IO::Evented` interface. Please refer to these comments by ysbaddaden and RX14 as well: https://github.com/crystal-lang/crystal/pull/8651.

I propose changing `Crystal::EventLoop`'s interface to this:

All of these can be implemented for Unix either by blocking or by using `wait_readable`/`wait_writable` and calling into LibC when ready. This interface allows other async implementations to exist without changes to files outside `Crystal::System`.

I believe this makes the Windows implementation cleaner as well. What do you think @straight-shoota?