I was asking in the parent issue if libuv could be used, with the thread-safe uv_async.

uv_async isn't useful.
@straight-shoota Crystal::Event and Crystal::EventLoop should go, and be internal to Crystal::System. There needs to be a concept of starting the event loop, and resuming the event loop when there's no work to do, but read/write/other blocking operations that submit events to the event loop should be inside Crystal::System with no fixed API. Attempting to provide a unified event loop abstraction to the outside world doesn't seem worthwhile to me; instead, just provide a blocking IO abstraction and make the suspension and resumption of fibers entirely internal, and entirely different internally on each platform.

I'm not sure how this fits in with @ysbaddaden's work, and it shouldn't be necessary to port fibers. Fibers on windows should be working with channels and merged into master before any of the nonblocking IO stuff is even thought about. They are separate concerns.
To be clear: I'm proposing Crystal::Event and Crystal::EventLoop and all their usages are moved into Crystal::System such that windows can pursue an entirely different solution, probably not based on callbacks at all, instead directly resuming fibers from the event loop.

But this very much depends on the scheduler design from @ysbaddaden which I haven't seen. I'm pretty in the dark on the design for multicore which @ysbaddaden is proposing and how that fits in with my and @bcardiff's work. This is why I propose not porting evented IO at all on windows yet. It may end up being counterproductive and chasing a moving target.
If we can port fibers without events, that's totally fine by me. Then we just need to further refactor Fiber and Scheduler so that a platform-specific implementation doesn't depend on Crystal::Event?
@straight-shoota you can stub it out for now. The only connections between the scheduler and the event loop are Crystal::EventLoop.resume, sleep (which is unused on windows) and yield. I'm not 100% sure, but I think yield can be modeled as exactly the same as enqueue if the scheduler isn't present.

I advise you to do these tweaks via {% if flag?(:win32) %} in scheduler.cr, not to add any more Crystal::System for the scheduler.
Some notes:
- Fiber: only depends on anonymous memory maps (posix: mmap, win32: file mapping) for allocating stacks, and arch-specific context creation/swap functions (warning: context is arch+os specific on win32); see the sketch after these notes;
- Crystal::EventLoop: the stdlib has few expectations that can be changed for something more abstract (add a file read resume event, add a socket write resume event, ...). The monothreaded Crystal::Scheduler only cares about resuming the event loop fiber (blocking). My multithreaded Crystal::Scheduler expects a safe way to run the event loop (no fiber resume) once and nonblocking (with libevent: event_base_loop(base, EVLOOP_ONCE | EVLOOP_NONBLOCK));
- Crystal::Scheduler: depends on Fiber being correctly implemented, and resumes the event loop when its queue is empty; until win32 has a working event loop, maybe it could just exit (nothing to do);
- Thread, Thread::Mutex and Thread::ConditionVariable: they're not required to implement Fiber, but I'd still encourage having real implementations, or just skipping them, instead of adding more stubs.
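To make the stack-allocation dependency concrete, here is a minimal sketch of the two anonymous-memory paths; STACK_SIZE and allocate_stack are illustrative names, not the stdlib API, and the win32 branch uses VirtualAlloc for brevity (the note above mentions file mappings as the win32 analogue):

```crystal
# Sketch: platform-specific anonymous memory for fiber stacks.
STACK_SIZE = 8 * 1024 * 1024

{% if flag?(:win32) %}
  lib LibKernel32
    MEM_COMMIT     = 0x1000_u32
    MEM_RESERVE    = 0x2000_u32
    PAGE_READWRITE =    0x4_u32

    fun VirtualAlloc(address : Void*, size : LibC::SizeT, type : UInt32, protect : UInt32) : Void*
  end

  def allocate_stack : Void*
    LibKernel32.VirtualAlloc(nil, LibC::SizeT.new(STACK_SIZE),
      LibKernel32::MEM_RESERVE | LibKernel32::MEM_COMMIT, LibKernel32::PAGE_READWRITE)
  end
{% else %}
  def allocate_stack : Void*
    # flag names follow Crystal's POSIX LibC bindings
    LibC.mmap(nil, LibC::SizeT.new(STACK_SIZE),
      LibC::PROT_READ | LibC::PROT_WRITE,
      LibC::MAP_PRIVATE | LibC::MAP_ANON, -1, 0)
  end
{% end %}
```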
I have a few more changes pending that I'd like to push:

- a Fiber::Context struct for holding the stack top pointer and a resumable flag (preventing a double resume of a fiber), with the context functions changed accordingly. Maybe later it could contain some bytes to save the current CPU registers (for a Crystal GC stopping the world);
- a Fiber::StackPool struct (mutex based).

I got it working so far, at least in theory. I still need to get the stack swap on win32...
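A minimal sketch of how that Fiber::Context could look; the field names and the Bool flag are assumptions (a real implementation might prefer an integer flag that the context-swap assembly can write directly):

```crystal
# Sketch: the per-fiber context described above. Holds only what the
# swap functions need: where the stack top was saved, and whether the
# fiber may be resumed (guarding against a double resume).
struct Fiber::Context
  property stack_top : Void*
  property resumable : Bool

  def initialize(@stack_top = Pointer(Void).null, @resumable = false)
  end
end
```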
add a file read resume event, add a socket write resume event
That doesn't work on windows, because you're not waiting for an FD to become readable or writable; you're waiting for a specific read or write IO to finish. You then need to resume the specific fiber that sent that IO.
event_base_loop(base, EVLOOP_ONCE | EVLOOP_NONBLOCK)
Where is the blocking sleep performed, then, when there's nothing to do?
maybe it could just exit (nothing to do);
agreed; probably exit with a warning, since with just fibers and channels it should never happen? I haven't proved it to myself, but it seems logical.
I have a few more changes pending that I'd like to push
They look like good changes.
That doesn't work on windows
Oh, then the event loop is target specific. That's wonderful. I still wish we could try libevent (at least for timers and sockets), until we dig into arch specifics (IOCP, kqueue and epoll).
Where is the blocking sleep performed, then, when there's nothing to do?
Threads will spin trying to steal fibers / run the event loop, then give up and park themselves, unless it's the last thread, which should run the event loop as blocking; along with a mechanism to wake parked threads when fibers are enqueued (i.e. mutex + condition variable).
I still wish we could try libevent
I do too, but unfortunately pipes are pretty essential to everything from basic process spawning to signal handling. (Yes, windows has signals too.)
which should run the event loop as blocking;
ah, so blocking/nonblocking is optional, makes sense
FWIW libevent does mention IOCP, though with seemingly scant documentation: https://stackoverflow.com/questions/8042796/libev-on-windows ... hmm, might not be enough...
@RX14 and @ysbaddaden Regarding the Windows event loop I think this is very interesting: https://github.com/piscisaureus/wepoll and just wanted to leave it here as a reference before I forget it.
In Rust, mio is the de facto standard event loop library, and they very recently switched over to a solution inspired by this work. There are several advantages besides a familiar API, one of them being performance, since the only other way I know of needs extra allocations on Windows due to IOCP requiring a read/write buffer for each event.
It's at least worth considering if the alternative is to implement our own IOCP implementation since that will be a big task anyway.
Wonder if we could just start with select+libevent and then move to IOCP... to save time. But then again, maybe it's too painful to do everything twice... or those others might be interesting as well.
wepoll is limited only to sockets, which limits it to being just an optimization once there's an event loop architecture which can handle the IOCP model.

I'd rather make something that works for the most general case of readable/writable handles and then optimize it for sockets later, instead of making something which works for sockets and then leaving IO to block on every other kind of file until someone gets round to fixing it (which would mean refactoring the event loop, which probably means nobody will get around to it, which means hell).
You're right, it's unfortunately only useful for sockets it seems. On investigating this closer I also realized that the use of IOCTL_AFD_POLL seems to be undocumented and might not work in the future, which probably should raise concerns if used in Crystal's stdlib.
@rdp I think designing the event loop architecture with IOCP in mind from the start is the right thing to do. I would actually consider designing it with IOCP in mind first, getting the readiness based models like kqueue and epoll to work with that is probably easier than the other way around. However, everything is possible.
@cfsamson yeah, I thought getting epoll to work like IOCP is easier than the other way around too. After all, the "you need to allocate fewer buffers when using epoll" argument is moot when using crystal's IO model: you need to allocate them anyway, since it emulates blocking IO with green threads.
Windows' IO model is essentially submitting a buffer and the OS tells you when it's done filling it with data and how much. This is easily mapped to Crystal, and epoll is easily mapped to that (we already do it, just at a higher layer).
@RX14 I'm going out on a limb here partly since it might contribute to the discussion, and partly out of curiosity. I made an extremely simplified model to just plot down how something like this could work (if the plan is to abstract at a higher level like sockets/files/pipes to hide the implementation details). If I understand you correctly the green thread model greatly simplifies the event loop implementation since you can easily prevent the buffer sent to IOCP from being touched while waiting for the event to complete and you will not have any "extra" allocations since this will be abstracted over in either case:
I apologize in advance for simplifying this so much that the code isn't really valid in any language, and for skipping a ton of complexity.
@cfsamson note that all reads go through the IO primitive read(), which already requires an explicit buffer to be passed in, so you'd register that buffer with the event loop directly. gets and co already allocate buffers and use read(). That means the amount of allocation does not change, and buffers must already be per-fiber.
libevent supports threaded IOCP (practically the only examples I could find: https://github.com/libevent/libevent/blob/master/event_iocp.c, https://github.com/libevent/libevent/blob/master/sample/http-server.c, https://github.com/libevent/libevent/blob/master/test/regress_iocp.c#L304 (the *ptr passed in is a "basic_test_data" object with a socket pair established), and http://www.wangafu.net/~nickm/libevent-book/Ref6_bufferevent.html), but documentation is super scarce... it almost looks unmaintained... either that or it's bug free? :)

Also interesting is that https://github.com/libevent/libevent/blob/master/event_iocp.c (the entire "libevent IOCP implementation") isn't that long, maybe a good pattern. Go's seems reasonably small too: https://golang.org/src/runtime/netpoll_windows.go?h=iocp and https://golang.org/src/net/fd_windows.go#205 line 205 FWIW. Maybe that's all :)

libuv also supports IOCP, but... I could hardly see any examples anywhere... also it seems libuv requires one "event loop" per thread; I wasn't sure how that lines up with crystal's current use of libevent...
I've been thinking about this quite a lot (since I'm investigating something related). Creating our own event queue is doable. It's a handful of syscalls to use on linux/bsd/windows, but this is only part of the problem and we should consider the next steps as well. Here are some of the questions I think need some discussion, along with my initial thoughts:
We can implement a simple Reactor backed by this epoll/kqueue/IOCP based queue, which is meant to run on a single separate reactor thread (OS thread), and then follow an epoll based approach from there (for lack of a better description) where the Reactor wakes the green thread which is ready to progress with a task. This means the Reactor needs a way to communicate with the Scheduler to mark the relevant green thread as ready to progress.
The next part is how we register events. My initial thoughts here are that we implement a Registrator which can be sent to different OS threads and which is tied to the epoll/kqueue/IOCP based event queue. This passes on a resource handle, a buffer and a flag to indicate Readable/Writable interest. It also means that we need to make this Registrator thread safe. Registrator would be an implementation detail used in the abstraction over, for example, Sockets, where we register an interest with the Reactor and then suspend the green thread (see the sketch below).
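To make the shape of that proposal concrete, here is a minimal sketch; every name in it (Reactor, Registrator, Interest, submit) is hypothetical, and the reschedule call stands in for whatever suspension primitive the scheduler exposes:

```crystal
# Hypothetical interest flag passed along with each registration.
enum Interest
  Readable
  Writable
end

# The reactor thread owns the epoll/kqueue/IOCP queue; `submit` would
# push onto a thread-safe queue that the reactor thread drains.
class Reactor
  def submit(handle : Int32, buffer : Bytes, interest : Interest, fiber : Fiber) : Nil
    # enqueue (handle, buffer, interest, fiber) for the reactor thread;
    # on readiness/completion the reactor marks `fiber` runnable again
  end
end

# Thread-safe front end used from IO abstractions such as sockets.
class Registrator
  def initialize(@reactor : Reactor)
  end

  def register(handle : Int32, buffer : Bytes, interest : Interest) : Nil
    @reactor.submit(handle, buffer, interest, Fiber.current)
    Crystal::Scheduler.reschedule # suspend until the reactor wakes us
  end
end
```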
File I/O: since files are most often cached by the OS (and have poor cross platform APIs, AFAIK), file operations are most often sent to a thread pool. Anyway, I think we need a cross platform thread pool up and running as well to be able to actually use this in e.g. a web server.
Waking the scheduler from the reactor is a bit tricky I think, due to synchronization issues and performance. If we want to avoid actively polling a queue, we need a way to interact with the scheduler from the reactor thread. I don't know how well the current Scheduler implementation allows for this, or the best way to solve it yet.

I'm just putting these thoughts here for now to see if they can contribute to a constructive discussion.
The interface would be to submit a file descriptor for a read/write to the scheduler, and the scheduler would resume your fiber when it's done. The rest is a platform-specific black box. On existing platforms it'd use the same (refactored) libevent code it always has, just moved out of IO::Evented. On windows, it'd use IOCP. There's no need to share code at a higher granularity for an initial implementation, so KISS.
IIRC with IOCP you can register a void* of data with your read/write; this would simply be the Fiber pointer to resume, so the event loop would just block waiting for events and directly receive the fiber pointer to resume from the OS. That's why I proposed the custom event loop for windows: the only hard part is handling the sleep events.
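A minimal sketch of that association, assuming the completion key carries the fiber pointer (note the caveat raised later in the thread: the key is set per handle in CreateIoCompletionPort, not per read/write). The binding is reduced to the one call needed here, and register is a hypothetical helper:

```crystal
lib LibKernel32
  alias Handle = Void*

  # HANDLE CreateIoCompletionPort(HANDLE, HANDLE, ULONG_PTR, DWORD)
  fun CreateIoCompletionPort(file : Handle, existing_port : Handle,
                             completion_key : LibC::SizeT, concurrent_threads : UInt32) : Handle
end

# Associate `handle` with the completion port, storing the fiber as the
# completion key so completions come back with the fiber to resume.
def register(iocp : LibKernel32::Handle, handle : LibKernel32::Handle, fiber : Fiber)
  key = LibC::SizeT.new(fiber.as(Void*).address)
  LibKernel32.CreateIoCompletionPort(handle, iocp, key, 0_u32)
end
```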
Oh, I see.
Yes, you associate a token (or a pointer) when registering a resource with the completion port in CreateIoCompletionPort (you associate it with the resource and not as a part of the event WSARecv/WSASend). This token is "returned" when retrieving an event using either GetQueuedCompletionStatus (wait for one event at a time) or as part of the OVERLAPPED_ENTRY structure when using GetQueuedCompletionStatusEx to get multiple events.
You still need to actually do the blocking wait for events in a separate thread, so that would be the Reactor part I suggested, or some variant of that inside the black box.
Sleep events are tricky. I've tried something like that before and kept an ordered queue of timers (I used a BTreeMap) which I checked for expired timers every time the event queue received an event (or timed out).
Every blocking call to GetQueuedCompletionStatus(Ex) uses the closest timer as a timeout. If a new timer is registered which is earlier than the previously registered timers, I update the timeout by posting an empty completion packet using PostQueuedCompletionStatus, thereby forcing the event queue to wake up and update its timeout with the new value.
I don't know if there is a better way to do this since it's not a pretty solution. There probably is.
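A minimal sketch of that wake-up trick; the binding is trimmed to the one call, and WAKE_KEY plus the register_timer helper are assumptions:

```crystal
lib LibKernel32
  alias Handle = Void*

  # BOOL PostQueuedCompletionStatus(HANDLE, DWORD, ULONG_PTR, LPOVERLAPPED)
  fun PostQueuedCompletionStatus(port : Handle, bytes_transferred : UInt32,
                                 completion_key : LibC::SizeT, overlapped : Void*) : Int32
end

# Reserved key meaning "no I/O completed, just recompute the timeout".
WAKE_KEY = LibC::SizeT.new(0)

# If the new deadline is earlier than the earliest known timer, post an
# empty packet so the blocked GetQueuedCompletionStatus(Ex) call returns
# immediately and picks up the shorter timeout.
def register_timer(port : LibKernel32::Handle, deadline : Time, earliest_deadline : Time)
  return if deadline >= earliest_deadline
  LibKernel32.PostQueuedCompletionStatus(port, 0_u32, WAKE_KEY, Pointer(Void).null)
end
```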
Yes, you associate a token (or a pointer) when registering a resource with the completion port in CreateIoCompletionPort (you associate it with the resource and not as a part of the event WSARecv/WSASend). This token is "returned" when retrieving an event using either GetQueuedCompletionStatus (wait for one event at a time) or as part of the OVERLAPPED_ENTRY structure when using GetQueuedCompletionStatusEx to get multiple events.
Ah yes, I'd forgotten the details. This is exactly the same as we currently have in libevent's implementation then, where we register one Event per read/write of a given FD. Then you receive a read/write event for a given file descriptor, and you're passed a handle to the IO::Evented instance. Then the IO::Evented instance works out exactly what fiber to resume, given there's been an event. Here is the impl.
You still need to actually do the blocking wait for events in a separate thread, so that would be the Reactor part I suggested, or some variant of that inside the black box.
The interface is Crystal::EventLoop.run_once, in the same file I showed above. This would simply calculate the interval to the next timer and call GetQueuedCompletionStatusEx with that interval as the timeout.

If a new timer is registered which is earlier than the previously registered timers, I update the timeout by posting an empty completion packet using PostQueuedCompletionStatus, thereby forcing the event queue to wake up and update its timeout with the new value.
This is all abstracted a bit above the event loop in crystal, at the scheduler level. Event loops are per-thread, not per-program, and one thread can only have timers registered on its event loop from that thread, meaning you never have to deal with the case of a timer being registered while you're sleeping. Fibers can then be passed between threads by pipes (which generate a read event). I'm glad this is solvable if that situation changes though. Might want to look through scheduler.cr to get an idea.
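For illustration, here is a rough sketch of what such a run_once could look like on win32. The OVERLAPPED_ENTRY layout follows the Win32 docs; run_once itself, the 64-entry batch and the enqueue call are assumptions about how it would plug into the scheduler:

```crystal
lib LibKernel32
  alias Handle = Void*

  # Matches the Win32 OVERLAPPED_ENTRY layout.
  struct OverlappedEntry
    completion_key : LibC::SizeT
    overlapped : Void*
    internal : LibC::SizeT
    bytes_transferred : UInt32
  end

  fun GetQueuedCompletionStatusEx(port : Handle, entries : OverlappedEntry*, count : UInt32,
                                  removed : UInt32*, timeout_ms : UInt32, alertable : Int32) : Int32
end

# Block until at least one completion arrives or the next-timer interval
# elapses, then hand every completed fiber back to the scheduler.
def run_once(port : LibKernel32::Handle, timeout_ms : UInt32)
  entries = uninitialized LibKernel32::OverlappedEntry[64]
  if LibKernel32.GetQueuedCompletionStatusEx(port, entries.to_unsafe, 64_u32, out removed, timeout_ms, 0) != 0
    removed.times do |i|
      # assuming the completion key was set to the submitting fiber
      fiber = Pointer(Void).new(entries[i].completion_key.to_u64!).as(Fiber)
      Crystal::Scheduler.enqueue(fiber)
    end
  end
  # a FALSE return with WAIT_TIMEOUT means no I/O: fire expired timers
end
```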
That actually simplifies things even more. I'll have to get to know the scheduler and EventLoop a bit more to make sure I understand correctly.
I would start by adding bindings for the relevant syscalls and providing some wrappers around them to make them easier to use. I suggest that Event essentially becomes just an alias for OVERLAPPED_ENTRY on Windows, which will return with a pointer to the relevant IO::Evented instance in the lp_completion_key field once an event has completed. This should allow us to set up some basic infrastructure in event_loop_iocp.cr.

I'll see if I have some time after my current project is done and see if I can help progress this. Is there a stdcall directive in Crystal for working with WinAPI, or is there another way that is solved?
The above suggestion of using the lp_completion_key field might not work, since it's registered on a per-resource basis. I can't see a way to actually identify which event has occurred by using it to store a pointer to IO::Evented.
Judging by the Boost ASIO implementation, they seem to not use CompletionKey at all, except when posting "custom" completion packets with PostQueuedCompletionStatus, as in the case of timers.
Instead they wrap the OVERLAPPED structure passed in to, for example, WSARecv in an Operation struct compatible with the memory layout the API expects. This lets them cast the pointer to the Operation struct to *OVERLAPPED when passing it in to WSARecv, and then back to an Operation struct when the event has occurred, thereby getting the rest of the context for that exact event.
The above suggestion of using the lp_completion_key field might not work, since it's registered on a per-resource basis

I think this is fine, given that there's only one IO::FileDescriptor per resource. But I wonder if we'd be able to make that guarantee.
I can't see where windows lets you see whether an event was a read or a write completing though...
I'll see if I have some time after my current project is done and see if I can help progress this. Is there a stdcall directive in Crystal for working with WinAPI, or is there another way that is solved?
There are plenty of windows API functions bound already. I think crystal already uses stdcall for all functions on windows.

Actually, since we're compiling for x86-64, everything on windows uses the microsoft x64 calling convention, which is not fastcall. This applies to WinAPI and crystal functions themselves. So on 64bit windows there is only one calling convention, which makes this all simple. If we ever port to 32bit windows we might have to sort this out.
I can't see where windows lets you see whether an event was a read or a write completing though...
I think this is exactly why they wrap OVERLAPPED, to provide this extra information about the event. I'll see if I can find some more references on how to solve this. I haven't found any information about this in the IOCP documentation, but I might have missed something. I do have a POC using this technique in Rust and it seems to work fine.
I think this is exactly why they wrap OVERLAPPED, to provide this extra information about the event.

If it works, is more flexible, and is what everyone else does, this is just fine by me!
I just checked mio (a Rust implementation of an epoll/kqueue/IOCP event queue) and it does (well, did, since they switched to wepoll recently) the same as I explained with regards to wrapping the OVERLAPPED structure above; check here for the relevant lines of the source code.

It seems to be a pretty normal technique.
It seems to be a pretty normal technique.

yeah, I prefer this too now that I know about it.
I've done the plumbing work to get the IOCP functions wrapped into Crystal. I am somewhat familiar with win32 APIs but I've never worked on an event loop. @cfsamson Would you want to work together on this? I've been reading some of your stuff here to get up to speed.
@incognitorobito Great! I've been wanting to take this on but have had (and still have) limited bandwidth. If you can take the lead on this I'll try to help push it forward. Great that you found that book. The event loop here should be pretty simple. If I remember correctly, Crystal::EventLoop.run_once will pretty much wrap a call to GetQueuedCompletionStatusEx and resume the correct fiber on a completion event (or a timeout).
We'll need to wrap WSAOVERLAPPED in something like an "Operation" struct which gives us a handle to the fiber to resume on completion. We need to keep the memory layout of Operation compatible with WSAOVERLAPPED. I write about that all the way down at the end of this chapter. I'm not 100% sure how to do this in Crystal, though.
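One possible way to do it in Crystal, sketched under the assumption that lib structs (which use C layout) are good enough: keep the OVERLAPPED as the first field so a pointer to the wrapper is also a valid LPOVERLAPPED. The Operation name and its fields are assumptions:

```crystal
lib LibWin32
  # Matches the Win32 OVERLAPPED layout (WSAOVERLAPPED is identical).
  struct Overlapped
    internal : LibC::SizeT
    internal_high : LibC::SizeT
    offset : UInt32
    offset_high : UInt32
    h_event : Void*
  end

  # OVERLAPPED must stay the first field (offset 0), so that an
  # Operation* can be passed wherever an LPOVERLAPPED is expected.
  struct Operation
    overlapped : Overlapped
    fiber : Void* # the submitting Fiber, stored as a raw pointer
  end
end

# Heap-allocate: the OS keeps using this memory until the I/O completes.
op = Pointer(LibWin32::Operation).malloc(1)
op.value = LibWin32::Operation.new(fiber: Fiber.current.as(Void*))

# Submit side: hand the same pointer to WSARecv/ReadFile as LPOVERLAPPED.
lp_overlapped = op.as(LibWin32::Overlapped*)

# Completion side: cast the LPOVERLAPPED from OVERLAPPED_ENTRY back.
completed = lp_overlapped.as(LibWin32::Operation*)
fiber = completed.value.fiber.as(Fiber)
```

An @[Extern] annotated Crystal struct would be another way to get C layout if methods are wanted on the wrapper.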
I put together a basic implementation in #9957. Does it line up with what was discussed here?
By the way, IOCP requires handles opened with FILE_FLAG_OVERLAPPED. However:

1. CONIN$ and CONOUT$ cannot be opened with FILE_FLAG_OVERLAPPED (dwFlagsAndAttributes is ignored for consoles according to this document).
2. ... FILE_FLAG_OVERLAPPED.
3. input, output and error of Process.run must be handles without FILE_FLAG_OVERLAPPED. ~Otherwise, process startup fails. I tried to convert a handle with FILE_FLAG_OVERLAPPED to a handle without FILE_FLAG_OVERLAPPED by ReOpenFile but it didn't work.~ See this comment instead of the strikethrough text.

So the event loop for Windows should support I/O without the overlapped flag.
Just an idea to resolve 1 and 2:

For handles with FILE_FLAG_OVERLAPPED (this is just for comparison with the next):

- request the I/O with an OVERLAPPED structure,
- wait for completion with GetQueuedCompletionStatusEx as described in https://github.com/crystal-lang/crystal/pull/9957#issuecomment-731859804.

For handles without FILE_FLAG_OVERLAPPED:

- pass the OVERLAPPED structure to a separate thread (thread pool?),
- call PostQueuedCompletionStatus with the OVERLAPPED structure when the I/O finishes in the separate thread,
- wait for completion with GetQueuedCompletionStatusEx.

As for 3, two ideas. The first:

- File.new and File.open take an argument overlapped : Bool = true.
- Process.run raises an exception when any one of the input, output and error arguments is opened with overlapped = true.

Pros: No need to use a separate thread to request I/O when overlapped = true.
Cons: Programmers must take care about the overlapped flag when using Process.run.

The second:

- Files are always opened without FILE_FLAG_OVERLAPPED.

Pros: Programmers have no need to take care about the overlapped flag when using Process.run.
Cons: Any I/O request uses a separate thread.

@kubo Right now I would let file I/O be blocking. Most implementations I've seen use a threadpool for file I/O (e.g. libuv), but with the advent of io_uring this might change. Since most OSes cache frequently accessed files, the performance impact of leaving it blocking might not be that big depending on the concrete use case (for some uses it might even be faster, since you don't involve a lot of machinery to serve a cached file). However, I see that this might be insufficient as a long term solution, but IMHO we should focus on getting every other piece working first. An interesting article about the subject can be found here.
@incognitorobito I've added some comments in #9957.
@cfsamson I agree with you if file I/O is only disk I/O. However, the file I/O API is used not only for real files but also for pipes and consoles.
I implemented experimental event loop support for IO::FileDescriptor#read based on #9957. It passed std_spec but not compiler_spec, same as #9957. It runs the blocking read in the default Win32 thread pool using TrySubmitThreadpoolCallback.
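A minimal sketch of that pattern, assuming the non-overlapped read is posted back to the completion port when it finishes; the bindings are trimmed to the three calls used, and BlockingRead plus submit_blocking_read are hypothetical names:

```crystal
lib LibKernel32
  alias Handle = Void*

  fun TrySubmitThreadpoolCallback(callback : (Void*, Void*) ->, context : Void*, environ : Void*) : Int32
  fun ReadFile(file : Handle, buffer : UInt8*, size : UInt32, read : UInt32*, overlapped : Void*) : Int32
  fun PostQueuedCompletionStatus(port : Handle, bytes_transferred : UInt32,
                                 completion_key : LibC::SizeT, overlapped : Void*) : Int32
end

# Everything the pool thread needs to perform one blocking read.
record BlockingRead,
  file : LibKernel32::Handle,
  port : LibKernel32::Handle,
  buffer : Bytes,
  overlapped : Void*

def submit_blocking_read(request : BlockingRead*)
  callback = ->(instance : Void*, context : Void*) do
    req = context.as(BlockingRead*).value
    # Blocking read on a pool thread, then feed the result back through
    # the completion port, like a real completion packet.
    LibKernel32.ReadFile(req.file, req.buffer.to_unsafe, req.buffer.size.to_u32, out read, nil)
    LibKernel32.PostQueuedCompletionStatus(req.port, read, LibC::SizeT.new(0), req.overlapped)
    nil
  end
  LibKernel32.TrySubmitThreadpoolCallback(callback, request.as(Void*), nil)
end
```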
EDITED: This causes an access violation when ReadFile() returns after the timeout specified here.
I checked the behavior with the following code:

```crystal
puts("Hit enter to exit.")

# Prints a dot every second; it only gets to run if `gets` below
# suspends the current fiber instead of blocking the whole thread.
spawn do
  loop do
    sleep(1)
    print('.')
  end
end

gets
```
It prints "Hit enter to exit" only with #9957. gets
prevents the loop in the spawn block.
It prints dots periodically with my implementation.
This is a sub-task of porting the stdlib to Windows (#5430):

- Fiber (#6955)
- Crystal::System::Fiber
- Crystal::Event, Crystal::EventLoop (#12149)
- Thread, Thread::Mutex, GC (#11647)

I suppose we can delay porting threads by implementing a mock API for Thread and Thread::Mutex for win32 which essentially doesn't do anything. That should work perfectly fine for a single threaded process.

On windows, we should use the win32 API directly instead of libevent (quoting @RX14): "Since the API models are quite different, this will likely require some refactoring of Crystal::Event and Crystal::EventLoop."