Epoll event loop (linux)

ysbaddaden commented 2 months ago

Implements a custom event loop on top of epoll for Linux (and Solaris?), following RFC #7. It's not particularly optimized yet (e.g. it still allocates), but seems to be working.

The EventQueue and Timers types aren't the best solutions (performance wise) but they work and abstract their own logic (keep a list of events per fd / keep a sorted list of timers). We should be able to improve (e.g. linked lists -> ordered linked lists -> skiplists, ...)

ISSUES

[x] ~~#after_fork must close & nullify every Fiber#timerfd & update pending events to use the new fd~~ (not needed with one timerfd per eventloop, see below);
[x] ~~must reset mutex after fork before exec (preview mt);~~
[x] ~~must "close" fd in all eventloops (preview mt);~~
[x] ~~LibC bindings for aarch-linux-musl are wrong;~~
[x] ~~investigate segfault runs on CI;~~
[x] ~~support :preview_mt (broken CI);~~
[x] ~~can't modify a registered fd with EPOLLEXCLUSIVE set~~ (disabled for now);
[x] BUG: a timerfd file descriptor errored or got closed! on CI;
[x] epoll_ctl(EPOLL_CTL_MOD): No such file or directory (RuntimeError) on CI;
[ ] Unhandled exception: pthread_mutex_unlock: Operation not permitted (RuntimeError) on CI and again
[x] ~~infinite loop in interpreter on CI (primitives spec);~~

OPTIMIZE

[x] epoll_event.data should point to Node instead of fd, so we could skip searches;
[x] use eventfd instead of pipe2 to interrupt the loop;
[x] one timerfd per eventloop should be enough (wait until next timer => we can't enqueue another timer on that thread anymore => for MT a parallel enqueue will interrupt epoll_wait); we still need a timerfd because epoll_wait only has millisecond precision; we also need a list of timers to know the next timer, and which timers are ready.

TODO

[ ] don't raise BUG exceptions, print BUG warnings to STDERR;
[ ] more testing: please help to break it :hammer:
[ ] extract fix for #14807 (distinct PR);
[ ] extract fix for double event dequeue in Fiber#cancel_timeout (distinct PR);
[x] cleanup Crystal::Epoll::EventLoop (commented code, debug traces);
[ ] use :ev_epoll flag instead of forcing it for Linux (not yet: I want CI test runs);

ysbaddaden commented 2 months ago

Urgh, supporting fork is painful. I must close the timerfds for each fiber (easy) but then I must mutate all pending events to use the new fd :weary:

ysbaddaden commented 2 months ago

I took care to use EPOLLEXCLUSIVE (available since 2017). I'm not sure we need EPOLLONESHOT since we register the fd once then dequeue readers/writers one by one. Now, there may still be races, especially in a MT environment, but there's little we can do without blocking the other readers/writers (there's no telling if a reader/writer will keep reading/writing).

I didn't know the close quirk. That explains why we unregister the file descriptors before we actually close. Now, if a fd is in multiple epoll instances, oops, ~~this is likely already broken today~~ :fearful: this is working today: we remove the events and the fd from all event bases that referenced the IO object.

ysbaddaden commented 2 months ago

The close quirk needs some experimental verification:

Q6 Will the close of an fd cause it to be removed from all epoll sets automatically?

A6 Yes.

https://linux.die.net/man/4/epoll

The one thing to take care of would be to resume the cached event/fibers from all epoll instances (tricky).

I also found an issue with EPOLLEXCLUSIVE: we can't modify, so either we don't use it or we must del/add when changing the set of events.

yxhuvud commented 2 months ago

The close quirk

Could be they fixed it at some point, there has been a bunch of years since that post was written after all..

ysbaddaden commented 2 months ago

Another man page, a longer answer:

Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file de- scription is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the descriptor is explicitly removed using epoll_ctl() EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file de- scriptors referring to the same underlying file description remain open.

https://man.freebsd.org/cgi/man.cgi?query=epoll&apropos=0&sektion=0&manpath=SuSE

ysbaddaden commented 2 months ago

The preview_mt issues are caused by thread B closing a fd while thread A is waiting; since each thread has its own eventloop this is creating issues, I end up not cleaning up the fd from thread A' epoll instance.

I'm trying to think of an efficient way to iterate the event loops on (something lighter than ThreadLocalValue), but may resort to iterate all EL for starters (~~and check if it fixes the MT issues~~ that fixes the issue).

yxhuvud commented 2 months ago

Perhaps it is possible to have a lookup table fd -> loop (or probably - a bit less contentious in mt scenario - have a list in the FileDescriptor (well no, but you perhaps better understand what I mean compared to other choices of words) that keeps track of what epolls uses it). But it is an icky problem :(

ysbaddaden commented 2 months ago

It looks like it's starting to work... except for the interpreter that now hangs forever :sob:

ysbaddaden commented 2 months ago

Using a release compiler (with libevent), the interpreter segfaults :sob:

Building a compiler with epoll (as CI does) in non release, the interpreter enters a kind of infinite loop (100% CPU, silent strace) after printing the first spec dot (.). The last strace logs are:

epoll_create1(EPOLL_CLOEXEC)            = 13
eventfd2(0, EFD_CLOEXEC)                = 14
timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC) = 15
epoll_ctl(13, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=1958253120, u64=140490338505280}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=1958253072, u64=140490338505232}}) = 0
write(11, "\33[", 2
rite(11, "32", 232)                      = 2
write(11, "m", 1m)                       = 1
write(11, ".", 1.)                       = 1
write(11, "\33[", 2
rite(11, "0", 10)                       = 1
write(11, "m", 1m)                       = 1

It's creating the eventloop instance (epoll, eventfd, timerfd) then writes a colored . then nothing.

Any pointers to debug the interpreter itself?

ysbaddaden commented 2 months ago

I can't get tracing to work in the interpreted code, so I hacked myself into Crystal.trace and I see an infinite list of sched.event_loop traces :raised_eyebrow:

ysbaddaden commented 2 months ago

I only checked whether @events was empty and didn't check if @timers was, too :facepalm:

ysbaddaden commented 2 months ago

CI is finally green and the implementation can be reviewed!

ysbaddaden commented 2 months ago

how to roll out

Either opt-in with a flag... or we're bold and make it the default + a flag to opt-out. That will depend on how stable/performant it will get until the next release.

ysbaddaden commented 2 months ago

Moving back to draft. There are some fixes in #14829 and... I got great ideas from the Go implementation that would allow to skip most of the synchronizations and take full advantage of epoll/kqueue (see https://github.com/ysbaddaden/execution_context/issues/30).

straight-shoota commented 1 month ago

Superseded by #14959

crystal-lang / crystal

Epoll event loop (linux) #14814