crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

Event Loop hangs after interrupted epoll-wait #10649

Open BrucePerens opened 3 years ago

BrucePerens commented 3 years ago

The event loop hangs after an epoll_wait system call is interrupted. The Crystal code involved is HTTP::Client#get, but I think these system calls come from libevent. This is what I see in strace:

sendto(5, "\27\3\3\1\16\362:\316\376y\204O\261\20\"\355\340Jg\343\320\334(\204;\337 >2\245\355O"..., 275, 0, NULL, 0) = 275
recvfrom(5, 0x5589644893b3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(6, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=5, u64=5}}) = 0
epoll_wait(6, 0x55896446a110, 32, 4860) = -1 EINTR (Interrupted system call)
epoll_wait(6, [], 32, 5000)             = 0
epoll_wait(6, [], 32, 5000)             = 0

After that, it just keeps calling epoll_wait over and over, and every call times out with an empty event list. It looks like the interrupted system call caused the event for FD 5 becoming readable to be dropped. One solution might be to retry all I/O in non-blocking mode when this happens. Or maybe it's really simple and there is a signal we can mask?

The system is Debian Testing on X86-64-Linux-GNU. This is a days-long API client run, and it runs for about 12 hours before this happens.
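For reference, here is a minimal C sketch of the two generic ideas mentioned above: retrying an interrupted wait, and masking signals for the duration of the wait. It is not the libevent or Crystal implementation and is not claimed to fix this hang; the strace output already shows epoll_wait being called again after the EINTR. The helper names `now_ms`, `epoll_wait_retry` and `epoll_wait_masked` are hypothetical; `epoll_wait`, `epoll_pwait`, `clock_gettime` and `sigfillset` are the standard Linux/glibc calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/epoll.h>
#include <time.h>

/* Hypothetical helper: current time in milliseconds on the monotonic clock. */
long long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper: behaves like epoll_wait(), except that an EINTR
 * does not end the wait. The call is repeated with whatever is left of
 * the original timeout. A negative timeout_ms waits indefinitely. */
int epoll_wait_retry(int epfd, struct epoll_event *events,
                     int maxevents, int timeout_ms) {
    assert(maxevents > 0);
    long long deadline = timeout_ms < 0 ? 0 : now_ms() + timeout_ms;
    for (;;) {
        int n = epoll_wait(epfd, events, maxevents, timeout_ms);
        if (n >= 0 || errno != EINTR)
            return n;               /* events, timeout, or a real error */
        if (timeout_ms >= 0) {
            /* Interrupted by a signal: shrink the timeout and try again. */
            long long left = deadline - now_ms();
            timeout_ms = left > 0 ? (int)left : 0;
        }
    }
}

/* Hypothetical helper: waits with every blockable signal masked, so
 * handlers run only after the wait returns (SIGKILL and SIGSTOP cannot
 * be blocked). */
int epoll_wait_masked(int epfd, struct epoll_event *events,
                      int maxevents, int timeout_ms) {
    sigset_t all;
    sigfillset(&all);
    return epoll_pwait(epfd, events, maxevents, timeout_ms, &all);
}
```

Either helper would be called in place of a plain `epoll_wait(epfd, events, maxevents, timeout)`. The masked variant trades responsiveness for simplicity, since signal handlers are delayed until the wait finishes.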

Is there anything more you would like me to do to instrument this problem?

BrucePerens commented 3 years ago

Also hangs with -Dpreview_mt

docelic commented 3 years ago

Thanks for updating this thread, Bruce.

BrucePerens commented 3 years ago

> Thanks for updating this thread, Bruce.

You're welcome! At the moment, I don't know if this is:

However, the Crystal system code is still young, and it's likely to have untested bugs that will only pop up if you do something like make 50K API calls overnight.