Closed ghost closed 3 years ago
@wegylexy commented on Apr 19, 2020, 2:39 AM UTC:
Are you sure it does not happen on Linux 4.18? I have a program that hangs at epoll_wait()
every several hours until it is killed with SIGINT, and I have spent months trying to find a way out.
@geordieintheshellcode commented on May 29, 2020, 11:36 AM UTC:
I can confirm that we've seen the same issue running on 5.5.5-200.fc31.x86_64 and the suggested patch does fix the issue.
@wegylexy commented on Jul 8, 2020, 7:01 AM UTC:
Until the Linux kernel is fixed, should it default to disabling `eventfd`?
Duplicates #452.
@neunhoef commented on Jan 21, 2020, 10:35 AM UTC:
I have observed a wakeup problem in a large server application (https://arangodb.com) which uses `boost::asio`. A completion handler which is posted to an `io_context` is not executed in a timely fashion. The `io_context` runs only a single thread. The problem seems to occur under Linux >= 5.3 only.

I have actually tracked down the problem to what I think is a problem in the Linux kernel. See this link for the bug report. Unfortunately, there has been no response to this in over a month now, so we might have to work around this in `boost::asio` or even directly in ArangoDB.

Here is the background:
`boost::asio` uses an `eventfd` to wake up an `epoll_loop` when a new completion handler is posted to the `io_context`. Furthermore, edge-based wakeup is used, so that the event fires only once. Once the completion handler is posted to the `io_context`, to actually wake up the `epoll`, the code at:

is used, which uses `epoll_ctl` to re-add the `eventfd` to the `epoll` file descriptor in the hope that this resets the edge detection and thus wakes up the file descriptor, since the `eventfd` is permanently kept in a readable state.

This seems to work beautifully for Linux < 5.3 according to my experiments. However, with both an Ubuntu 5.3 Linux kernel and a vanilla 5.3 Linux kernel, very occasional lost wakeups can be observed. I have produced a program (not using `boost::asio`) which shows the problem; it is attached to the above Linux bug report.

If the Linux kernel folks do not answer or indeed are of the opinion that this method is not actually supposed to work reliably (I did not find documentation about this), then we might have to work around this problem in `boost::asio`. The following diff solves the problem for me, but might have an impact on performance:

What do you guys think about this topic and how could it be resolved for us users of `boost::asio`? Do you have any contacts in the Linux world to resolve this problem?

Cheers, Max.
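
For readers following the thread, the wakeup technique described in the background above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the actual `boost::asio` code. It assumes that "re-adding" the `eventfd` corresponds to an `EPOLL_CTL_MOD` call with `EPOLLET` set, and all function names (`make_always_readable_eventfd`, `register_edge_triggered`, `rearm_wakeup`, `wait_once`, `demo_rearm_cycle`) are hypothetical. It only shows the intended behavior and is not meant to reproduce the lost wakeup.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Create an eventfd whose counter starts at 1 and is never read,
// so the descriptor stays permanently readable.
int make_always_readable_eventfd() {
    return eventfd(1, EFD_CLOEXEC | EFD_NONBLOCK);
}

// Helper (hypothetical): issue an epoll_ctl with edge-triggered read interest.
static bool ctl_edge_triggered(int epfd, int op, int efd) {
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLERR | EPOLLET;
    ev.data.fd = efd;
    return epoll_ctl(epfd, op, efd, &ev) == 0;
}

// Initial registration of the eventfd with the epoll instance.
bool register_edge_triggered(int epfd, int efd) {
    return ctl_edge_triggered(epfd, EPOLL_CTL_ADD, efd);
}

// The "wakeup": re-register the already readable eventfd so that the
// edge-triggered readiness is reported again.
// Assumption: EPOLL_CTL_MOD is what "re-add" means in the report.
bool rearm_wakeup(int epfd, int efd) {
    return ctl_edge_triggered(epfd, EPOLL_CTL_MOD, efd);
}

// Poll for at most one event; returns the number of ready descriptors.
int wait_once(int epfd, int timeout_ms) {
    epoll_event out{};
    return epoll_wait(epfd, &out, 1, timeout_ms);
}

// Runs register -> wait -> wait -> re-arm -> wait and reports whether the
// sequence behaved as the technique expects (event, no event, event).
bool demo_rearm_cycle() {
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    int efd = make_always_readable_eventfd();
    bool ok = epfd >= 0 && efd >= 0;
    int first = -1, second = -1, third = -1;
    if (ok) {
        ok = register_edge_triggered(epfd, efd);
        first = wait_once(epfd, 0);   // initial readiness is reported
        second = wait_once(epfd, 0);  // edge already consumed
        ok = ok && rearm_wakeup(epfd, efd);
        third = wait_once(epfd, 0);   // readiness reported again
    }
    if (efd >= 0) close(efd);
    if (epfd >= 0) close(epfd);
    return ok && first == 1 && second == 0 && third == 1;
}
```

In a real event loop, `rearm_wakeup` would be called from the thread posting the handler while another thread blocks in `epoll_wait`; the report concerns that concurrent case, which this sketch deliberately does not exercise.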
This issue was moved by chriskohlhoff from boostorg/asio#320.