fix issue with edge triggered(EPOLLET) read monitoring(EPOLLIN) that …

moihn commented 3 years ago

Hi,

This is for fixing an issue that we encountered when using libmicrohttpd with external epoll mode.

When client request data size is bigger than that of connection read buffer(default or configured size), data is not fully read out of socket OS read buffer in a single round of processing, and residual data left unread will not trigger next epoll_wait() to return, as libmicrohttpd is using EPOLLET(edge-triggered) when adding client socket to epoll instance.

In this case, the call_handlers() function should try to read all available request data from socket before finishing this round of read_ready=true handling.

We simply have tried to loop over the MHD_connection_handle_read() and MHD_connection_handle_idle() to read and process as much data as possible in single round of read-ready processing, and this solves the issue.

Please feel free to review and let us know if anything is wrong in our pull request.

Thanks and regards

Nan

Karlson2k commented 3 years ago

You changed behaviour for any kind of polling function: for select(), poll(), and epoll, while edge-trigger is used only for epoll.
With proposed change there are a risk that MHD will stick on processing of single connection only (if CPU is slow and network is fast) and left all other connections without services (processing).
MHD already correctly handles edge-triggered events. Please check the code: Connection is marked with read-ready flag as soon as epoll return information about it. Marked connection is added to the e-ready list: https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/microhttpd/daemon.c#L4961 Read handler is called if connection is "read-ready": https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/microhttpd/daemon.c#L5046 Read-ready flag is removed if no more data in the network buffers https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/microhttpd/connection.c#L204 or if read is only partial (buffer is larger than available data): https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/microhttpd/connection.c#L227 Then connection is removed for e-ready list if no more data is available (or read is not needed) https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/microhttpd/daemon.c#L5054. If e-ready list is not empty then epoll timeout is set to zero (just to update current statuses) and data is processed in the next round.

Actually repeating read from connection in the single round make no sense because either MHD read full buffer that MHD wish to process in this round or amount of available data is smaller than available buffer.

Do you have any specific issue with edge-triggering or just found something to improve by reading the code?

moihn commented 3 years ago

Hi, thanks for your response.

I totally agree that read from connection in a single round isn't a good solution at all.

In our case, we are using external epoll() in our application, and we have a large size of http post data(bigger than connection's read buffer size).

In our use case scenario, the first time that client connection is established, our external epoll_wait() call can correctly return with MHD epoll descriptor(which is an umbrella descriptor for server socket and all client sockets of MHD), and we then call MHD_run() for it to process the connection. At this moment, MHD can read and process data from connection(that is one full read buffer of data) and finish this round by return from MHD_run(). Then epoll_wait() is called again, but this time it won't return the MHD file descriptor any more. This caused our application server to hang.

After investigation, we found that the first round of MHD_run()/MHD_epoll()+MHD_connection_handle_read() didn't finish all available data in the client socket, and as we are using EPOLLET(edge-triggered) monitoring with it, the residual data in the socket won't trigger a second epoll_wait() to return this client socket ready event.

This is described exactly in man 7 epoll page as:

       If the rfd file descriptor has been added to the epoll interface
       using the EPOLLET (edge-triggered) flag, the call to
       epoll_wait(2) done in step 5 will probably hang despite the
       available data still present in the file input buffer;

and we have encountered exactly this hang.

So we found that if client connection socket is monitored using EPOLLET(edge-triggered), it is required that we read all available client socket data before going back to epoll_wait() again, otherwise epoll_wait() won't be able to inform us the socket ready state.

Another possible solution is to use level triggered, which will behave similar to poll(), so we don't need to read all available data of client socket before returning to epoll_wait(). But as MHD starts with write monitoring(EPOLLOUT) of client socket, it may cause busy-waiting, and it requires some other code change to work better(mainly not monitor EPOLLOUT too early, but only after response is ready, and remove EPOLLOUT monitoring after the request handling has fully completed).

Karlson2k commented 3 years ago

You need to use MHD API correctly. See https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/include/microhttpd.h#L2685, https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/include/microhttpd.h#L2740, https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/include/microhttpd.h#L2125, https://github.com/Karlson2k/libmicrohttpd/blob/v0.9.73/src/include/microhttpd.h#L2562.

You should not indefinitely wait with epoll_wait(). If MHD has more data to process in the next round then MHD_get_timeout() return zero timeout. Edge-triggered data is not the only case. Data may wait in TLS buffers, also timed-out connections must be closed.

I will try to make documentation even more explicit.

moihn commented 3 years ago

Thanks for this clarification. I see what you mean now, but I think there is still a problem even if next call to epoll_wait() uses the 0 second timeout returned by MHD_get_timeout().

Actually we didn't wait indefinitely on epoll_wait(), we have a max wait time out, which is MIN(out_timeout_max, MHD_get_timeout()). So when there is data left in client connection socket, our next epoll_wait() is actually called using 0 timeout, but epoll_wait()'s result didn't indicate any readiness of descriptor being monitored, as the umbrella mhd_epoll_fd has client socket in edge-triggered mode, and residual data in the client socket won't make the mhd_epoll_fd ready.

When a next epoll_wait() call with 0 timeout doesn't tell us mhd_epoll_fd is ready, we won't call into MHD_run(), as our application monitors not only mhd_epoll_fd, but also other descriptors that our application is using outside MHD.

If I understand correctly, we might be good if we do (pseudo code)

for (...)
{
  epoll_wait(...);
  if (mhd_epoll_fd is ready)
  {
    MHD_Result ok;
    do
    {
      ok = MHD_run();
    } while (ok && MHD_get_timeout() == 0)
  }
}

Is this ok?

Karlson2k commented 3 years ago

Actually we didn't wait indefinitely on epoll_wait(), we have a max wait time out, which is MIN(out_timeout_max, MHD_get_timeout()). So when there is data left in client connection socket, our next epoll_wait() is actually called using 0 timeout, but epoll_wait()'s result didn't indicate any readiness of descriptor being monitored, as the umbrella mhd_epoll_fd has client socket in edge-triggered mode, and residual data in the client socket won't make the mhd_epoll_fd ready.

MHD_run() must be called every time when epoll_wait() returns regardless of the state of mhd_epoll_fd.

Pseudo code for correct approach is:

while (!must_exit())
{
  MHD_UNSIGNED_LONG_LONG mhd_timeout;
  int ep_timeout;

  if (MHD_NO == MHD_get_timeout(&mhd_timeout))
    ep_timeout = -1;
  else
    ep_timeout = (mhd_timeout > INT_MAX) ? INT_MAX : (int)mhd_timeout;

  epoll_wait(..., ep_timeout);
  if (MHD_NO == MHD_run())
    break;

  do_other_things();  
}

moihn commented 3 years ago

Thanks for your response and the example pseudo code.

MHD_run() must be called every time when epoll_wait() returns regardless of the state of mhd_epoll_fd.

This point is very important for external epoll mode, I think.

I will try to update our code to work in this way.

Karlson2k commented 3 years ago

If you want to save some CPU resources, you can use more complex code, like:

while (!must_exit())
{
  MHD_UNSIGNED_LONG_LONG mhd_timeout;
  int ep_timeout;
  MHD_Result has_mhd_timeout;

  has_mhd_timeout = MHD_get_timeout(&mhd_timeout);
  if (MHD_NO == has_mhd_timeout)
    ep_timeout = -1;
  else
    ep_timeout = (mhd_timeout > INT_MAX) ? INT_MAX : (int)mhd_timeout;

  epoll_wait(..., ep_timeout);
  if (MHD_YES == has_mhd_timeout || mhd_epoll_fd is ready)
  {
    if (MHD_NO == MHD_run())
      break;
  }
  do_other_things();  
}

However I think that previous example is simpler and clearer while resource saving is negligible.

Should we close this PR?

moihn commented 3 years ago

I am going to experiment using your method, and I will let you know soon if that works, so I expect this can be closed soon.

Karlson2k commented 3 years ago

Please report if something is wrong by opening an issue.

moihn commented 3 years ago

Thanks, prototyped with your suggestion and it works.

Karlson2k commented 3 years ago

Thanks, prototyped with your suggestion and it works.

👍

The doxy documentation will be updated to further emphasise this requirement.

Karlson2k commented 3 years ago

Doxy were updated in current git master. Hope it clarifies usage.

Karlson2k / libmicrohttpd

fix issue with edge triggered(EPOLLET) read monitoring(EPOLLIN) that … #11