Gem can block indefinitely on write ops

mheffner commented 12 years ago

We appear to have hit a scenario that can cause the memcached gem to get stuck in an infinite loop blocking on the write buffer. Regardless of how low we set our timeout settings, the client will not return from a write operation. In production this leads to a web request timing out and our Unicorn process killing the worker process handling the request.

I've only done a cursory examination, but from what I can tell this is a bug in the underlying libmemcached client library version shipped with the memcached gem. From a code examination, newer versions of the libmemcached client appear to have fixed this bug.

We are connecting to memcached using the following client options:

Memcached.new(uri,  {
  :default_ttl => 14400,
  :no_block => true,
  :buffer_requests => true,
  :noreply => true,
  :binary_protocol => false,
  :cache_lookups => false,
  :retry_timeout => 30,
  :connect_timeout => 2
})

What we notice is that the client will block in a memcached client operation. It is my believe that this happens when the write-side of the connection blocks either due to a) the client not draining its receive buffer while filling the send buffer or b) from the write socket simply being blocked. We believe that we are hitting a) in production, but I can easily simulate b) by firewalling packets to the memcached server. What I see is the client block with:

#0  poll () from /lib64/libc.so.6
#1  io_wait () at memcached_io.c:49
#2  io_flush () at memcached_io.c:389
#3  memcached_io_write () at memcached_io.c:241
#4  memcached_send () at memcached_storage.c:163

Stepping through the code with gdb we see the follow steps:

In io_flush() the write() returns EAGAIN, which prompts the io_wait() call.
io_wait() calls poll() with a timeout of 250 ms.
poll() times out, prompting io_wait() to return MEMCACHED_TIMEOUT

In io_flush(), the following code is hit (memcached_io.c:389):

   rc= io_wait(ptr, MEM_WRITE);

   if (rc == MEMCACHED_SUCCESS || rc == MEMCACHED_TIMEOUT)
     continue;

The continue will loop back around and attempt the write again, which will return EAGAIN and loop back to poll() again.

So, to break out of this one or both of the following conditions need to happen here:

The client attempts to read from the read side of the socket, thereby unblocking the server-side writer allowing it to drain its read-side of the socket.
Treat a timeout from poll() (MEMCACHED_TIMEOUT) as a permanent failure and bail from this loop.

Looking at version 0.50 of the libmemcached client code, it appears that they actually do both of these steps (from libmemcached/io.cc:732):

        {
          /*
           * We may be blocked on write because the input buffer
           * is full. Let's check if we have room in our input
           * buffer for more data and retry the write before
           * waiting..
         */
          if (repack_input_buffer(ptr) or
              process_input_buffer(ptr))
          {
            continue;
          }

          memcached_return_t rc= io_wait(ptr, MEM_WRITE);
          if (memcached_success(rc))
          {
            continue;
          }
          else if (rc == MEMCACHED_TIMEOUT)
          {
            *error= memcached_set_error(*ptr, MEMCACHED_TIMEOUT, MEMCACHED_AT);
            return -1;
          }

My recommendation would be to upgrade from version 0.32 -> 0.50 of the libmemcached client in the gem. The other option would be backporting the appropriate fixes.

evan commented 12 years ago

Ah, interesting bug. See #27 for why we can't upgrade.

mheffner commented 12 years ago

Hrm, have those perf test been rerun against the newer releases -- appears they are up to 1.0.5? Any idea what the performance gap is from so that an issue can be filed upstream?

evan commented 12 years ago

I haven't re-run it recently...would be worth a look.

At the time the regression was from them deciding to dynamically route (in C!) every function dispatch through a function pointer, as well as some other minor things. Speed has not consistently been a priority for the project.

evan commented 12 years ago

Any progress on a backport of the fix?

mheffner commented 12 years ago

No, I haven't had a chance to backport the fix. This is partly because I wasn't able to identify the actual commit that fixed it -- quite likely due to my lack of bzr skills.

tmm1 commented 11 years ago

Not comprehensive, but this should work for (2) and treat poll timeouts as a permanent failure:

diff --git a/ext/libmemcached-0.32/libmemcached/memcached_io.c b/ext/libmemcached-0.32/libmemcached/memcached_io.c
index 008e663..56d1962 100644
--- a/ext/libmemcached-0.32/libmemcached/memcached_io.c
+++ b/ext/libmemcached-0.32/libmemcached/memcached_io.c
@@ -416,7 +416,10 @@ static ssize_t io_flush(memcached_server_st *ptr,
         memcached_return rc;
         rc= io_wait(ptr, MEM_WRITE);

-        if (rc == MEMCACHED_SUCCESS || rc == MEMCACHED_TIMEOUT)
+        if (rc == MEMCACHED_TIMEOUT) {
+          *error= MEMCACHED_TIMEOUT;
+          return -1;
+        } else if (rc == MEMCACHED_SUCCESS)
           continue;

         memcached_quit_server(ptr, 1);

mheffner commented 11 years ago

@tmm1 Yeah, that should address (2).

evan commented 11 years ago

I think this is also fixed now, right?

tmm1 commented 11 years ago

This was not merged. The tests are green with the patch above, but I haven't really tested it.

evan commented 11 years ago

Possibly already fixed by other changes; will close until we get a reproducible case.

mheffner commented 11 years ago

Cool, https://github.com/evan/memcached/issues/87#issuecomment-10700375 got merged then? /cc @tmm1

evan commented 11 years ago

No, but I think #117 covers it.

arthurnn / memcached

Gem can block indefinitely on write ops #87