Ah, interesting bug. See #27 for why we can't upgrade.
Hrm, have those perf tests been rerun against the newer releases -- it appears they are up to 1.0.5? Any idea where the performance gap comes from, so that an issue can be filed upstream?
I haven't re-run it recently...would be worth a look.
At the time the regression was from them deciding to dynamically route (in C!) every function dispatch through a function pointer, as well as some other minor things. Speed has not consistently been a priority for the project.
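Just to illustrate the kind of indirection meant here, a made-up sketch (not libmemcached's actual code):

```c
/* Made-up sketch, not libmemcached's actual code: routing every operation
 * through a function-pointer table adds an indirect call per dispatch and
 * defeats inlining, compared with calling the function directly. */
#include <stddef.h>

typedef size_t (*write_fn)(const char *buf, size_t len);

struct io_ops
{
  write_fn do_write;              /* hypothetical dispatch slot */
};

size_t send_data(const struct io_ops *ops, const char *buf, size_t len)
{
  return ops->do_write(buf, len); /* indirect call on every dispatch */
}
```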
Any progress on a backport of the fix?
No, I haven't had a chance to backport the fix. This is partly because I wasn't able to identify the actual commit that fixed it -- quite likely due to my lack of bzr skills.
Not comprehensive, but this should work for (2), treating poll timeouts as a permanent failure:
```diff
diff --git a/ext/libmemcached-0.32/libmemcached/memcached_io.c b/ext/libmemcached-0.32/libmemcached/memcached_io.c
index 008e663..56d1962 100644
--- a/ext/libmemcached-0.32/libmemcached/memcached_io.c
+++ b/ext/libmemcached-0.32/libmemcached/memcached_io.c
@@ -416,7 +416,10 @@ static ssize_t io_flush(memcached_server_st *ptr,
       memcached_return rc;
       rc= io_wait(ptr, MEM_WRITE);
 
-      if (rc == MEMCACHED_SUCCESS || rc == MEMCACHED_TIMEOUT)
+      if (rc == MEMCACHED_TIMEOUT) {
+        *error= MEMCACHED_TIMEOUT;
+        return -1;
+      } else if (rc == MEMCACHED_SUCCESS)
         continue;
 
       memcached_quit_server(ptr, 1);
```
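For clarity, here's roughly what that block of io_flush() looks like with the hunk applied (surrounding context approximated from the diff, so the exact lines around it may differ):

```c
/* io_flush() with the hunk above applied (context approximated from the
 * diff): a poll timeout now propagates as an error instead of silently
 * retrying the write forever. */
memcached_return rc;
rc= io_wait(ptr, MEM_WRITE);

if (rc == MEMCACHED_TIMEOUT) {
  *error= MEMCACHED_TIMEOUT;      /* surface the timeout to the caller */
  return -1;                      /* abort the flush loop */
} else if (rc == MEMCACHED_SUCCESS)
  continue;                       /* socket writable again; retry the write */

memcached_quit_server(ptr, 1);    /* any other failure: drop the connection */
```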
@tmm1 Yeah, that should address (2).
I think this is also fixed now, right?
This was not merged. The test suite is green with the patch above, but I haven't really tested it beyond that.
Possibly already fixed by other changes; will close until we get a reproducible case.
Cool, https://github.com/evan/memcached/issues/87#issuecomment-10700375 got merged then? /cc @tmm1
No, but I think #117 covers it.
We appear to have hit a scenario that can cause the memcached gem to get stuck in an infinite loop blocking on the write buffer. Regardless of how low we set our timeout settings, the client will not return from a write operation. In production this leads to a web request timing out and our Unicorn process killing the worker process handling the request.
I've only done a cursory examination, but from what I can tell this is a bug in the version of the underlying libmemcached client library shipped with the memcached gem. From reading the code, newer versions of libmemcached appear to have fixed it.
We are connecting to memcached using the following client options:
What we notice is that the client will block inside a memcached client operation. It is my belief that this happens when the write side of the connection blocks, either because a) the client is not draining its receive buffer while filling the send buffer, or b) the write socket is simply blocked. We believe we are hitting a) in production, but I can easily simulate b) by firewalling packets to the memcached server. What I see is the client blocked with:
Stepping through the code with gdb, we see the following steps:
In io_flush(), the following code is hit (memcached_io.c:389):
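(The snippet itself isn't reproduced above; reconstructed from the '-' side of the backport diff shown earlier in the thread -- quite possibly the same EAGAIN branch that diff touches -- it is essentially the following.)

```c
/* Reconstructed from the unpatched lines of the diff above (the shipped
 * 0.32 source may differ slightly): a poll timeout is treated the same as
 * success, so the loop just goes around again. */
memcached_return rc;
rc= io_wait(ptr, MEM_WRITE);

if (rc == MEMCACHED_SUCCESS || rc == MEMCACHED_TIMEOUT)
  continue;   /* EAGAIN + timeout -> retry indefinitely while the socket stays blocked */

memcached_quit_server(ptr, 1);
```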
The `continue` will loop back around and attempt the write again, which will return EAGAIN and loop back to poll() again. So, to break out of this loop, one or both of the following conditions need to happen here:
Looking at version 0.50 of the libmemcached client code, it appears that they actually do both of these steps (from libmemcached/io.cc:732):
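(The upstream snippet isn't reproduced here, but roughly paraphrased -- this is not the literal io.cc source -- the poll handling in newer releases no longer treats a timeout as success:)

```c
/* Rough paraphrase of the newer handling, not the literal io.cc source:
 * a timed-out poll is no longer folded into the success path. */
memcached_return_t rc= io_wait(ptr, MEM_WRITE);

if (rc == MEMCACHED_SUCCESS)
  continue;                    /* writable again: retry the write */

if (rc == MEMCACHED_TIMEOUT)
{
  *error= MEMCACHED_TIMEOUT;   /* report the timeout instead of spinning */
  return -1;
}

memcached_quit_server(ptr, 1); /* other failures: tear down the connection */
*error= rc;
return -1;
```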
My recommendation would be to upgrade from version 0.32 -> 0.50 of the libmemcached client in the gem. The other option would be backporting the appropriate fixes.