libmemcached connections hang infinitely when rcv_timeout >= 1 million us

tl;dr: libmemcached connections hang indefinitely when rcv_timeout >= 1,000,000 usec (i.e. >= 1 sec)

Details:

When the :timeout or :recv_timeout option is set to >= 1 sec, strace reveals the following:

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
setsockopt(5, SOL_SOCKET, SO_RCVTIMEO, "\0\0\0\0\0\0\0\0@B\17\0\0\0\0\0", 16) = -1 EDOM (Numerical argument out of domain)
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl(5, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
connect(5, {sa_family=AF_INET, sin_port=htons(22422), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=5, events=POLLOUT}], 1, 1000) = 1 ([{fd=5, revents=POLLOUT}])
connect(5, {sa_family=AF_INET, sin_port=htons(22422), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

The man page for setsockopt says that EDOM is returned when the timeval is not a valid positive time interval:

 SO_RCVTIMEO is an option to set a timeout value for input operations.  It
 accepts a struct timeval parameter with the number of seconds and
 microseconds used to limit waits for input operations to complete.  In
 the current implementation, this timer is restarted each time additional
 data are received by the protocol, and thus the limit is in effect an
 inactivity timer.  If a receive operation has been blocked for this much
 time without receiving additional data, it returns with a short count or
 with the error EWOULDBLOCK if no data were received.  The struct timeval
 parameter must represent a positive time interval; otherwise,
 setsockopt() returns with the error EDOM.

Looking at the libmemcached C code, in memcached_connect.c, we noticed that rcv_timeout was set as follows:

if (ptr->root->rcv_timeout)
{
  int error;
  struct timeval waittime;

  waittime.tv_sec= 0;
  waittime.tv_usec= ptr->root->rcv_timeout;

  error= setsockopt(ptr->fd, SOL_SOCKET, SO_RCVTIMEO,
                    &waittime, (socklen_t)sizeof(struct timeval));
  WATCHPOINT_ASSERT(error == 0);
}

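which means that the timeval 'waittime' has an invalid value when rcv_timeout >= 1 sec: tv_usec must stay below 1,000,000, so setsockopt() rejects the value with EDOM and the receive timeout is never applied. This is a good example of why you should check the return status of a system call; not doing so means errors are silently ignored :)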
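To make the point about return values concrete, here is a minimal sketch of a wrapper that sets the timeout the same unnormalized way but hands the failure back to the caller. The function name set_rcv_timeout_unnormalized is hypothetical and is not part of libmemcached; it only illustrates surfacing errno instead of dropping it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper (not libmemcached API): fills the timeval the same
 * way as the snippet above, but reports failure instead of ignoring it.
 * Returns 0 on success, or the errno value left by setsockopt(). */
int set_rcv_timeout_unnormalized(int fd, long usec)
{
    struct timeval waittime;

    assert(usec >= 0);            /* negative timeouts are a caller bug */
    waittime.tv_sec = 0;
    waittime.tv_usec = usec;      /* out of range once usec >= 1000000 */

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO,
                   &waittime, (socklen_t)sizeof(waittime)) != 0)
        return errno;             /* e.g. EDOM, as seen in the strace output */

    return 0;
}
```

A caller can then log or abort on a non-zero result, rather than continuing with a socket that has no receive timeout at all.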

Note that this problem does not occur with connect_timeout, because it is passed to poll(), which expects a timeout in msec:

while (ptr->fd != -1 &&
       connect(ptr->fd, use->ai_addr, use->ai_addrlen) < 0)
{
  ptr->cachederrno= errno;
  if (errno == EINPROGRESS || /* nonblocking mode - first return, */
      errno == EALREADY) /* nonblocking mode - subsequent returns */
  {
    struct pollfd fds[1];
    fds[0].fd = ptr->fd;
    fds[0].events = POLLOUT;
    int error= poll(fds, 1, ptr->root->connect_timeout);

    if (error != 1 || fds[0].revents & POLLERR)
    {

The fix for rcv_timeout is:

if (ptr->root->rcv_timeout >= (1000 * 1000)) {
  waittime.tv_sec= ptr->root->rcv_timeout / (1000 * 1000);
  waittime.tv_usec= ptr->root->rcv_timeout % (1000 * 1000);
} else {
  waittime.tv_sec= 0;
  waittime.tv_usec= ptr->root->rcv_timeout;
}
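The same normalization can be sketched as a small standalone helper, which makes it easy to check in isolation. The name usec_to_timeval and the constant USEC_IN_SEC are assumptions made for illustration and do not exist in libmemcached. Integer division and modulo already give tv_sec = 0 for values under one second, so this version needs no separate branch.

```c
#include <assert.h>
#include <sys/time.h>

enum { USEC_IN_SEC = 1000000 };   /* hypothetical constant name */

/* Hypothetical helper (not libmemcached API): split a timeout given in
 * microseconds into whole seconds plus a remainder below one second, so
 * that tv_usec always stays in the range setsockopt() accepts. */
struct timeval usec_to_timeval(long usec)
{
    struct timeval tv;

    assert(usec >= 0);                  /* negative timeouts are a caller bug */
    tv.tv_sec = usec / USEC_IN_SEC;     /* whole seconds */
    tv.tv_usec = usec % USEC_IN_SEC;    /* leftover microseconds, < 1000000 */
    return tv;
}
```

For example, a timeout of 2500000 usec becomes tv_sec = 2 and tv_usec = 500000, and 999999 usec is left as tv_sec = 0 and tv_usec = 999999.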

A similar fix is needed for snd_timeout.

arthurnn / memcached

libmemcached connections hang infinitely when rcv_timeout >= 1 million us #67