awesomized / libmemcached

Resurrection of libmemcached
https://awesomized.github.io/libmemcached/
BSD 3-Clause "New" or "Revised" License
45 stars 26 forks source link

Bug: memcached_server_response_decrement can cause response_count to wrap to max_int #130

Closed jtickle closed 1 year ago

jtickle commented 1 year ago

The circumstances around this event are very rare but we have a high-traffic Moodle instance utilizing memcached and fairly regularly we end up in a situation where apache processes are held open indefinitely, ultimately resulting in systemic failure.

Using gdb I have tracked this down to a failure in the memcached_response function. I don't know how exactly the response_count() is passing 0 on decrement to wrap around to max value, and it's rare, but it happens often enough to notice.

Note line 812 of response.cc:

  if (memcached_is_binary(instance->root) == false
      and memcached_server_response_count(instance) > 1) {

and again just a few line down...

    while (memcached_server_response_count(instance) > 1) {
      memcached_return_t rc =
          _read_one_response(instance, buffer, buffer_length, junked_result_ptr);

Ultimately, that _read_one_response loop is where my server is getting stuck. Somehow, memcached_server_response_count(instance) is passing zero without being caught by this; and then, it's positive, so we sit in this loop forever.

Further down the call stack we sit in a poll function for 5s each time, which is where I started from in tracking down this issue. That call fails, loops back around to memcached_response, which decrements this absurdly high counter and starts the process over once again.

Here is what I was able to pull out of the backtrace from that poll function.

(gdb) bt
#0  0x00007f45c43d5ddd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f45c5d98c4f in poll (__timeout=5000, __nfds=1, __fds=0x7ffd1f9a7c08) at /usr/include/bits/poll2.h:46
#2  memcached_io_poll (inst=0x55a618b49778, events=1, prev_errno=0) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/io.cc:177
#3  0x00007f45c5d99099 in _io_fill (instance=<optimized out>) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/io.cc:393
#4  memcached_io_read (instance=instance@entry=0x55a618b49778, buffer=buffer@entry=0x7ffd1f9a7ff0, length=length@entry=1, nread=@0x7ffd1f9a7cc0: -1)
    at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/io.cc:464
#5  0x00007f45c5d99ac8 in memcached_io_read (nread=@0x7ffd1f9a7cc0: -1, length=1, buffer=0x7ffd1f9a7ff0, instance=0x55a618b49778)
    at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/io.cc:784
#6  memcached_io_readline (instance=0x55a618b49778, buffer_ptr=0x7ffd1f9a7ff0 "END\r\n 3031399949core_capabilities-d410703b54e63475f5ffbecef3d603bf 84 26831\r\n_get_devicetypeinuse", 
    size=size@entry=350, total_nr=@0x7ffd1f9a7d10: 0) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/io.cc:784
#7  0x00007f45c5d9bca5 in textual_read_one_response (result=0x7ffd1f9a7d80, buffer_length=350, 
    buffer=0x7ffd1f9a7ff0 "END\r\n 3031399949core_capabilities-d410703b54e63475f5ffbecef3d603bf 84 26831\r\n_get_devicetypeinuse", instance=0x55a618b49778)
    at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/response.cc:761
#8  _read_one_response (instance=instance@entry=0x55a618b49778, 
    buffer=buffer@entry=0x7ffd1f9a7ff0 "END\r\n 3031399949core_capabilities-d410703b54e63475f5ffbecef3d603bf 84 26831\r\n_get_devicetypeinuse", buffer_length=buffer_length@entry=350, 
    result=result@entry=0x7ffd1f9a7d80) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/response.cc:761
#9  0x00007f45c5d9c994 in memcached_response (instance=0x55a618b49778, 
    buffer=0x7ffd1f9a7ff0 "END\r\n 3031399949core_capabilities-d410703b54e63475f5ffbecef3d603bf 84 26831\r\n_get_devicetypeinuse", buffer_length=350, result=0x55a6184d8490)
    at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/response.cc:821
#10 0x00007f45c5d95eeb in mget_by_key_real (ptr=0x55a6184d83f0, group_key=<optimized out>, group_key_length=<optimized out>, keys=0x7f45b9661b00, key_length=0x7f45b9661af8, 
    number_of_keys=1, mget_mode=true) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/get.cc:191
#11 0x00007f45c5d96c6b in memcached_mget_by_key (shell=<optimized out>, group_key=<optimized out>, group_key_length=<optimized out>, keys=<optimized out>, key_length=<optimized out>, 
    number_of_keys=<optimized out>) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/get.cc:300
#12 0x00007f45a8210ed9 in php_memc_mget_apply.part.10 () from /usr/lib64/php/modules/memcached.so
#13 0x00007f45a8212b5f in php_memc_get_impl () from /usr/lib64/php/modules/memcached.so
#14 0x00007f45ba61ba6a in execute_ex () from /etc/httpd/modules/libphp7.so
#15 0x00007f45ba61c8d1 in zend_execute () from /etc/httpd/modules/libphp7.so
#16 0x00007f45ba595623 in zend_execute_scripts () from /etc/httpd/modules/libphp7.so
#17 0x00007f45ba533d80 in php_execute_script () from /etc/httpd/modules/libphp7.so
#18 0x00007f45ba61e670 in php_handler () from /etc/httpd/modules/libphp7.so
#19 0x000055a6170e4f90 in ap_run_handler (r=r@entry=0x55a6184cfd90) at config.c:169
#20 0x000055a6170e54d9 in ap_invoke_handler (r=r@entry=0x55a6184cfd90) at config.c:439
#21 0x000055a6170fa12a in ap_process_async_request (r=r@entry=0x55a6184cfd90) at http_request.c:339
#22 0x000055a6170fa412 in ap_process_request (r=r@entry=0x55a6184cfd90) at http_request.c:374
#23 0x000055a6170f65a2 in ap_process_http_sync_connection (c=0x55a618773930) at http_core.c:190
#24 ap_process_http_connection (c=0x55a618773930) at http_core.c:231
#25 0x000055a6170ee5c0 in ap_run_process_connection (c=c@entry=0x55a618773930) at connection.c:41
#26 0x000055a6170ee9d8 in ap_process_connection (c=c@entry=0x55a618773930, csd=<optimized out>) at connection.c:212
#27 0x00007f45c14697ef in child_main (child_num_arg=child_num_arg@entry=6) at prefork.c:707
#28 0x00007f45c1469a35 in make_child (s=0x55a618348370, slot=6) at prefork.c:810
#29 0x00007f45c146a6ce in perform_idle_server_maintenance (p=<optimized out>) at prefork.c:912
#30 prefork_run (_pconf=<optimized out>, plog=<optimized out>, s=<optimized out>) at prefork.c:1100
#31 0x000055a6170c934e in ap_run_mpm (pconf=pconf@entry=0x55a61831f158, plog=0x55a61834c378, s=0x55a618348370) at mpm_common.c:96
#32 0x000055a6170c20a6 in main (argc=2, argv=0x7ffd1f9aafc8) at main.c:783

And here is where I discovered the ever-decrementing response count:

(gdb) break memcached_response
Breakpoint 8 at 0x7f45c5d9c8d0: memcached_response. (3 locations)
(gdb) step
796       if ((instance->root->flags.no_block) and (memcached_is_processing_input(instance->root) == false))
(gdb) step                                                            

Breakpoint 8, memcached_response (instance=0x55a618b49778, buffer=0x7ffd1f9a7ff0 "END\r\n 3031399949core_capabilities-d410703b54e63475f5ffbecef3d603bf 84 26831\r\n_get_devicetypeinuse",
    buffer_length=350, result=0x55a6184d8490) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/response.cc:789
789     memcached_return_t memcached_response(memcached_instance_st *instance, char *buffer,                                                                                                
(gdb) step                                                                                                                                                                                  
812       if (memcached_is_binary(instance->root) == false                                                                                                                                  
(gdb) step                                                                                                                                                                                  
memcached_server_response_count (self=self@entry=0x55a618b49778) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/server.cc:181                     
181       if (self == NULL) {                                                                                                                                                               
(gdb) step                                                                                                                                                                                  
185       return self->cursor_active_;                                                                                                                                                      
(gdb) print self->cursor_active_                                                                                                                                                            
$5 = 4294936406                                                                               
(gdb) continue                                                                                                                                                                              
Continuing.                                                                                                                                                                                 

Breakpoint 8, memcached_response (instance=instance@entry=0x55a618b49778,                                                                                                                   
    buffer=buffer@entry=0x7ffd1f9a7ff0 "END\r\n 3031399949core_capabilities-d410703b54e63475f5ffbecef3d603bf 84 26831\r\n_get_devicetypeinuse", buffer_length=buffer_length@entry=350,      
    result=result@entry=0x55a6184d8490) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/response.cc:791
791       if (memcached_is_udp(instance->root)) {                                                                                                                                           
(gdb) step                                                                                                                                                                                  
796       if ((instance->root->flags.no_block) and (memcached_is_processing_input(instance->root) == false))                                                                                
(gdb) step                                                                                                                                                                                  

Breakpoint 8, memcached_response (instance=0x55a618b49778, buffer=0x7ffd1f9a7ff0 "END\r\n 3031399949core_capabilities-d410703b54e63475f5ffbecef3d603bf 84 26831\r\n_get_devicetypeinuse",   
    buffer_length=350, result=0x55a6184d8490) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/response.cc:789                                      
789     memcached_return_t memcached_response(memcached_instance_st *instance, char *buffer,
(gdb) step                                                                
812       if (memcached_is_binary(instance->root) == false                  
(gdb) step                                                                          
memcached_server_response_count (self=self@entry=0x55a618b49778) at /usr/src/debug/libmemcached-03c0a641f8dbd5cc8a0d9ac62f7e8e2b00f45159/src/libmemcached/server.cc:181
181       if (self == NULL) {                                              
(gdb) step                                                                         
185       return self->cursor_active_;                                                
(gdb) print self->cursor_active_                                                                                                                                                            
$6 = 4294936405

Due to high-traffic, we will definitely see this issue again, although I have no idea how to re-create the conditions to reproduce it. Happy to help if I can.

jtickle commented 1 year ago

I was just looking into making a pull request and see that this is fixed in #124 . Please merge this PR!

m6w6 commented 1 year ago

Thank you for verifying!