gearman / gearmand

http://gearman.org/
Other
738 stars 138 forks source link

SIGSEGV in gearman_connection_st::receiving #155

Closed dwest1975 closed 5 years ago

dwest1975 commented 6 years ago

Hi,

libgearman 1.1.12 (RHEL 7)

Periodically my Perl worker dies with the following back-trace:

#0  0x00007f7a9b9f5fcb in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00007f7a9306d5c4 in skgesigOSCrash () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#2  0x00007f7a93344de6 in kpeDbgSignalHandler () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#3  0x00007f7a9306d7b5 in skgesig_sigactionHandler () from /usr/lib/oracle/12.1/client64/lib/libclntsh.so.12.1
#4  <signal handler called>
#5  gearman_real_malloc (allocator=..., size=50, func=0x7f7a9504b1df <gearman_connection_st::receiving(gearman_packet_st&, gearman_return_t&, bool)::__func__> "receiving",
    file=0x7f7a9504ab8a "libgearman/connection.cc", line=1037) at libgearman/allocator.cc:52
#6  0x00007f7a9503a4ef in gearman_connection_st::receiving (this=0x73ad970, packet_arg=..., ret=@0x7ffd44473d54: GEARMAN_SUCCESS, recv_data=recv_data@entry=true)
    at libgearman/connection.cc:1037
#7  0x00007f7a95044ef0 in gearman_worker_grab_job (worker_shell=<optimized out>, job=job@entry=0x0, ret_ptr=ret_ptr@entry=0x7ffd44473d54) at libgearman/worker.cc:829
#8  0x00007f7a95045f47 in gearman_worker_work (worker_shell=<optimized out>) at libgearman/worker.cc:1089
#9  0x00007f7a952678c8 in XS_Gearman__XS__Worker_work () from /usr/lib64/perl5/vendor_perl/auto/Gearman/XS/XS.so
#10 0x00007f7a9ca322df in Perl_pp_entersub () from /usr/lib64/perl5/CORE/libperl.so
#11 0x00007f7a9ca2aa56 in Perl_runops_standard () from /usr/lib64/perl5/CORE/libperl.so
#12 0x00007f7a9c9c7865 in perl_run () from /usr/lib64/perl5/CORE/libperl.so
#13 0x0000000000400d99 in main ()

Basically the crash is caused by an attempt to access universal=0x0 here:

(gdb) f 5
#5  gearman_real_malloc (allocator=..., size=50, func=0x7f7a9504b1df <gearman_connection_st::receiving(gearman_packet_st&, gearman_return_t&, bool)::__func__> "receiving", 
    file=0x7f7a9504ab8a "libgearman/connection.cc", line=1037) at libgearman/allocator.cc:52
52        if (allocator.malloc)
(gdb) list
47      using namespace org::tangent;
48
49      void *gearman_real_malloc(gearman_allocator_t& allocator, size_t size, const char *func, const char *file, int line)
50      {
51        void *ptr;
52        if (allocator.malloc)
53        {
54          ptr= allocator.malloc(size, allocator.context);
55        }
56        else
(gdb) f 6
#6  0x00007f7a9503a4ef in gearman_connection_st::receiving (this=0x73ad970, packet_arg=..., ret=@0x7ffd44473d54: GEARMAN_SUCCESS, recv_data=recv_data@entry=true)
    at libgearman/connection.cc:1037
1037        packet_arg.data= gearman_malloc((*packet_arg.universal), packet_arg.data_size);
(gdb) list
1032          recv_state= GEARMAN_CON_RECV_STATE_READ_DATA;
1033          break;
1034        }
1035
1036        assert(packet_arg.universal);
1037        packet_arg.data= gearman_malloc((*packet_arg.universal), packet_arg.data_size);
1038        if (packet_arg.data == NULL)
1039        {
1040          ret= gearman_error(universal, GEARMAN_MEMORY_ALLOCATION_FAILURE, "gearman_malloc((*packet_arg.universal), packet_arg.data_size)");
1041          close_socket();

As confirmed by:

(gdb) p packet_arg 
$134 = (gearman_packet_st &) @0xe12e860: {options = {is_allocated = false, complete = false, free_data = false}, magic = GEARMAN_MAGIC_RESPONSE, 
  command = GEARMAN_COMMAND_JOB_ASSIGN_ALL, argc = 4 '\004', args_size = 103, data_size = 50, universal = 0x0, next = 0x0, prev = 0x0, args = 0xe12e928 "", data = 0x0, arg = {
    0xe12e934 "H:xxxxxxx.xxxxxxxxxx.net:1172121658", 0xe12e958 "admin::vd_notify", 0xe12e969 "5dbcb214-084b-11e8-91fb-b82a72dca78b", 0xe12e98e "", 0x0, 0x0, 0x0, 0x0}, arg_size = {36, 
    17, 37, 1, 0, 0, 0, 0}, 
  args_buffer = "\000RES\000\000\000(\000\000\000\215H:xxxxxxx.xxxxxxxxxx.net:1172121658\000admin::vd_notify\000\065dbcb214-084b-11e8-91fb-b82a72dca78b\000\000-b82a72dc9def", '\000' <repeats 11 times>}
(gdb) p ((Worker*)0x738b140).con
$135 = (gearman_connection_st *) 0x73ad970
(gdb) p *((Worker*)0x738b140).con
$136 = {options = {server_options_sent = true, identifier_sent = true, ready = true, packet_in_use = false}, state = 3, send_state = 0, recv_state = 1, _events = 0, _revents = 1, 
  fd = 8, _ssl = 0x0, cached_errno = 0, created_id = 0, created_id_next = 0, send_buffer_size = 0, send_data_size = 0, send_data_offset = 0, recv_buffer_size = 218, 
  recv_data_size = 50, recv_data_offset = 0, universal = @0x738b1a0, next = 0x73a92a0, prev = 0x0, context = 0x0, _addrinfo = 0x186a160, addrinfo_next = 0x186a160, 
  send_buffer_ptr = 0x73adf79 "", recv_buffer_ptr = 0x73affe0 "{\"i_account\":\"1889389\",\"t_start\":1517598042.30168}", _packet = {options = {is_allocated = false, complete = false, 
      free_data = false}, magic = GEARMAN_MAGIC_TEXT, command = GEARMAN_COMMAND_TEXT, argc = 0 '\000', args_size = 0, data_size = 0, universal = 0x0, next = 0x0, prev = 0x0, 
    args = 0x0, data = 0x0, arg = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, arg_size = {0, 0, 0, 0, 0, 0, 0, 0}, args_buffer = '\000' <repeats 127 times>}, 
  _host = "10.71.10.13", '\000' <repeats 1013 times>, _service = "4730", '\000' <repeats 27 times>, 
  send_buffer = "\000REQ\000\000\000'\000\000\000\000H:xxxxxxx.xxxxxxxxxx.net:1172121596\000{\"t_start\":1517598027.45778,\"status\":0,\"i_customer\":\"176396\"}\"notify\":0}", '\000' <repeats 8071 times>, 
  recv_buffer = "\000RES\000\000\000(\000\000\000\215H:xxxxxxx.xxxxxxxxxx.net:1172121658\000admin::vd_notify\000\065dbcb214-084b-11e8-91fb-b82a72dca78b\000\000{\"i_account\":\"1889389\",\"t_start\":1517598042.30168}\000RES\000\000\000(\000\000\000\234H:xxxxxxx.xxxxxxxxxx.net:1172121655"..., _recv_packet = 0xe12e860}

The exact conditions for the crash are unclear, but I was hoping you could point me in the right direction how to debug this further. Or maybe the issue is trivial and could be fixed just based on the back-trace.

NB. The client works in blocking mode.

TIA

SpamapS commented 6 years ago

Thanks! Seems like a likely bug, though it could be in the perl bindings. If you can provide any more help in creating a reproducer, such as the perl library version in use and core perl version, possibly even a small reproducer, it will go along way to helping us address this.

esabol commented 6 years ago

@dwest1975 , I recommend switching to the more modern, pure-Perl Gearman module:

http://search.cpan.org/~palik/Gearman/lib/Gearman/Worker.pm

dwest1975 commented 6 years ago

@SpamapS, the Perl binding module in use is Gearman::XS (0.15). Perl version is 5.16.3. The binding module does not really tinker with the libgearman state, only public interfaces are used and exposed to Perl code.

I will try to narrow down the conditions required to reproduce the crash.

@esabol, thanks. But Gearman module is not a suitable replacement due to a number of other issues.

p-alik commented 6 years ago

Gearman::XS 5.16.3 released in August 2013.

dwest1975 commented 6 years ago

I have finally found the root of this issue. It took some time as it is not very frequent (but nonetheless quite annoying).

Basically in gearman_connection_st::receiving there is a path, where data is not received (or received only in part), connection is not closed (thought the receive timeout was reached) and recv_state is set to GEARMAN_CON_RECV_UNIVERSAL_READ. This results in subsequent call of this method to fail with SIGSEGV as the packet_arg is not properly initialized.

The cause of an early exit from gearman_connection_st::receiving is timeout in getting response from the gearmand server (due to overload or network delays). I suppose the recv_state should be reset on non-IO wait errors. I.e.:

--- libgearman/connection.cc.orig       2018-07-09 15:49:35.159410439 +0300
+++ libgearman/connection.cc    2018-07-09 16:18:48.687140243 +0300
@@ -1013,6 +1013,9 @@
       size_t recv_size= recv_socket(recv_buffer +recv_buffer_size, GEARMAN_RECV_BUFFER_SIZE -recv_buffer_size, ret);
       if (gearman_failed(ret))
       {
+        if (ret != GEARMAN_IO_WAIT) {
+          recv_state= GEARMAN_CON_RECV_UNIVERSAL_NONE;
+        }
         return NULL;
       }

I hope this helps to fix this issue for everyone.

esabol commented 6 years ago

Nice! Open a pull request?

SpamapS commented 6 years ago

Yes, please do open a pull request and mention that it closes this issue. I'd be thrilled to merge it and cut a release. It would be great if you could figure out a way to reproduce and make a regression test too, but, I'll take the pony of having it fixed over the unicorn of having it fixed and unit tested. ;)

Quoting Ed Sabol (2018-07-09 09:19:45)

Nice! Open a pull request?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.*

p-alik commented 6 years ago

PR for the issue is prepared. Unfortunately travis' ci OSX build fails. https://travis-ci.org/p-alik/gearmand/builds/427626409 I haven't Mac OS to investigate the matter.

esabol commented 6 years ago

@SpamapS was working on macOS support, but I don't think it was finished. Can we just ignore that failure for now?

SpamapS commented 6 years ago

Yes please ignore OS X failures. I thought that was already disabled. Hrm.

p-alik commented 5 years ago

170 resolved the issue.

Thank you, @dwest1975!