Closed dwest1975 closed 5 years ago
Thanks! Seems like a likely bug, though it could be in the Perl bindings. If you can provide any more help in creating a reproducer, such as the Perl library version in use and the core Perl version, or possibly even a small reproducer, it will go a long way toward helping us address this.
@dwest1975 , I recommend switching to the more modern, pure-Perl Gearman module:
@SpamapS, the Perl binding module in use is Gearman::XS (0.15). Perl version is 5.16.3. The binding module does not really tinker with the libgearman state; only public interfaces are used and exposed to Perl code.
I will try to narrow down the conditions required to reproduce the crash.
@esabol, thanks. But the Gearman module is not a suitable replacement due to a number of other issues.
Gearman::XS 0.15 was released in August 2013.
I have finally found the root of this issue. It took some time as it is not very frequent (but nonetheless quite annoying).
Basically, in gearman_connection_st::receiving there is a path where data is not received (or is received only in part), the connection is not closed (even though the receive timeout was reached), and recv_state is left set to GEARMAN_CON_RECV_UNIVERSAL_READ. As a result, the next call to this method fails with SIGSEGV because packet_arg is not properly initialized.
The cause of the early exit from gearman_connection_st::receiving is a timeout while waiting for a response from the gearmand server (due to overload or network delays). I suppose recv_state should be reset on any error other than an IO wait. I.e.:
--- libgearman/connection.cc.orig 2018-07-09 15:49:35.159410439 +0300
+++ libgearman/connection.cc 2018-07-09 16:18:48.687140243 +0300
@@ -1013,6 +1013,9 @@
       size_t recv_size= recv_socket(recv_buffer +recv_buffer_size, GEARMAN_RECV_BUFFER_SIZE -recv_buffer_size, ret);
       if (gearman_failed(ret))
       {
+        if (ret != GEARMAN_IO_WAIT) {
+          recv_state= GEARMAN_CON_RECV_UNIVERSAL_NONE;
+        }
         return NULL;
       }
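To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not libgearman code: `Connection`, `Packet`, `Universal`, `Status`, `RecvState`, and `retry_after_timeout_is_safe` are all hypothetical stand-ins, and the real null dereference is replaced by a check that returns `false`. It assumes the caller retries with a fresh packet after a timeout, which is what the back-trace suggests.

```cpp
#include <cstddef>

// Hypothetical stand-ins; none of these names come from libgearman.
enum class RecvState { None, Read };
enum class Status { Success, IoWait, Timeout };

struct Universal { int id = 0; };

struct Packet {
  Universal* universal = nullptr;  // set only when reception of a packet starts
};

class Connection {
 public:
  explicit Connection(bool reset_on_error) : reset_on_error_(reset_on_error) {}

  // Simulates one receive attempt. `socket_result` plays the role of the
  // status reported by the socket read. Returns false where real code would
  // have dereferenced a null pointer, true otherwise.
  bool receive(Packet& packet, Status socket_result, Status& ret) {
    if (state_ == RecvState::None) {
      packet.universal = &universal_;  // per-packet initialization
      state_ = RecvState::Read;
    }
    if (packet.universal == nullptr) {
      return false;  // stale Read state paired with an uninitialized packet
    }
    ret = socket_result;
    if (ret != Status::Success) {
      // The proposed fix: drop the in-progress state on hard errors, but
      // keep it on IoWait so a partial read can be resumed.
      if (reset_on_error_ && ret != Status::IoWait) {
        state_ = RecvState::None;
      }
      return true;
    }
    state_ = RecvState::None;  // packet complete, ready for the next one
    return true;
  }

  RecvState state() const { return state_; }

 private:
  bool reset_on_error_;
  RecvState state_ = RecvState::None;
  Universal universal_;
};

// Times out once, then retries with a fresh packet, as a caller might.
bool retry_after_timeout_is_safe(bool reset_on_error) {
  Connection con(reset_on_error);
  Status ret = Status::Success;
  Packet first;
  con.receive(first, Status::Timeout, ret);
  Packet second;
  return con.receive(second, Status::Success, ret);
}
```

In this sketch, `retry_after_timeout_is_safe(false)` reports the unsafe path and `retry_after_timeout_is_safe(true)` does not, while an `IoWait` result still leaves the state at `Read` so the same packet can be resumed.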
I hope this helps to fix this issue for everyone.
Nice! Open a pull request?
Yes, please do open a pull request and mention that it closes this issue. I'd be thrilled to merge it and cut a release. It would be great if you could figure out a way to reproduce and make a regression test too, but, I'll take the pony of having it fixed over the unicorn of having it fixed and unit tested. ;)
A PR for the issue is prepared. Unfortunately, the Travis CI OS X build fails: https://travis-ci.org/p-alik/gearmand/builds/427626409 I don't have a Mac to investigate the matter.
@SpamapS was working on macOS support, but I don't think it was finished. Can we just ignore that failure for now?
Yes please ignore OS X failures. I thought that was already disabled. Hrm.
Thank you, @dwest1975!
Hi,
libgearman 1.1.12 (RHEL 7)
Periodically my Perl worker dies with the following back-trace:
Basically the crash is caused by an attempt to access universal=0x0 here:
As confirmed by:
The exact conditions for the crash are unclear, but I was hoping you could point me in the right direction on how to debug this further. Or maybe the issue is trivial and could be fixed based on the back-trace alone.
NB. The client works in blocking mode.
TIA