Closed thijsc closed 1 year ago
I was able to get a crash report using Valgrind: crash_report.txt
So it seems to crash here: https://github.com/edenhill/librdkafka/blob/master/src/rdkafka.c#L1105
Which doesn't really tell us much.
I'm not really familiar with your valgrind output, it typically shows the type (r/w) and address of the segmentation fault when I run it.
Try running a librdkafka built with --enable-devel and --disable-optimization and try to get it to crash when running in gdb.
Thanks for the hint. I'll see if I can run GDB in the CI with manual runs.
Any other information I can supply to help figure this out?
There are some more full-fledged reports on core dumps here: https://github.com/appsignal/rdkafka-ruby/issues/216#issuecomment-1275313644
At least one of the segfaults has this backtrace:
850/home/semaphore/.rbenv/versions/2.6.10/lib/libruby.so.2.6(rb_vm_bugreport) vm_dump.c:985 04:46
851/home/semaphore/.rbenv/versions/2.6.10/lib/libruby.so.2.6(rb_bug_context+0xd0) [0x7f5ea71c5530] error.c:609 04:46
852/home/semaphore/.rbenv/versions/2.6.10/lib/libruby.so.2.6(sigsegv+0x42) [0x7f5ea72dc412] signal.c:998 04:46
853/lib/x86_64-linux-gnu/libc.so.6(0x7f5ea6d56f10) [0x7f5ea6d56f10] 04:46
854/lib/x86_64-linux-gnu/libpthread.so.0(__GI___pthread_timedjoin_ex+0x28) [0x7f5ea68e4bd8] 04:46
855/home/semaphore/rdkafka-ruby/lib/rdkafka/../../ext/librdkafka.so(thrd_join+0x23) [0x7f5ea1cbc785] tinycthread.c:692 04:46
856/home/semaphore/rdkafka-ruby/lib/rdkafka/../../ext/librdkafka.so(rd_kafka_destroy_app+0x5a1) [0x7f5ea1bef4c7] rdkafka.c:1105 04:46
857/home/semaphore/rdkafka-ruby/lib/rdkafka/../../ext/librdkafka.so(rd_kafka_destroy+0x1d) [0x7f5ea1bef561] rdkafka.c:1118 04:46
858/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5ea2680dae] 04:46
859/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f5ea268071f]
It appears the pointer to the thread is null in this case?
Here's a core dump with debug symbols enabled of another crash: core.zip
Any chance you could take a look? Otherwise we could maybe get somebody from Confluent involved to help debug.
It'd be interesting to see a thread apply all bt full
output from one of these crashes
Core files are not very useful without their corresponding executable/binary.
We've been making some progress, seems like using a mutex to prevent polling and closing at the same time prevents at least one of the segfaults.
Yeah, you need to make sure there are no other simultaneous calls to rdkafka when you call .._consumer_close() and rd_kafka_destroy().
That’s good to know. That means we might have to wrap all access in a mutex. With Ruby it’s hard to guarantee everything happens in order.
Could there be a similar issue when handling assigns from a rebalance? Should I make sure that no poll is going on during that?
No that's fine, in fact rebalance callbacks are typically triggered from consumer_poll().
@thijsc in our case rebalance callbacks are ok, they never crashed.
Description
We've been chasing a segfault that sometimes occurs in the rdkafka Ruby gem when exiting:
I believe I've followed the termination instructions. Can you give any hints on where we could look next to make this robust?
How to reproduce
The CI build for rdkafka-ruby fails about 25% of the time with a segfault. Full log of a run here.
Checklist
Please provide the following information: