thijsc commented 2 years ago

Description

We've been chasing a segfault that sometimes occurs in the rdkafka Ruby gem when exiting:

/home/semaphore/rdkafka-ruby/lib/rdkafka/native_kafka.rb:44: [BUG] Segmentation fault at 0x00007f4f0dffb9d0
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0062 p:---- s:0333 e:000332 CFUNC  :rd_kafka_destroy
c:0061 p:0044 s:0328 e:000327 METHOD /home/semaphore/rdkafka-ruby/lib/rdkafka/native_kafka.rb:44
c:0060 p:0003 s:0323 e:000322 BLOCK  /home/semaphore/rdkafka-ruby/lib/rdkafka/native_kafka.rb:29 [FINISH]
c:0059 p:---- s:0319 e:000318 CFUNC  :fork
c:0058 p:0022 s:0315 e:000314 BLOCK  /home/semaphore/rdkafka-ruby/spec/rdkafka/producer_spec.rb:461 [FINISH]
c:0057 p:---- s:0307 e:000306 CFUNC  :instance_exec

I believe I've followed the termination instructions. Can you give any hints on where we could look next to make this robust?

How to reproduce

The CI build for rdkafka-ruby fails about 25% of the time with a segfault. Full log of a run here.

Checklist

Please provide the following information:

[x] librdkafka version 1.9.2
[x] Apache Kafka version: Confluent 5.2.5
[x] librdkafka client configuration: Multiple configuration used
[x] Operating system: Ubuntu 18.04
[x] Provide logs - not relevant probably
[x] Provide broker log excerpts - not relevant probably
[ ] Critical issue

thijsc commented 2 years ago

I was able to get a crash report using Valgrind: crash_report.txt

edenhill commented 2 years ago

So it seems to crash here: https://github.com/edenhill/librdkafka/blob/master/src/rdkafka.c#L1105

Which doesn't really tell us much.

I'm not really familiar with your valgrind output, it typically shows the type (r/w) and address of the segmentation fault when I run it.

Try running a librdkafka built with --enable-devel and --disable-optimization and try to get it to crash when running in gdb.

thijsc commented 2 years ago

Thanks for the hint. I'll see if I can run GDB in the CI with manual runs.

thijsc commented 2 years ago

I've managed to get a core dump, is this helpful?

core.zip

thijsc commented 2 years ago

Any other information I can supply to help figure this out?

thijsc commented 2 years ago

There are some more full-fledged reports on core dumps here: https://github.com/appsignal/rdkafka-ruby/issues/216#issuecomment-1275313644

thijsc commented 2 years ago

At least one of the segfaults has this backtrace:

850/home/semaphore/.rbenv/versions/2.6.10/lib/libruby.so.2.6(rb_vm_bugreport) vm_dump.c:985 04:46
851/home/semaphore/.rbenv/versions/2.6.10/lib/libruby.so.2.6(rb_bug_context+0xd0) [0x7f5ea71c5530] error.c:609 04:46
852/home/semaphore/.rbenv/versions/2.6.10/lib/libruby.so.2.6(sigsegv+0x42) [0x7f5ea72dc412] signal.c:998 04:46
853/lib/x86_64-linux-gnu/libc.so.6(0x7f5ea6d56f10) [0x7f5ea6d56f10] 04:46
854/lib/x86_64-linux-gnu/libpthread.so.0(__GI___pthread_timedjoin_ex+0x28) [0x7f5ea68e4bd8] 04:46
855/home/semaphore/rdkafka-ruby/lib/rdkafka/../../ext/librdkafka.so(thrd_join+0x23) [0x7f5ea1cbc785] tinycthread.c:692 04:46
856/home/semaphore/rdkafka-ruby/lib/rdkafka/../../ext/librdkafka.so(rd_kafka_destroy_app+0x5a1) [0x7f5ea1bef4c7] rdkafka.c:1105 04:46
857/home/semaphore/rdkafka-ruby/lib/rdkafka/../../ext/librdkafka.so(rd_kafka_destroy+0x1d) [0x7f5ea1bef561] rdkafka.c:1118 04:46
858/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f5ea2680dae] 04:46
859/usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f5ea268071f]

It appears the pointer to the thread is null in this case?

thijsc commented 2 years ago

Here's a core dump with debug symbols enabled of another crash: core.zip

Any chance you could take a look? Otherwise we could maybe get somebody from Confluent involved to help debug.

edenhill commented 2 years ago

It'd be interesting to see a thread apply all bt full output from one of these crashes

edenhill commented 2 years ago

Core files are not very useful without their corresponding executable/binary.

thijsc commented 2 years ago

We've been making some progress, seems like using a mutex to prevent polling and closing at the same time prevents at least one of the segfaults.

edenhill commented 2 years ago

Yeah, you need to make sure there are no other simultaneous calls to rdkafka when you call .._consumer_close() and rd_kafka_destroy().

thijsc commented 2 years ago

That’s good to know. That means we might have to wrap all access in a mutex. With Ruby it’s hard to guarantee everything happens in order.

thijsc commented 2 years ago

Could there be a similar issue when handling assigns from a rebalance? Should I make sure that no poll is going on during that?

edenhill commented 2 years ago

No that's fine, in fact rebalance callbacks are typically triggered from consumer_poll().

mensfeld commented 2 years ago

@thijsc in our case rebalance callbacks are ok, they never crashed.

confluentinc / librdkafka

Chasing a segfault during termination in rdkafka-ruby #3981

Description

How to reproduce

Checklist