Open andreamartini opened 12 months ago
Maybe try with the latest release of 2.3.0 first
Thank you for your answer. I get the same problem with 2.3.0 version.
Lately, we have also encountered this pthread_join
deadlock issue since we upgraded from librdkafka version 1 to 2. It did not happen frequently, only something like once in several hundreds of API calls. However, the backtrace is always similar:
(gdb) bt 4
#0 0x00007fb495bda4a7 in __pthread_clockjoin_ex ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fb495be57aa in thrd_join ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x0000560106309f68 in rd_kafka_destroy_app ()
#3 0x0000560106309ff4 in rd_kafka_destroy ()
(More stack frames follow...)
Seems like it is waiting for some thread to join, but either the target thread refused to stop or it did not exist. (blocked for over an hour!)
Some more Gdb context:
(gdb) fr 3
#3 0x0000560106309ff4 in rd_kafka_destroy ()
(gdb) x/3i rd_kafka_destroy+21
0x560106309fec <rd_kafka_destroy+21>: mov rdi,rax
0x560106309fef <rd_kafka_destroy+24>:
call 0x5601063099e7 <rd_kafka_destroy_app>
=> 0x560106309ff4 <rd_kafka_destroy+29>: nop
(gdb) fr 2
#2 0x0000560106309f68 in rd_kafka_destroy_app ()
(gdb) x/3i rd_kafka_destroy_app+1401
0x560106309f60 <rd_kafka_destroy_app+1401>: mov rdi,rax
0x560106309f63 <rd_kafka_destroy_app+1404>:
call 0x56010627a0c0 <thrd_join@plt>
=> 0x560106309f68 <rd_kafka_destroy_app+1409>: test eax,eax
(gdb) fr 1
#1 0x00007fb495be57aa in thrd_join ()
from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) x/3i thrd_join+34
0x7fb495be57a2 <thrd_join+34>: mov rsi,rsp
0x7fb495be57a5 <thrd_join+37>:
call 0x7fb495bda270 <__pthread_clockjoin_ex>
=> 0x7fb495be57aa <thrd_join+42>: test rbx,rbx
(gdb) x/3i __pthread_clockjoin_ex+563
0x7fb495bda4a3 <__pthread_clockjoin_ex+563>: mov eax,ebx
0x7fb495bda4a5 <__pthread_clockjoin_ex+565>: syscall
=> 0x7fb495bda4a7 <__pthread_clockjoin_ex+567>: mov edi,r8d
frame 2
. It turns out that the blocking thread is "rdk:main", and here's its backtrace:(gdb) bt
#0 0x00007fb495bda4a7 in __pthread_clockjoin_ex ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fb495be57aa in thrd_join ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x000056010630a6b0 in rd_kafka_destroy_internal ()
#3 0x000056010630eac1 in rd_kafka_thread_main ()
#4 0x00007fb495bd8f5a in start_thread ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007fb4955a8a2f in clone ()
from /lib/x86_64-linux-gnu/libc.so.6
Looks like it is waiting for another thread!
Using similar methods, I discovered the pthread
struct from frame 2, where $rdi = *(*($rbp-0x20))
:
(gdb) x/7i rd_kafka_destroy_internal+1664
0x56010630a69a <rd_kafka_destroy_internal+1664>:
mov rax,QWORD PTR [rbp-0x20]
0x56010630a69e <rd_kafka_destroy_internal+1668>:
mov rax,QWORD PTR [rax]
0x56010630a6a1 <rd_kafka_destroy_internal+1671>:
lea rdx,[rbp-0x68]
0x56010630a6a5 <rd_kafka_destroy_internal+1675>:
mov rsi,rdx
0x56010630a6a8 <rd_kafka_destroy_internal+1678>:
mov rdi,rax
0x56010630a6ab <rd_kafka_destroy_internal+1681>:
call 0x56010627a0c0 <thrd_join@plt>
=> 0x56010630a6b0 <rd_kafka_destroy_internal+1686>:
mov rax,QWORD PTR [rbp-0x20]
This time the target thread "rdk:broker237" does not seem to be waiting for someone else, but instead in conditional wait:
(gdb) bt
#0 0x00007fb495bdfad8 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fb495be5b39 in cnd_timedwait ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x0000560106465256 in cnd_timedwait_abs ()
#3 0x000056010639a492 in rd_kafka_q_pop_serve ()
#4 0x000056010639a532 in rd_kafka_q_pop ()
#5 0x000056010632f777 in rd_kafka_broker_ops_serve ()
#6 0x000056010632f8bc in rd_kafka_broker_ops_io_serve ()
#7 0x00005601063313fa in rd_kafka_broker_consumer_serve ()
#8 0x0000560106331682 in rd_kafka_broker_serve ()
#9 0x00005601063318d6 in rd_kafka_broker_thread_main ()
#10 0x00007fb495bd8f5a in start_thread ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007fb4955a8a2f in clone ()
from /lib/x86_64-linux-gnu/libc.so.6
So perhaps the main handler failed to wake this thread up from conditional wait?
I'm not familiar with the code base, but I hope the above analysis can bring some insight into why it can be blocking on rd_kafka_destroy
.
Description
I'm usinng librdkafka-2.0.2, and i'm facing with rd_kafka_consumer_close(myHandle) call blocking (once the broker become unreachable). The same problem appears with rd_kafka_destroy(myHandle). if i skip both api calls, a leakage appears each time a connection lost or unreachable broker happen. Is there a way to avoid this call blocking (without running into memory leakage)? thank you in advance
How to reproduce
Simulate lost connection (i,e, using clumsy) before invoke rd_kafka_consumer_close or rd_kafka_destroy. Checklist
Please provide the following information:
<v2.0.2>
<v2.8.2>
<client.id="somevalue";bootstrap.servers= "my_Ip";group.id= "somevalue"; statistics.interval.ms=1000>
<Windows10>
debug=..
as necessary) from librdkafka