Open blindspotbounty opened 3 months ago
Additionally have the following log of refcounts:
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 0 +1: 0x7f11d0018df8: rd_kafka_broker_add:4978
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 1 +1: 0x7f11d0018df8: rd_kafka_broker_add:5046
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 2 +1: 0x7f11d0018df8: rd_kafka_broker_monitor_add:6091
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 3 +1: 0x7f11d0018df8: rd_kafka_broker_update:5593
REFCNT DEBUG: &(rkb)->rkb_refcnt 4 -1: 0x7f11d0018df8: rd_kafka_broker_update:5618
REFCNT DEBUG: &(rkb)->rkb_refcnt 3 +1: 0x7f11d0018df8: rd_kafka_broker_update:5580
REFCNT DEBUG: &(rkb)->rkb_refcnt 4 -1: 0x7f11d0018df8: rd_kafka_broker_update:5618
REFCNT DEBUG: &(rkb)->rkb_refcnt 3 +1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1315
REFCNT DEBUG: &(rkb)->rkb_refcnt 4 +1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1315
REFCNT DEBUG: &(rkb)->rkb_refcnt 5 +1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1315
REFCNT DEBUG: &(rkb)->rkb_refcnt 6 +1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1315
REFCNT DEBUG: (&(leader)->rkb_refcnt) 7 +1: 0x7f11d0018df8: rd_kafka_toppar_leader_update:744
REFCNT DEBUG: (&(new_rkb)->rkb_refcnt) 8 +1: 0x7f11d0018df8: rd_kafka_toppar_broker_migrate:1045
REFCNT DEBUG: &(leader)->rkb_refcnt 9 -1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1405
REFCNT DEBUG: (&(leader)->rkb_refcnt) 8 +1: 0x7f11d0018df8: rd_kafka_toppar_leader_update:744
REFCNT DEBUG: (&(new_rkb)->rkb_refcnt) 9 +1: 0x7f11d0018df8: rd_kafka_toppar_broker_migrate:1045
REFCNT DEBUG: &(leader)->rkb_refcnt 10 -1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1405
REFCNT DEBUG: (&(leader)->rkb_refcnt) 9 +1: 0x7f11d0018df8: rd_kafka_toppar_leader_update:744
REFCNT DEBUG: (&(new_rkb)->rkb_refcnt) 10 +1: 0x7f11d0018df8: rd_kafka_toppar_broker_migrate:1045
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 11 +1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3350
REFCNT DEBUG: &(rktp->rktp_next_broker)->rkb_refcnt 12 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3367
REFCNT DEBUG: &(leader)->rkb_refcnt 11 -1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1405
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 11 +1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3350
REFCNT DEBUG: &(rktp->rktp_next_broker)->rkb_refcnt 11 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3367
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 10 +1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3350
REFCNT DEBUG: &(rktp->rktp_next_broker)->rkb_refcnt 11 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3367
REFCNT DEBUG: (&(leader)->rkb_refcnt) 10 +1: 0x7f11d0018df8: rd_kafka_toppar_leader_update:744
REFCNT DEBUG: (&(new_rkb)->rkb_refcnt) 11 +1: 0x7f11d0018df8: rd_kafka_toppar_broker_migrate:1045
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 12 +1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3350
REFCNT DEBUG: &(rktp->rktp_next_broker)->rkb_refcnt 13 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3367
REFCNT DEBUG: &(leader)->rkb_refcnt 12 -1: 0x7f11d0018df8: rd_kafka_topic_metadata_update:1405
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 11 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 12 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 13 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 14 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 15 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 16 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 17 +1: 0x7f11d0018df8: rd_kafka_broker_trigger_monitors:6055
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 18 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rko->rko_u.broker_monitor.rkb)->rkb_refcnt 18 -1: 0x7f11d0018df8: rd_kafka_op_destroy:473
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 16 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 15 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 16 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 15 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 15 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 15 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 16 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 15 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 14 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 15 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 14 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 13 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 14 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 13 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 12 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 13 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 12 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 11 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 12 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 13 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 12 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 13 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 12 +1: 0x7f11d0018df8: rd_kafka_broker_weighted:1482
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 13 +1: 0x7f11d0018df8: rd_kafka_buf_new_request0:162
REFCNT DEBUG: &(rkb)->rkb_refcnt 14 -1: 0x7f11d0018df8: rd_kafka_metadata_refresh_topics:1423
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 13 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 14 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 13 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: (&(rkbuf->rkbuf_rkb)->rkb_refcnt) 12 +1: 0x7f11d0018df8: rd_kafka_req_response:1945
REFCNT DEBUG: &(rkb)->rkb_refcnt 13 +1: 0x7f11d0018df8: rd_kafka_broker_update:5580
REFCNT DEBUG: &(rkb)->rkb_refcnt 14 -1: 0x7f11d0018df8: rd_kafka_broker_update:5618
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 13 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rkbuf->rkbuf_rkb)->rkb_refcnt 12 -1: 0x7f11d0018df8: rd_kafka_buf_destroy_final:78
REFCNT DEBUG: &(rktp->rktp_broker)->rkb_refcnt 11 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3443
REFCNT DEBUG: &(rktp->rktp_broker)->rkb_refcnt 10 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3443
REFCNT DEBUG: &(rktp->rktp_broker)->rkb_refcnt 9 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3443
REFCNT DEBUG: &(rktp->rktp_broker)->rkb_refcnt 8 -1: 0x7f11d0018df8: rd_kafka_broker_op_serve:3443
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 7 +1: 0x7f11d0018df8: rd_kafka_broker_weighted:1482
REFCNT DEBUG: &(good)->rkb_refcnt 8 -1: 0x7f11d0018df8: rd_kafka_broker_weighted:1481
REFCNT DEBUG: &(rkb)->rkb_refcnt 7 +1: 0x7f11d0018df8: rd_kafka_broker_update:5580
REFCNT DEBUG: &(rkb)->rkb_refcnt 8 -1: 0x7f11d0018df8: rd_kafka_broker_update:5618
REFCNT DEBUG: &(rkb)->rkb_refcnt 7 -1: 0x7f11d0018df8: rd_kafka_destroy_internal:1225
REFCNT DEBUG: (&(rkb)->rkb_refcnt) 6 +1: 0x7f11d0018df8: rd_kafka_broker_trigger_monitors:6055
REFCNT DEBUG: &(rko->rko_u.broker_monitor.rkb)->rkb_refcnt 7 -1: 0x7f11d0018df8: rd_kafka_op_destroy:473
REFCNT DEBUG: &(rkb)->rkb_refcnt 6 -1: 0x7f11d0018df8: rd_kafka_broker_monitor_del:6122
But I am not very good aware of librdkafka internals to understand which refcounts were not decremented. Could someone help me check that, please?
Hi @blindspotbounty I think that data race doesn't have to do with the hang on destroy, as both threads are accessing the
rd_kafka_fetch_pos2str
and writing to the static field with the return string (it's possible this leads to an incorrect log message though, but still NULL terminated)
For the hang on destroy, especially if you're removing a topic, try applying this PR https://github.com/confluentinc/librdkafka/pull/4724
Read the FAQ first: https://github.com/confluentinc/librdkafka/wiki/FAQ
Do NOT create issues for questions, use the discussion forum: https://github.com/confluentinc/librdkafka/discussions
Description
I was investigating hang on destroy that specifically reproduces on x86 platform (cannot reproduce with Mac OS/Linux arch64/M2).
How to reproduce
But on x86, it is good reproducible with the following scenario:
Better reproducible with rebalance enabled and assign called from different thread, i.e.:
The only thing that I've found that is pretty good reproducible on x86 platform is the following race (which I don't see on arm):
It is better reproducible with v2.3.0 and less with v2.5.0. With 2.3.0 I can reproduce it without external rebalance and with 2.5.0 with custom rebalance it is much better reproducible.
Reproducing this scenario pretty good with swift wrapper. There is a code in swift that can catch this race:
As the only difference between platforms is this race catched by TSan, I suspect that this might be a problem for future client destroy method.
Remark: I was trying to use kafka_destroy and kafka_destroy_flags with RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE but works the same.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
debug=..
as necessary) from librdkafka Last logs before hang isWhile rebalance assign(NULL) is called further, it is ignored by librdkafka and seems it is the reason for hanging