confluentinc / librdkafka

The Apache Kafka C/C++ library

Sporadic crash in rd_kafka_buf_callback() #4673

Open GerKr opened 2 months ago

GerKr commented 2 months ago

Description

In rare cases librdkafka.dll crashes. The crash dump shows an invalid memory access while EnterCriticalSection() is executing. For more details, see the section "How to reproduce" below.

How to reproduce

Because it happens very rarely, I could not reproduce it. However, I analysed the crash dump and extracted the following call stack, with some notes added by hand. The crash dump comes from librdkafka v1.6.1, so the line numbers refer to that version. The line marked with "===>" was never reached when I tried to reproduce the error.

    rd_kafka_broker_ops_serve()    rdkafka_broker.c:3345 -> 3351
    case RD_KAFKA_OP_TERMINATE
    rd_kafka_broker_op_serve()    rdkafka_broker.c:2950 -> 3276
    rd_kafka_broker_fail(rkb, LOG_DEBUG, RD_KAFKA_RESP_ERR__DESTROY, "Client is terminating");    rdkafka_broker.c:520 -> 577
    rd_kafka_bufq_purge(..., 2. param: rd_kafka_bufq_t rkbufq=&tmpq_waitresp, ...)    rdkafka_buf.c:245 -> 256
    TAILQ_FOREACH_SAFE(rkbuf, &rkbufq->rkbq_bufs, rkbuf_link, tmp)    rdkafka_buf.c:255
    ===> rd_kafka_buf_callback(..., 5. param: rd_kafka_buf_t request=rkbuf)    rdkafka_buf.c:450 -> 495
    rd_kafka_buf_destroy(rkbuf=request)    rdkafka_buf.h:804 macro
    => rd_refcnt_destroywrapper(REFCNT=&(rkbuf)->rkbuf_refcnt, ...)    rd.h:355 macro
    => rd_refcnt_sub(R=REFCNT)    rd.h:401 macro
    => rd_refcnt_sub0(rd_refcnt_t * R)    rd.h:325 -> 328
    mtx_lock(&R->lock)
    EnterCriticalSection()

Additional info: The crash inside EnterCriticalSection() can be reproduced exactly with a simple program that calls EnterCriticalSection() without first calling InitializeCriticalSection(). This seems to be exactly what happens when buffers are still in the queue and the marked line of the call stack is executed.

IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/confluentinc/librdkafka/releases); if it can't be reproduced on the latest version, the issue has been fixed.

Since I do not know how to reproduce the situation in which buffers are still queued while the Kafka bufq is purged, I cannot tell whether the error is still present. Comparing the v1.6.1 sources against v2.3.0 did not show any fix in this area.

Proposal for making the code more defensive: have mtx_init() record that initialization has taken place, and have mtx_lock() check whether it has. If it has not, mtx_lock() performs the initialization implicitly.
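
The proposal could be sketched roughly as follows. This is only an illustration under stated assumptions, not librdkafka code: it uses POSIX threads instead of the Windows critical-section path seen in the crash dump, and `checked_mtx_t`, `checked_mtx_init()`, `checked_mtx_lock()` and `checked_mtx_unlock()` are hypothetical names. It also assumes the structure is zero-filled before first use, so that the flag reads as "not initialized".

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical wrapper (not librdkafka's mtx_t): a mutex that remembers
 * whether it has been initialized. */
typedef struct checked_mtx_s {
        pthread_mutex_t lock;
        bool initialized; /* set by checked_mtx_init() */
} checked_mtx_t;

/* Initialize the mutex and record that this has happened.
 * Returns 0 on success, otherwise the pthread error code. */
static int checked_mtx_init(checked_mtx_t *m) {
        int r = pthread_mutex_init(&m->lock, NULL);
        if (r == 0)
                m->initialized = true;
        return r;
}

/* Lock the mutex, initializing it first if checked_mtx_init() was never
 * called.
 * ASSUMPTION: the struct was zero-filled (static storage, calloc, = {0}),
 * so `initialized` reads as false for a never-initialized mutex.
 * The lazy path is not thread-safe: two threads taking it at the same
 * time could both initialize the mutex. */
static int checked_mtx_lock(checked_mtx_t *m) {
        if (!m->initialized) {
                int r = checked_mtx_init(m);
                if (r != 0)
                        return r;
        }
        return pthread_mutex_lock(&m->lock);
}

/* Unlock the mutex; unlocking a never-initialized mutex is a caller bug. */
static int checked_mtx_unlock(checked_mtx_t *m) {
        assert(m->initialized);
        return pthread_mutex_unlock(&m->lock);
}
```

A zero-filled `checked_mtx_t` can then be locked without an explicit init call. Such a check only helps when the flag itself is reliable; if the memory holding the mutex is uninitialized garbage or has already been freed, the flag cannot be trusted either.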

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information: