Quuxplusone commented 2 months ago

I'm pulling this out into a new Issue, because we keep discussing it in different random PRs/issues and it'll help to have just one place to discuss it.

On #4667 I wrote:

We observed that destroying a groupconsumer would often hang waiting for the broker thread to exit. We tediously bisected the problem to the specific commit 8e20e1ee (the last commit before the v2.0.0rc1 tag). Only then did we find that a lot of people on GitHub were already complaining about that commit as introducing a resource leak: the commit adds a call to rd_kafka_toppar_keep that bumps the refcount of the toppar, and I don't immediately see a corresponding rd_kafka_toppar_destroy anywhere.

Reverting 8e20e1ee (as in this commit) does fix the hang in groupconsumer destruction which we were observing, so we've applied this patch to our downstream library.

Fixes #4486.

Then in https://github.com/confluentinc/librdkafka/pull/4669#issuecomment-2037799703 I wrote:

@emasab wrote:

@Quuxplusone in case it causes a hang, please try this fix [i.e. #4669] and in case it's still happening tell me how to reproduce it.

Okay, I just tried it. My employer's test case is red with librdkafka v2.3.0, green with librdkafka v2.3.0-plus-#4667, but unfortunately remains red with librdkafka v2.3.0-plus-#4669.

That's not to say that #4669 is useless or fails to fix a real bug; it just doesn't fix our bug.

Then @emasab asked:

Is your failing test doing something specific, like deleting a topic or using the cooperative-sticky assignor?

I can't get super detailed (because the layers of abstraction between librdkafka and the level of the test itself are all three of confusing/internal/proprietary), but I'm pretty sure we don't know anything about "assignors" so we're doing whatever the default is, there. The structure of our test is more or less:

Create two distinct rd_kafka_ts in the same process: one a groupconsumer on topic T1 and broker B1, and the other a producer on topic T2 on broker B2. (Actually, make that four: we have another (non-group) consumer running, and another producer, too, for other purposes. I think it's just four total.)
When making the groupconsumer: enable.auto.commit=false, enable.auto.offset.store=false, auto.offset.reset=smallest. Also, rd_kafka_conf_set_rebalance_cb to a callback that looks for RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS and remembers how many partitions it saw.
Set the groupconsumer consuming, i.e. loop on rd_kafka_consumer_poll until either it returns NULL or until it has returned RD_KAFKA_RESP_ERR__PARTITION_EOF on T1 as many times as there are partitions. (I suspect this second condition, and thus the rebalance callback, is irrelevant; but I don't know.)
Stop Kafka broker B1, so that the groupconsumer can't talk to Kafka anymore. Wait 3 minutes.
Restart Kafka broker B1. Wait 3 minutes.
(During all this, we're also pushing data to topic T1 on broker B1, so that the groupconsumer has stuff to consume. Also, each time the groupconsumer sees a message, we produce a message to the producer talking to B2 and then commit the offset to the groupconsumer via rd_kafka_commit(rk, toppar, true) where toppar is freshly created with rd_kafka_topic_partition_list_new and destroyed with rd_kafka_topic_partition_list_destroy immediately afterward.)
Shut down the groupconsumer: rd_kafka_consumer_close(rk). This seems to return RD_KAFKA_RESP_ERR_NO_ERROR.
Destroy the groupconsumer: rd_kafka_destroy(rk). This blocks here...

Thread 4 (Thread 0x7fded5ffb700 (LWP 30469)):
#0  0x00007fdeee498017 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000000007ca5a2 in thrd_join ()
#2  0x0000000000734b56 in rd_kafka_destroy_app ()
[...]

which I believe is waiting for rd_kafka_thread_main to exit, but it's blocked...

Thread 3 (Thread 0x7fde5f7fe700 (LWP 30478)):
#0  0x00007fdeee498017 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000000007ca5a2 in thrd_join ()
#2  0x0000000000732fef in rd_kafka_destroy_internal ()
#3  0x00000000007389a5 in rd_kafka_thread_main ()
#4  0x00000000007ca2d7 in _thrd_wrapper_function ()
#5  0x00007fdeee496ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fdeee1bfb0d in clone () from /lib64/libc.so.6

which I believe is waiting for one of these threads in rd_kafka_broker_thread_main but I'm not sure which thread. We have a total of 35 threads still extant at this point. Remember that the nongroup consumer and the two producers are all still running.

Thread 7 (Thread 0x7fde5d7fa700 (LWP 20537)):
#0  0x00007fdeee49ade2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ca469 in cnd_timedwait ()
#2  0x0000000000763771 in rd_kafka_q_pop_serve.localalias ()
#3  0x0000000000747198 in rd_kafka_broker_ops_io_serve ()
#4  0x00000000007481cf in rd_kafka_broker_serve ()
#5  0x0000000000748833 in rd_kafka_broker_thread_main ()
#6  0x00000000007ca2d7 in _thrd_wrapper_function ()
#7  0x00007fdeee496ea5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fdeee1bfb0d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fde5cff9700 (LWP 20538)):
#0  0x00007fdeee1b4ddd in poll () from /lib64/libc.so.6
#1  0x000000000075d25e in rd_kafka_transport_poll ()
#2  0x000000000075f10e in rd_kafka_transport_io_serve ()
#3  0x0000000000747119 in rd_kafka_broker_ops_io_serve ()
#4  0x00000000007477b9 in rd_kafka_broker_consumer_serve ()
#5  0x0000000000747ef1 in rd_kafka_broker_serve ()
#6  0x0000000000748585 in rd_kafka_broker_thread_main ()
#7  0x00000000007ca2d7 in _thrd_wrapper_function ()
#8  0x00007fdeee496ea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fdeee1bfb0d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fdeb8ff9700 (LWP 30480)):
#0  0x00007fdeee49ade2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000007ca469 in cnd_timedwait ()
#2  0x0000000000763771 in rd_kafka_q_pop_serve.localalias ()
#3  0x0000000000747198 in rd_kafka_broker_ops_io_serve ()
#4  0x00000000007477b9 in rd_kafka_broker_consumer_serve ()
#5  0x0000000000747ef1 in rd_kafka_broker_serve ()
#6  0x0000000000748833 in rd_kafka_broker_thread_main ()
#7  0x00000000007ca2d7 in _thrd_wrapper_function ()
#8  0x00007fdeee496ea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fdeee1bfb0d in clone () from /lib64/libc.so.6

Anyway, I don't think this gives you enough to go on, but at least we have a central place to talk about the hang now, instead of scattering it over different PRs.

emasab commented 2 months ago

@Quuxplusone thanks for the description of the problem. Will try to reproduce it, maybe it's linked to stopping and restarting the broker. Could you try it without rd_kafka_conf_set_rebalance_cb to see if it happens too?

emasab commented 2 months ago

Some issues that could be related https://github.com/confluentinc/librdkafka/issues/4519 https://github.com/confluentinc/librdkafka/issues/4131 https://github.com/confluentinc/librdkafka/issues/3954

Quuxplusone commented 2 months ago

@Quuxplusone thanks for the description of the problem. Will try to reproduce it, maybe it's linked to stopping and restarting the broker. Could you try it without rd_kafka_conf_set_rebalance_cb to see if it happens too?

With vanilla librdkafka v2.3.0, but with our custom rd_kafka_conf_set_rebalance_cb removed from the groupconsumer, our test is still red. With v2.3.0 plus removing our rd_kafka_conf_set_rebalance_cb plus adding #4667, our test remains green. (IOW, whatever our weird rebalance callback is doing, and regardless of whether it's necessary for us in general, it seems irrelevant to this specific test.)

Mrigank11 commented 1 month ago

@Quuxplusone we're also facing similar issues in our use-cases, but it's very sporadic. We're working on a stable reproducer to move further. Checking if you have a public reproducer that you could share?

Currently, I'm trying with the following test-setup, but the issue is reproduced very rarely:

start a cluster with 3 brokers, 1 topic, 1000 partitions
push 10 messages to the topic
start consumer, read 5 messages
stop kafka cluster, wait for 4 minutes, and restart the cluster
read 5 more messages from the consumer
close the consumer

Any ideas for improving the reproducer would be very helpful.

Quuxplusone commented 1 month ago

@Quuxplusone we're also facing similar issues in our use-cases, but it's very sporadic. We're working on a stable reproducer to move further. Checking if you have a public reproducer that you could share?

Nope, I never came up with a reproducer. But we haven't seen the problem at all since we applied #4667. I can add that our test topic definitely didn't have 1000 partitions, and off the top of my head I'm pretty sure it just had 1. Likewise I'm pretty sure we have only 1 broker.

Mrigank11 commented 1 month ago

Thanks for the inputs @Quuxplusone. Using the steps mentioned in my previous comment, I could sporadically reproduce the issue. I tried collecting valgrind traces and noticed a definite memory leak in the cases where I could reproduce the issue:

==3234364== 6,836 (1,000 direct, 5,836 indirect) bytes in 1 blocks are definitely lost in loss record 216 of 223
==3234364==    at 0x4C3CE4B: calloc (vg_replace_malloc.c:1328)
==3234364==    by 0xA02869A: rd_calloc (rd.h:161)
==3234364==    by 0xA02869A: rd_kafka_topic_new0 (rdkafka_topic.c:333)
==3234364==    by 0xA07C92A: rd_kafka_toppar_get2 (rdkafka_partition.c:443)
==3234364==    by 0xA07FC16: rd_kafka_topic_partition_ensure_toppar (rdkafka_partition.c:2964)
==3234364==    by 0xA088476: rd_kafka_assignment_add (rdkafka_assignment.c:732)
==3234364==    by 0xA0664EA: rd_kafka_cgrp_assign (rdkafka_cgrp.c:3659)
==3234364==    by 0xA07494B: rd_kafka_cgrp_handle_assign_op (rdkafka_cgrp.c:4858)
==3234364==    by 0xA07494B: rd_kafka_cgrp_op_serve (rdkafka_cgrp.c:5023)
==3234364==    by 0xA03DD0E: rd_kafka_q_serve (rdkafka_queue.c:553)
==3234364==    by 0xA00D6F3: rd_kafka_thread_main (rdkafka.c:2117)
==3234364==    by 0x55BF2D1: start_thread (pthread_create.c:476)
==3234364==    by 0x5F9AE72: clone (clone.S:95)

Is it of any help in debugging the root cause? This memory leak isn't present when the test-program exits cleanly. Noting that I've also used @Mekk's patch, to trigger the memory leak otherwise it'd be very difficult to identify from the dozen's of objects in the heap.

Cc: @emasab @edenhill

emasab commented 1 month ago

@Quuxplusone could you verify these things?

if it does happen calling only rd_kafka_destroy without closing the consumer first. That will close the consumer too, but after setting a flag.
that all these objects are destroyed before calling rd_kafka_destroy(), even in corner cases

For C, make sure the following objects are destroyed prior to calling rd_kafka_consumer_close() and rd_kafka_destroy():

rd_kafka_message_t

rd_kafka_topic_t

rd_kafka_topic_partition_t

rd_kafka_topic_partition_list_t

rd_kafka_event_t

rd_kafka_queue_t

Quuxplusone commented 1 month ago

that all these objects are destroyed before calling rd_kafka_destroy(), even in corner cases

For C, make sure the following objects are destroyed prior to calling rd_kafka_consumer_close() and rd_kafka_destroy():

rd_kafka_message_t

rd_kafka_topic_t

rd_kafka_topic_partition_t

rd_kafka_topic_partition_list_t

rd_kafka_event_t

rd_kafka_queue_t

Could you explain how to verify that? (Especially since in this case we have multiple independent rd_kafka_t objects alive at once.) Our working hypothesis is still that #4667 fixes a leak of one rd_kafka_toppar_keep. All our observations are consistent with that hypothesis, so after applying #4667 downstream and seeing no more hangs, I haven't looked at this in quite a while. If it's possible to detect leaked toppars, such that I could follow your advice above to make sure I wasn't leaking anything... could you apply the same approach to determine whether librdkafka is leaking?

emasab commented 1 month ago

@Quuxplusone given https://github.com/confluentinc/librdkafka/commit/8e20e1ee79b188ae610aac3a2d2517f7f12dd890 adds a reference from an rd_kafka_op_t of type BARRIER to the topic partition in case you're forwarding a partition to a queue you should verify that queue destruction happens before calling destroy on those rd_kafka_t objects.

I've created PR https://github.com/confluentinc/librdkafka/pull/4724 to fix that case I was mentioning in https://github.com/confluentinc/librdkafka/pull/4667 . You can try it to see if it fixes your case.

emasab commented 3 weeks ago

@Quuxplusone can this be closed, after merging the fix, is #4724 solving your case?

Quuxplusone commented 3 weeks ago

@emasab: I haven't tested #4724. At the moment, it has merge conflicts.

Our working hypothesis is still that https://github.com/confluentinc/librdkafka/pull/4667 fixes a leak of one rd_kafka_toppar_keep. All our observations are consistent with that hypothesis, so after applying https://github.com/confluentinc/librdkafka/pull/4667 downstream and seeing no more hangs, I haven't looked at this in quite a while.

When there's a new release of librdkafka, we'll certainly try upgrading to it, and then we can see whether it's safe for us to revert #4667 or whether we still see the problem. But #4724 isn't even in master yet, let alone a new release, right?

I suggest that you should at least try to figure out whether #4724 is correct; if it is correct, then merge it; if it's incorrect, don't merge it. This should be done completely independently of whether it happens to fix #4674 or not.

emasab commented 2 weeks ago

@Quuxplusone #4667 fixes the leak because it removes the reference to the toppar but that was introduced to fix a bug because if the barrier doesn't contain the toppar but only the epoch, the code removes messages with that epoch even if from different partition, that are never delivered. So the fix cannot be reverted.

4724 is correct in the sense that fixes the hang on destroy that happened sporadically in test 0113 and it can fix cases similar to that test.

Given you cannot reliably reproduce the case except with your test that you cannot share I don't have a way to check if it solves your case. When the fix is released you can try to see if you issue is solved.

I'll also try to reproduce it before or after the fix from the instructions from @Quuxplusone or @Mrigank11.

confluentinc / librdkafka

Starting with 8e20e1ee, after broker goes down and back up, `rd_kafka_destroy` of groupconsumer hangs #4674

4724 is correct in the sense that fixes the hang on destroy that happened sporadically in test 0113 and it can fix cases similar to that test.