confluentinc / librdkafka

The Apache Kafka C/C++ library

Revert 8e20e1ee (#4117) to fix hang in destruction of groupconsumer #4667

Open Quuxplusone opened 2 months ago

Quuxplusone commented 2 months ago

We observed that destroying a groupconsumer would often hang while waiting for the broker thread to exit. We tediously bisected the problem to the specific commit 8e20e1ee (the last commit before the v2.0.0rc1 tag). Only then did we find that a lot of people on GitHub were already reporting that this commit introduces a resource leak: it adds a call to rd_kafka_toppar_keep, which bumps the refcount of the toppar, and I don't immediately see a corresponding rd_kafka_toppar_destroy anywhere.

Reverting 8e20e1ee (as in this commit) does fix the hang in groupconsumer destruction which we were observing, so we've applied this patch to our downstream library.

Fixes #4486.

emasab commented 2 months ago

Hello @Quuxplusone, thanks for investigating this issue. The solution isn't to revert the commit: as you can see, it fixed tests that were failing. The rd_kafka_toppar_destroy is usually called here.

But that only happens when the op is destroyed, so there may be cases where the BARRIER op isn't destroyed. I have found a similar refcnt issue in test 0113, subtest n_wildcard. It happens only sporadically, and in that test a topic is deleted. Does it happen to you when a topic is deleted too?
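
For readers following along, the keep/destroy pairing discussed above can be sketched in isolation. This is a minimal illustration only, not librdkafka code: `toppar_t`, `op_t`, `toppar_keep`, `toppar_destroy`, `op_new_barrier`, `op_destroy` and `run_scenario` are hypothetical stand-ins, and the real refcounting in librdkafka is more involved. The sketch assumes that an op takes a reference when it is created and releases it only when the op itself is destroyed, which is the behaviour described in the comment above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted topic-partition object. */
typedef struct toppar {
    int refcnt;
} toppar_t;

/* Hypothetical stand-in for an op that holds a toppar reference. */
typedef struct op {
    toppar_t *tp;
} op_t;

static toppar_t *toppar_new(void) {
    toppar_t *tp = malloc(sizeof(*tp));
    if (!tp)
        abort();
    tp->refcnt = 1; /* the creator's own reference */
    return tp;
}

/* Take an extra reference. */
static toppar_t *toppar_keep(toppar_t *tp) {
    tp->refcnt++;
    return tp;
}

/* Drop one reference; free the object when the last one goes away. */
static void toppar_destroy(toppar_t *tp) {
    if (--tp->refcnt == 0)
        free(tp);
}

/* Creating the op takes a reference on the toppar. */
static op_t *op_new_barrier(toppar_t *tp) {
    op_t *op = malloc(sizeof(*op));
    if (!op)
        abort();
    op->tp = toppar_keep(tp);
    return op;
}

/* Destroying the op is the only place that reference is released. */
static void op_destroy(op_t *op) {
    toppar_destroy(op->tp);
    free(op);
}

/*
 * Returns how many references would remain after the owner drops its own.
 * 0 means the toppar can be freed; anything else means shutdown code that
 * waits for the refcount to reach zero would wait forever.
 */
static int run_scenario(int op_is_destroyed) {
    toppar_t *tp = toppar_new();
    op_t *op = op_new_barrier(tp);
    int remaining;

    if (op_is_destroyed)
        op_destroy(op);

    remaining = tp->refcnt - 1; /* read before dropping the owner's ref */
    toppar_destroy(tp);

    if (!op_is_destroyed)
        op_destroy(op); /* clean up so the demo itself does not leak */

    return remaining;
}
```

Calling `run_scenario(1)` models the usual path, where the op is destroyed and no references are left over. Calling `run_scenario(0)` models an op that is never destroyed: one reference is left outstanding, which is the kind of imbalance that could keep a destructor waiting.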