confluentinc/librdkafka: The Apache Kafka C/C++ library

Fix memory leak while stopping an unknown partition #4669

Open · emasab opened 2 months ago

emasab commented 2 months ago

When stopping a partition that isn't known to the client, a reference to that partition was kept in its own fetch queue (fetchq), leading to a memory leak. Solved by purging the fetch and ops queues of the unknown partition when it's stopped.
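For illustration, the fix amounts to something like the sketch below. This is not the actual patch: the internal names used here (`rd_kafka_q_purge()`, and the `rktp_fetchq` and `rktp_ops` fields on the toppar) are assumptions based on librdkafka's internal naming conventions.

```c
/* Sketch only; would live inside librdkafka's own source tree,
 * where rd_kafka_toppar_t and rd_kafka_q_purge() are declared. */
static void toppar_stop_unknown_purge(rd_kafka_toppar_t *rktp) {
        /* Ops and fetched messages sitting on these queues hold a
         * reference to the toppar; without a purge on stop, an
         * unknown partition's refcount never reaches zero and the
         * toppar (and everything it owns) is leaked. */
        rd_kafka_q_purge(rktp->rktp_fetchq); /* pending fetched messages */
        rd_kafka_q_purge(rktp->rktp_ops);    /* pending ops */
}
```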

Quuxplusone commented 2 months ago

In #4486 I wrote:

My employer has also tentatively bisected an issue to https://github.com/confluentinc/librdkafka/commit/8e20e1ee79b188ae610aac3a2d2517f7f12dd890. We see that if we have a groupconsumer, and the broker goes down (that is, we shut down the Kafka broker), and then comes back up, and then we try to shut down the groupconsumer, it simply hangs forever. This doesn't reproduce in 1.9.3rc2 or earlier, but it does reproduce in 2.0.0rc1 through 2.3.0. In fact, it appears not to reproduce in https://github.com/confluentinc/librdkafka/commit/a83cadf5eab3a43f7f0d3dd09f5b1a3e9e88312f but to start reproducing with https://github.com/confluentinc/librdkafka/commit/8e20e1ee79b188ae610aac3a2d2517f7f12dd890.
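(For context, the scenario described above corresponds roughly to the following sketch against librdkafka's public C API. The broker restart happens out of band, and this is not a confirmed standalone reproducer: as noted below, the failure has only been observed through higher-level C++ layers.)

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "repro-group",
                          errstr, sizeof(errstr));

        /* A group consumer ("groupconsumer" above). */
        rd_kafka_t *rk =
                rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *topics =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "repro-topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* Poll for a while; during this window the broker is stopped
         * and then restarted externally. */
        for (int i = 0; i < 600; i++) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }

        /* In the affected versions (2.0.0rc1 through 2.3.0) this
         * close reportedly hangs forever. */
        fprintf(stderr, "closing consumer...\n");
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
        return 0;
}
```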

@emasab wrote:

@Quuxplusone in case it causes a hang, please try this fix, and if it's still happening, tell me how to reproduce it.

Okay, I just tried it. My employer's test case is red with librdkafka v2.3.0, green with librdkafka v2.3.0-plus-#4667, but unfortunately remains red with librdkafka v2.3.0-plus-#4669.

That's not to say that #4669 is useless or fails to fix a real bug; it just doesn't fix our bug.

OTOH I also don't claim that #4669 is bug-free. Its new code that takes two mutex locks at once is clearly at risk of deadlock. It would benefit from a code comment explaining (or linking to) the project's lock-ordering policy.

Unfortunately I don't have a reproducer using only librdkafka code; we build several C++ layers on top of librdkafka.

emasab commented 2 months ago

@Quuxplusone thanks for testing it. The code respects the lock-ordering policy, which is:

  1. rk main lock
  2. rkb lock
  3. rkt lock
  4. rktp lock

Is your failing test doing something specific, like deleting a topic or using the cooperative-sticky assignor?
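For illustration, an ordering policy like the one above can be read as a rank on the locks: any code path that needs several of them must acquire them in ascending rank, so two threads can never hold them in opposite orders and deadlock. A generic pthreads sketch (hypothetical lock names mirroring the hierarchy; this is not librdkafka's actual lock API):

```c
#include <pthread.h>

pthread_mutex_t rk_lock   = PTHREAD_MUTEX_INITIALIZER; /* 1. rk main  */
pthread_mutex_t rkb_lock  = PTHREAD_MUTEX_INITIALIZER; /* 2. rkb broker */
pthread_mutex_t rkt_lock  = PTHREAD_MUTEX_INITIALIZER; /* 3. rkt topic  */
pthread_mutex_t rktp_lock = PTHREAD_MUTEX_INITIALIZER; /* 4. rktp toppar */

void stop_partition(void) {
        /* Holding two locks at once is safe only because both are
         * acquired in rank order: topic (3) before toppar (4).
         * Acquiring them as 4-then-3 anywhere else in the code
         * would reintroduce the deadlock risk. */
        pthread_mutex_lock(&rkt_lock);  /* rank 3 */
        pthread_mutex_lock(&rktp_lock); /* rank 4 */
        /* ... operate on the partition ... */
        pthread_mutex_unlock(&rktp_lock);
        pthread_mutex_unlock(&rkt_lock);
}
```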