Description
When a partition is absent from metadata it is delegated to the internal broker :0/internal. If the partition comes back later with the same leader as before, it is no longer delegated back from the internal broker. This makes the partition unusable: producing to and consuming from it stall.

The issue is likely a result of this change in v2.4.0 (PR #4680): https://github.com/confluentinc/librdkafka/commit/6584ed7c8b00786121c07bc0df5b3d7fa8da2661

This change requires that the epoch has changed (leader_epoch > rktp->rktp_leader_epoch) before calling rd_kafka_toppar_broker_update(), which is not the case in this situation.

This commit also affects the related test case 107, which fails. That test scenario only seems to pass if the commit above is reverted.
Logs where the issue can be seen:
Running test 107:
Checklist
librdkafka version: v2.4.0
Apache Kafka version: 3.6.2
Operating system: SLES
Provide logs (with debug=.. as necessary) from librdkafka

CC @emasab