Closed: uglycow closed this issue 2 years ago
I think we encountered the same issue as described in #2944; all the logs match, but we did not perform a rolling update of the brokers. The brokers experienced leader re-election because one of the brokers disconnected from ZooKeeper.
We kept encountering this error even though we have been using the fixed version (librdkafka-1.6.1).
So I checked the fix #3238, and I think the error still has a chance to arise:
because rkb_nodename_epoch is incremented and read in two different threads, two increments might be consumed by a single assignment in rd_kafka_broker_connect(), so the second connect might reuse the address-resolution result of the first connect attempt.
A simple sequence diagram to show the case:
I'm still not 100% sure about it. @edenhill @ajbarb, would you please take a look at this?
I don't see the v1.6.1 having the fix: https://github.com/edenhill/librdkafka/blob/v1.6.1/src/rdkafka_broker.c#L2103
Can you try setting broker.address.ttl to 0 and see if this solves the problem? The rkb_nodename_epoch is updated after rkb_nodename, so even if the first rd_kafka_broker_connect() used the 2nd/aggregated rkb_nodename_epoch, it should still connect to the updated rkb_nodename.
Also, to confirm whether this is using an invalid broker IP for the cgrp, correlate these two log messages:
```c
rd_rkb_dbg(rkb, CGRP, "CGRPCOORD",
           "Group \"%.*s\" coordinator is %s:%i id %"PRId32,
           RD_KAFKAP_STR_PR(rkcg->rkcg_group_id),
           mdb.host, mdb.port, mdb.id);
```

```c
rd_rkb_dbg(rkb, BROKER, "CONNECT", "Connecting to %s (%s) "
           "with socket %i",
           rd_sockaddr2str(sinx, RD_SOCKADDR2STR_F_FAMILY |
                                 RD_SOCKADDR2STR_F_PORT),
           rd_kafka_secproto_names[rkb->rkb_proto], s);
```
In rd_kafka_broker_connect(), the nodename is copied; rkb->rkb_nodename might be updated after the nodename gets copied:
```c
static int rd_kafka_broker_connect (rd_kafka_broker_t *rkb) {
        const rd_sockaddr_inx_t *sinx;
        char errstr[512];
        char nodename[RD_KAFKA_NODENAME_SIZE];
        rd_bool_t reset_cached_addr = rd_false;

        rd_rkb_dbg(rkb, BROKER, "CONNECT",
                   "broker in state %s connecting",
                   rd_kafka_broker_state_names[rkb->rkb_state]);

        rd_atomic32_add(&rkb->rkb_c.connects, 1);

        rd_kafka_broker_lock(rkb);
        rd_strlcpy(nodename, rkb->rkb_nodename, sizeof(nodename)); /* here the nodename is copied */
        /* ... */
```
```c
#define RD_KAFKA_VERSION 0x010601ff
```

The code shows that its version is 1.6.1, but I guess I should have checked the code of the v1.6.1 branch, and you are right: there is no fix in v1.6.1.
So we are actually using a version without the fix.
Anyway, we will try working with broker.address.ttl = 0 for the moment.
Please kindly let me know if you reach any conclusion on this.
@ajbarb just updating you that setting broker.address.ttl to 0 does solve the problem; no stuck consumers anymore.
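For anyone hitting the same symptom, the workaround is a single standard librdkafka configuration property, shown here in property-file style (it can equally be set through rd_kafka_conf_set() or a binding's config dict):

```
# Re-resolve the broker hostname on every connect attempt instead of
# caching resolved addresses (the default TTL is 1000 ms).
broker.address.ttl=0
```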
Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ
Description
Some consumers of a group get stuck after the brokers experienced a leader re-election; the detailed process is described below:
From the consumer's perspective
From the broker's perspective
How to reproduce
Can't reproduce this yet, but the issue keeps coming up.
IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
kafka.LibraryVersion(): 17170943 1.6.1
< Ubuntu 7.4.0-1ubuntu1~16.04~ppa1 >
Provide logs (with debug=.. as necessary) from librdkafka; logs are appended below.
consumer logs
broker-2 log