confluentinc / librdkafka

The Apache Kafka C/C++ library

some consumers in a group get stuck after the broker experienced leader re-election #3396

Closed: uglycow closed this 2 years ago

uglycow commented 3 years ago

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Description

Some consumers in a group get stuck after the broker experienced a leader re-election. The detailed process is described below.

From the consumer's perspective

  1. The consumer received "Not coordinator" errors on heartbeats and switched between broker-2 and broker-3 several times.
  2. Finally most of the consumers were fine, but some consumers just couldn't recover.
  3. The faulty consumers loop forever:
    • send FindCoordinatorRequest; the broker answers with "broker-2 is the coordinator"
    • send a JoinGroup request to broker-2, and get a response with a "Not coordinator" error
    • send FindCoordinatorRequest again; the broker again answers with "broker-2 is the coordinator"
    • send a JoinGroup request to broker-2, and get a response with a "Not coordinator" error ...

From the broker's perspective

  1. We have a 3-node Kafka cluster. Somehow a session timeout occurred between broker-2 and ZooKeeper, which triggered a leader re-election for the partitions previously owned by broker-2.
  2. Then broker-2 reconnected to ZooKeeper, which triggered leader re-election again.
  3. Broker-2 and broker-3 were actually elected as leader in turn several times.
  4. Finally, the system stabilized with broker-2 as the leader.
    • Broker-2 took about 12 minutes to load the offsets and group metadata from the internal partition __consumer_offsets-47. Since the broker rejects JoinGroup requests with an error while it is loading metadata, we are not surprised that some consumers were trapped in the error loop during those 12 minutes, but they just keep looping forever.
    • We can restart the faulty clients to resolve this.

How to reproduce

We can't reproduce this yet, but the issue keeps coming up.

IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

        cfg = &kafka.ConfigMap{
            "bootstrap.servers":  strings.Join(brokers, ","),
            "enable.auto.commit": false,
            "group.id":           groupID,
            "auto.offset.reset":  "earliest",
            "client.id":          uuid.New().String(),
            "session.timeout.ms": 1000 * 30,
            "debug":              "generic,cgrp",
        }

consumer logs

2021-05-18 00:10:04|1621267804.145|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: GroupCoordinator/2: Heartbeat for group "kale_scheduler_local" generation id 2141
2021-05-18 00:10:07|1621267807.145|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: GroupCoordinator/2: Heartbeat for group "kale_scheduler_local" generation id 2141
2021-05-18 00:10:13|1621267813.145|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: GroupCoordinator/2: Heartbeat for group "kale_scheduler_local" generation id 2141
2021-05-18 00:10:16|1621267816.145|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: GroupCoordinator/2: Heartbeat for group "kale_scheduler_local" generation id 2141
2021-05-18 00:10:19|1621267819.145|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: GroupCoordinator/2: Heartbeat for group "kale_scheduler_local" generation id 2141
2021-05-18 00:10:19|1621267819.146|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" heartbeat error response in state up (join-state steady, 16 partition(s) assigned): Broker: Not coordinator
2021-05-18 00:10:19|1621267819.146|CGRPQUERY|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local": querying for coordinator: Broker: Not coordinator
2021-05-18 00:10:19|1621267819.578|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local" coordinator is 10.70.23.18:9092 id 3
2021-05-18 00:10:19|1621267819.578|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changing coordinator 2 -> 3
2021-05-18 00:10:19|1621267819.578|COORDCLEAR|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" broker 10.70.23.17:9092/2 is no longer coordinator
2021-05-18 00:10:19|1621267819.578|COORDSET|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" coordinator set to broker 10.70.23.18:9092/3
2021-05-18 00:10:19|1621267819.578|CGRPSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed state up -> wait-broker-transport (join-state steady)
2021-05-18 00:10:19|1621267819.578|BROADCAST|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Broadcasting state change
2021-05-18 00:10:19|1621267819.578|CGRPSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed state wait-broker-transport -> up (join-state steady)

2021-05-18 00:10:34|1621267834.147|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" heartbeat error response in state up (join-state steady, 16 partition(s) assigned): Broker: Not coordinator
2021-05-18 00:10:34|1621267834.147|CGRPQUERY|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.16:9092/1: Group "kale_scheduler_local": querying for coordinator: Broker: Not coordinator
2021-05-18 00:10:34|1621267834.156|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.16:9092/1: Group "kale_scheduler_local" coordinator is 10.70.23.17:9092 id 2
2021-05-18 00:10:34|1621267834.156|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changing coordinator 3 -> 2
2021-05-18 00:10:34|1621267834.156|COORDCLEAR|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" broker 10.70.23.18:9092/3 is no longer coordinator
2021-05-18 00:10:34|1621267834.156|COORDSET|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" coordinator set to broker 10.70.23.17:9092/2
2021-05-18 00:10:34|1621267834.156|CGRPSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed state up -> wait-broker-transport (join-state steady)

2021-05-18 00:10:40|1621267840.147|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" heartbeat error response in state up (join-state steady, 16 partition(s) assigned): Broker: Not coordinator
2021-05-18 00:10:40|1621267840.147|CGRPQUERY|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local": querying for coordinator: Broker: Not coordinator
2021-05-18 00:10:40|1621267840.583|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local" coordinator is 10.70.23.17:9092 id 2
2021-05-18 00:10:43|1621267843.146|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: GroupCoordinator/2: Heartbeat for group "kale_scheduler_local" generation id 2141
2021-05-18 00:10:43|1621267843.147|HEARTBEAT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" heartbeat error response in state up (join-state steady, 16 partition(s) assigned): Broker: Not coordinator

2021-05-18 00:11:01%4|1621267861.586|SESSTMOUT|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Consumer group session timed out (in join-state steady) after 30438 ms without a successful response from the group coordinator (broker 2, last error was Broker: Not coordinator): revoking assignment and rejoining group
2021-05-18 00:11:01|1621267861.586|MEMBERID|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": updating member id "4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7-6449a546-b8b0-47fe-9020-f2e3fe40fabb" -> ""
2021-05-18 00:11:01|1621267861.586|REBALANCE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" initiating rebalance (EAGER) in state up (join-state steady) with 16 assigned partition(s) (lost): Consumer group session timed out (in join-state steady) after 30438 ms without a successful response from the group coordinator (broker 2, last error was Broker: Not coordinator)
2021-05-18 00:11:01|1621267861.586|LOST|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": current assignment of 16 partition(s) lost: Consumer group session timed out (in join-state steady) after 30438 ms without a successful response from the group coordinator (broker 2, last error was Broker: Not coordinator): revoking assignment and rejoining
2021-05-18 00:11:01|1621267861.586|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state steady -> wait-unassign-call (state up)

2021-05-18 00:34:49|1621269289.602|BROADCAST|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:GroupCoordinator]: Broadcasting state change
2021-05-18 00:34:49|1621269289.602|JOINGROUP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: JoinGroup response: GenerationId -1, Protocol , LeaderId , my MemberId , member metadata count 0: Broker: Not coordinator
2021-05-18 00:34:49|1621269289.602|REJOIN|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": Rejoining group without an assignment: JoinGroup error: Broker: Not coordinator
2021-05-18 00:34:49|1621269289.602|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state wait-join -> init (state up)
2021-05-18 00:34:49|1621269289.602|CGRPOP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" received op COORD_QUERY in state up (join-state init)
2021-05-18 00:34:49|1621269289.602|CGRPQUERY|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local": querying for coordinator: Broker: Not coordinator
2021-05-18 00:34:49|1621269289.603|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local" coordinator is 10.70.23.17:9092 id 2
2021-05-18 00:34:51|1621269291.602|JOIN|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": join with 1 subscribed topic(s)
2021-05-18 00:34:51|1621269291.602|CGRPMETADATA|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: consumer join: metadata for subscription is up to date (146246ms old)
2021-05-18 00:34:51|1621269291.602|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state init -> wait-join (state up)
2021-05-18 00:34:51|1621269291.602|BROADCAST|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:GroupCoordinator]: Broadcasting state change
2021-05-18 00:34:51|1621269291.603|JOINGROUP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: JoinGroup response: GenerationId -1, Protocol , LeaderId , my MemberId , member metadata count 0: Broker: Not coordinator
2021-05-18 00:34:51|1621269291.603|REJOIN|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": Rejoining group without an assignment: JoinGroup error: Broker: Not coordinator
2021-05-18 00:34:51|1621269291.603|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state wait-join -> init (state up)
2021-05-18 00:34:51|1621269291.603|CGRPOP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" received op COORD_QUERY in state up (join-state init)
2021-05-18 00:34:51|1621269291.603|CGRPQUERY|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local": querying for coordinator: Broker: Not coordinator
2021-05-18 00:34:51|1621269291.603|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local" coordinator is 10.70.23.17:9092 id 2
2021-05-18 00:34:53|1621269293.602|JOIN|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": join with 1 subscribed topic(s)
2021-05-18 00:34:53|1621269293.602|CGRPMETADATA|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: consumer join: metadata for subscription is up to date (148246ms old)
2021-05-18 00:34:53|1621269293.602|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state init -> wait-join (state up)
2021-05-18 00:34:53|1621269293.602|BROADCAST|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:GroupCoordinator]: Broadcasting state change
2021-05-18 00:34:53|1621269293.603|JOINGROUP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: JoinGroup response: GenerationId -1, Protocol , LeaderId , my MemberId , member metadata count 0: Broker: Not coordinator
2021-05-18 00:34:53|1621269293.603|REJOIN|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": Rejoining group without an assignment: JoinGroup error: Broker: Not coordinator
2021-05-18 00:34:53|1621269293.603|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state wait-join -> init (state up)
2021-05-18 00:34:53|1621269293.603|CGRPOP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" received op COORD_QUERY in state up (join-state init)
2021-05-18 00:34:53|1621269293.603|CGRPQUERY|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local": querying for coordinator: Broker: Not coordinator
2021-05-18 00:34:53|1621269293.603|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local" coordinator is 10.70.23.17:9092 id 2
2021-05-18 00:34:55|1621269295.602|JOIN|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": join with 1 subscribed topic(s)
2021-05-18 00:34:55|1621269295.602|CGRPMETADATA|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: consumer join: metadata for subscription is up to date (150246ms old)
2021-05-18 00:34:55|1621269295.602|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state init -> wait-join (state up)
2021-05-18 00:34:55|1621269295.602|BROADCAST|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:GroupCoordinator]: Broadcasting state change
2021-05-18 00:34:55|1621269295.603|JOINGROUP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: JoinGroup response: GenerationId -1, Protocol , LeaderId , my MemberId , member metadata count 0: Broker: Not coordinator
2021-05-18 00:34:55|1621269295.603|REJOIN|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local": Rejoining group without an assignment: JoinGroup error: Broker: Not coordinator
2021-05-18 00:34:55|1621269295.603|CGRPJOINSTATE|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" changed join state wait-join -> init (state up)
2021-05-18 00:34:55|1621269295.603|CGRPOP|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: Group "kale_scheduler_local" received op COORD_QUERY in state up (join-state init)
2021-05-18 00:34:55|1621269295.603|CGRPQUERY|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local": querying for coordinator: Broker: Not coordinator
2021-05-18 00:34:55|1621269295.603|CGRPCOORD|4a3fdbe0-dc6d-43e8-95d8-86de4aaccfe7#consumer-2| [thrd:main]: 10.70.23.17:9092/2: Group "kale_scheduler_local" coordinator is 10.70.23.17:9092 id 2

broker-2 log

[2021-05-18 00:06:42,144] INFO Deleted time index /var/lib/kafka/data/kale.coin.alert.rt-15/00000000000184047762.timeindex.deleted. (kafka.log.LogSegment)
[2021-05-18 00:10:13,066] WARN Client session timed out, have not heard from server in 7613ms for sessionid 0x20d9c3d684a0030 (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,066] INFO Client session timed out, have not heard from server in 7613ms for sessionid 0x20d9c3d684a0030, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,068] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.230:24807-280979 (kafka.network.Processor)
[2021-05-18 00:10:13,069] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.231:29029-281319 (kafka.network.Processor)
[2021-05-18 00:10:13,081] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.231:1852-280883 (kafka.network.Processor)
[2021-05-18 00:10:13,082] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.230:56342-281548 (kafka.network.Processor)
[2021-05-18 00:10:13,083] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.230:28726-281548 (kafka.network.Processor)
[2021-05-18 00:10:13,083] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.230:59100-281548 (kafka.network.Processor)
[2021-05-18 00:10:13,083] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.231:44145-281549 (kafka.network.Processor)
[2021-05-18 00:10:13,083] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.230:18210-281548 (kafka.network.Processor)
[2021-05-18 00:10:13,083] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.231:10816-281549 (kafka.network.Processor)
[2021-05-18 00:10:13,089] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.231:14973-281548 (kafka.network.Processor)
[2021-05-18 00:10:13,089] WARN Attempting to send response via channel for which there is no open connection, connection id 10.70.23.17:9092-10.70.38.230:17440-281548 (kafka.network.Processor)
[2021-05-18 00:10:13,452] INFO Opening socket connection to server 10.70.23.17/10.70.23.17:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,453] INFO Socket connection established, initiating session, client: /10.70.23.17:25358, server: 10.70.23.17/10.70.23.17:2181 (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,454] WARN Unable to reconnect to ZooKeeper service, session 0x20d9c3d684a0030 has expired (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,454] INFO Unable to reconnect to ZooKeeper service, session 0x20d9c3d684a0030 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,454] INFO EventThread shut down for session: 0x20d9c3d684a0030 (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,455] INFO [ZooKeeperClient Kafka server] Session expired. (kafka.zookeeper.ZooKeeperClient)
[2021-05-18 00:10:13,456] INFO [ZooKeeperClient Kafka server] Initializing a new session to 10.70.23.16:2181,10.70.23.17:2181,10.70.23.18:2181. (kafka.zookeeper.ZooKeeperClient)
[2021-05-18 00:10:13,456] INFO Initiating client connection, connectString=10.70.23.16:2181,10.70.23.17:2181,10.70.23.18:2181 sessionTimeout=6000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@5c530d1e (org.apache.zookeeper.ZooKeeper)
[2021-05-18 00:10:13,457] INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket)
[2021-05-18 00:10:13,457] INFO zookeeper.request.timeout value is 0. feature enabled= (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,464] INFO Opening socket connection to server 10.70.23.18/10.70.23.18:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,472] INFO Socket connection established, initiating session, client: /10.70.23.17:21720, server: 10.70.23.18/10.70.23.18:2181 (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,473] INFO Creating /brokers/ids/2 (is it secure? false) (kafka.zk.KafkaZkClient)
[2021-05-18 00:10:13,478] INFO Session establishment complete on server 10.70.23.18/10.70.23.18:2181, sessionid = 0x30413496fdd0028, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2021-05-18 00:10:13,487] INFO Stat of the created znode at /brokers/ids/2 is: 47245341967,47245341967,1621267813478,1621267813478,1,0,0,217319888150921256,313,0,47245341967 (kafka.zk.KafkaZkClient)
[2021-05-18 00:10:13,488] INFO Registered broker 2 at path /brokers/ids/2 with addresses: ArrayBuffer(EndPoint(10.70.23.17,9093,ListenerName(SASL_PLAINTEXT),SASL_PLAINTEXT), EndPoint(10.70.23.17,9092,ListenerName(EXTERNAL),PLAINTEXT), EndPoint(10.70.23.17,9091,ListenerName(INTERNAL),PLAINTEXT)), czxid (broker epoch): 47245341967 (kafka.zk.KafkaZkClient)

[2021-05-18 00:10:16,262] ERROR [MetadataCache brokerId=2] Listeners are not identical across brokers: Map(2 -> Map(ListenerName(EXTERNAL) -> 10.70.23.17:9092 (id: 2 rack: null), ListenerName(INTERNAL) -> 10.70.23.17:9091 (id: 2 rack: null), ListenerName(SASL_PLAINTEXT) -> 10.70.23.17:9093 (id: 2 rack: null)), 3 -> Map(ListenerName(EXTERNAL) -> 10.70.23.18:9092 (id: 3 rack: null), ListenerName(INTERNAL) -> 10.70.23.18:9091 (id: 3 rack: null)), 1 -> Map(ListenerName(EXTERNAL) -> 10.70.23.16:9092 (id: 1 rack: null), ListenerName(INTERNAL) -> 10.70.23.16:9091 (id: 1 rack: null), ListenerName(SASL_PLAINTEXT) -> 10.70.23.16:9093 (id: 1 rack: null))) (kafka.server.MetadataCache)

[2021-05-18 00:10:16,642] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Truncating partition kale.entry_test.test.de-3 to local high watermark 0 (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:16,642] INFO [Log partition=kale.entry_test.test.de-3, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)
[2021-05-18 00:10:16,642] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Truncating partition kale.marketplace.order_processing.auto_cancel.fulfilment_cancel.cl.dlq-10 to local high watermark 0 (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:16,642] INFO [Log partition=kale.marketplace.order_processing.auto_cancel.fulfilment_cancel.cl.dlq-10, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)
[2021-05-18 00:10:16,642] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Truncating partition kale.marketplace.order_processing.auto_cancel.fulfilment_cancel.co.dlq-4 to local high watermark 0 (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:16,642] INFO [Log partition=kale.marketplace.order_processing.auto_cancel.fulfilment_cancel.co.dlq-4, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)
[2021-05-18 00:10:16,642] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Truncating partition kale.marketplace.order_accounting.batch_processing.de-10 to local high watermark 0 (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:16,642] INFO [Log partition=kale.marketplace.order_accounting.batch_processing.de-10, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-21 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-46 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-21. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-40 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-46. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-34 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-40. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-34. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-49 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-49. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-28 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-28. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-38 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-22 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-17 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-38. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-22. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-17. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-32 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-32. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-11 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-11. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-5 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-5. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-48 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-48. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-47 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-42 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:16,644] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-41 (kafka.coordinator.group.GroupMetadataManager)

[2021-05-18 00:10:17,773] INFO [Log partition=kale.marketplace.order_processing.order_dispatching.sg.dlq-14, dir=/var/lib/kafka/data] Truncating to 2940 has no effect as the largest offset in the log is 2939 (kafka.log.Log)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.govtech.noti_center.sg.dlq-13 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.auto_cancel.fulfilment_cancel.tw.dlq-13 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.cancellation.my.dlq-14 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition __consumer_offsets-47 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_accounting.state_machine.rt-12 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.order_dispatching.my.dlq-13 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_accounting.invoice_file.ph.dlq-13 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.return.return_processing.de-14 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.cb_collection.order_sync.br.dlq-5 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.order_dispatching.id.dlq-12 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.cancellation.mx.dlq-12 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-05-18 00:10:17,773] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition kale.marketplace.order_processing.cb_collection.order_sync.my.dlq-8 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)

[2021-05-18 00:10:36,577] INFO [Partition __consumer_offsets-47 broker=2] __consumer_offsets-47 starts at Leader Epoch 275 from offset 1413900873. Previous Leader Epoch was: 274 (kafka.cluster.Partition)
[2021-05-18 00:10:36,585] INFO [Partition kale.marketplace.order_processing.cod_screening.tw.dlq-3 broker=2] kale.marketplace.order_processing.cod_screening.tw.dlq-3 starts at Leader Epoch 265 from offset 1662. Previous Leader Epoch was: 264 (kafka.cluster.Partition)
[2021-05-18 00:10:36,593] INFO [Partition kale.marketplace.order_processing.order_dispatching.id.dlq-14 broker=2] kale.marketplace.order_processing.order_dispatching.id.dlq-14 starts at Leader Epoch 251 from offset 61386. Previous Leader Epoch was: 250 (kafka.cluster.Partition)
[2021-05-18 00:10:36,602] INFO [Partition kale.marketplace.order_accounting.invoice_file.ph.dlq-1 broker=2] kale.marketplace.order_accounting.invoice_file.ph.dlq-1 starts at Leader Epoch 179 from offset 5424. Previous Leader Epoch was: 178 (kafka.cluster.Partition)
[2021-05-18 00:10:36,610] INFO [Partition kale.marketplace.order_accounting.tax_service.de-8 broker=2] kale.marketplace.order_accounting.tax_service.de-8 starts at Leader Epoch 89 from offset 266. Previous Leader Epoch was: 88 (kafka.cluster.Partition)
[2021-05-18 00:10:36,618] INFO [Partition kale.marketplace.return.return_noti.rt-12 broker=2] kale.marketplace.return.return_noti.rt-12 starts at Leader Epoch 253 from offset 2328825. Previous Leader Epoch was: 252 (kafka.cluster.Partition)
[2021-05-18 00:10:36,626] INFO [Partition test-autocreation09-8 broker=2] test-autocreation09-8 starts at Leader Epoch 269 from offset 0. Previous Leader Epoch was: 268 (kafka.cluster.Partition)
[2021-05-18 00:10:36,635] INFO [Partition kale.marketplace.order_processing.order_logistics.post_upd_hdlr.tw.dlq-6 broker=2] kale.marketplace.order_processing.order_logistics.post_upd_hdlr.tw.dlq-6 starts at Leader Epoch 268 from offset 16. Previous Leader Epoch was: 267 (kafka.cluster.Partition)
[2021-05-18 00:10:36,643] INFO [Partition kale.email.consumemq.rt-12 broker=2] kale.email.consumemq.rt-12 starts at Leader Epoch 271 from offset 142056. Previous Leader Epoch was: 270 (kafka.cluster.Partition)

[2021-05-18 00:10:38,046] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-23 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:38,046] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-2 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:38,046] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-47 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:38,046] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-44 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:38,046] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-41 (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:38,051] ERROR [MetadataCache brokerId=2] Listeners are not identical across brokers: Map(2 -> Map(ListenerName(EXTERNAL) -> 10.70.23.17:9092 (id: 2 rack: null), ListenerName(INTERNAL) -> 10.70.23.17:9091 (id: 2 rack: null), ListenerName(SASL_PLAINTEXT) -> 10.70.23.17:9093 (id: 2 rack: null)), 3 -> Map(ListenerName(EXTERNAL) -> 10.70.23.18:9092 (id: 3 rack: null), ListenerName(INTERNAL) -> 10.70.23.18:9091 (id: 3 rack: null)), 1 -> Map(ListenerName(EXTERNAL) -> 10.70.23.16:9092 (id: 1 rack: null), ListenerName(INTERNAL) -> 10.70.23.16:9091 (id: 1 rack: null), ListenerName(SASL_PLAINTEXT) -> 10.70.23.16:9093 (id: 1 rack: null))) (kafka.server.MetadataCache)
[2021-05-18 00:10:38,059] ERROR [MetadataCache brokerId=2] Listeners are not identical across brokers: Map(2 -> Map(ListenerName(EXTERNAL) -> 10.70.23.17:9092 (id: 2 rack: null), ListenerName(INTERNAL) -> 10.70.23.17:9091 (id: 2 rack: null), ListenerName(SASL_PLAINTEXT) -> 10.70.23.17:9093 (id: 2 rack: null)), 3 -> Map(ListenerName(EXTERNAL) -> 10.70.23.18:9092 (id: 3 rack: null), ListenerName(INTERNAL) -> 10.70.23.18:9091 (id: 3 rack: null)), 1 -> Map(ListenerName(EXTERNAL) -> 10.70.23.16:9092 (id: 1 rack: null), ListenerName(INTERNAL) -> 10.70.23.16:9091 (id: 1 rack: null), ListenerName(SASL_PLAINTEXT) -> 10.70.23.16:9093 (id: 1 rack: null))) (kafka.server.MetadataCache)
[2021-05-18 00:10:38,061] INFO [GroupMetadataManager brokerId=2] Finished loading offsets and group metadata from __consumer_offsets-20 in 15 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:10:38,062] INFO [GroupMetadataManager brokerId=2] Finished loading offsets and group metadata from __consumer_offsets-38 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)

[2021-05-18 00:24:23,492] INFO [GroupCoordinator 2]: Loading group metadata for kale_scheduler_local with generation 2141 (kafka.coordinator.group.GroupCoordinator)
[2021-05-18 00:24:23,492] INFO [GroupMetadataManager brokerId=2] Finished loading offsets and group metadata from __consumer_offsets-47 in 825402 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:24:23,492] INFO [GroupMetadataManager brokerId=2] Finished loading offsets and group metadata from __consumer_offsets-44 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:24:23,492] INFO [GroupMetadataManager brokerId=2] Finished loading offsets and group metadata from __consumer_offsets-41 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:24:23,492] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2021-05-18 00:24:53,493] INFO [GroupCoordinator 2]: Member 8fcfcbe4-747e-4423-8fbc-48ecad42784a-e4100be9-798b-4b46-a70c-1c9902f21045 in group kale_scheduler_local has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2021-05-18 00:24:53,493] INFO [GroupCoordinator 2]: Preparing to rebalance group kale_scheduler_local in state PreparingRebalance with old generation 2141 (__consumer_offsets-47) (reason: removing member 8fcfcbe4-747e-4423-8fbc-48ecad42784a-e4100be9-798b-4b46-a70c-1c9902f21045 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
uglycow commented 3 years ago

I think we encountered the same issue described in #2944; all the logs match. But we didn't do a rolling update of the brokers; the brokers experienced leader re-election because one of them disconnected from ZooKeeper.

uglycow commented 3 years ago

We kept encountering this error while using the fixed version (librdkafka 1.6.1).

So I checked the fix #3238, and I think the error still has a chance to arise.

rkb_nodename_epoch is incremented and consumed in two different threads, and two increments might be digested by a single assignment in rd_kafka_broker_connect, so the second connect might reuse the name-resolution result of the first connect attempt.

A simple sequence diagram of the case is attached (image).
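To make the suspected interleaving concrete, below is a small, self-contained C sketch. It is only a hypothesis and an illustration: the names (broker_sim, coordinator_changed, and so on) are invented and are not librdkafka internals. It models, sequentially, how a single connect attempt could digest two epoch bumps while still resolving a stale nodename snapshot, so that later connects see matching epochs and keep the stale address:

    /* Hypothetical illustration only; NOT librdkafka code. */
    #include <stdio.h>
    #include <string.h>

    struct broker_sim {
            char nodename[64];   /* rewritten by the main thread on coordinator change */
            int  nodename_epoch; /* bumped after each nodename rewrite */
            int  connect_epoch;  /* epoch last digested by a connect attempt */
            char resolved[64];   /* cached name-resolution result */
    };

    static void coordinator_changed(struct broker_sim *b, const char *host) {
            strncpy(b->nodename, host, sizeof(b->nodename) - 1);
            b->nodename[sizeof(b->nodename) - 1] = '\0';
            b->nodename_epoch++;
    }

    int main(void) {
            struct broker_sim b = { .nodename = "10.70.23.18" }; /* broker-3 */
            char snapshot[64];

            /* connect attempt #1 starts: nodename snapshot is taken first ... */
            strncpy(snapshot, b.nodename, sizeof(snapshot) - 1);
            snapshot[sizeof(snapshot) - 1] = '\0';

            /* ... coordinator moves back to broker-2 right now ... */
            coordinator_changed(&b, "10.70.23.17");

            /* ... connect attempt #1 finishes: it digests the new epoch but
             * resolves the stale snapshot. */
            b.connect_epoch = b.nodename_epoch;
            snprintf(b.resolved, sizeof(b.resolved), "addr-of(%s)", snapshot);
            printf("connect #1 -> %s\n", b.resolved);

            /* connect attempt #2: epochs match, so the cached (stale) address
             * is reused even though the nodename now points at broker-2. */
            if (b.connect_epoch == b.nodename_epoch)
                    printf("connect #2 -> %s (nodename is now %s)\n",
                           b.resolved, b.nodename);
            return 0;
    }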

I'm still not 100% sure about it. @edenhill @ajbarb, would you please take a look at this?

ajbarb commented 3 years ago

I don't see v1.6.1 having the fix: https://github.com/edenhill/librdkafka/blob/v1.6.1/src/rdkafka_broker.c#L2103

Can you try setting broker.address.ttl to 0 and see if that solves the problem? rkb_nodename_epoch is updated after rkb_nodename, so even if the first rd_kafka_broker_connect() used the second/aggregated rkb_nodename_epoch, it should connect to the updated rkb_nodename.
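For reference, a minimal sketch of setting that property through the plain librdkafka C configuration API (an application using the Go binding would instead put "broker.address.ttl": 0 into its kafka.ConfigMap); error handling is shortened:

    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    int main(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();

            /* broker.address.ttl controls how long resolved broker addresses
             * are cached (ms); 0 forces a fresh resolve on every connect. */
            if (rd_kafka_conf_set(conf, "broker.address.ttl", "0",
                                  errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
                    fprintf(stderr, "broker.address.ttl: %s\n", errstr);
                    return 1;
            }

            /* ... set the remaining properties and create the consumer with
             * rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr)) ... */
            rd_kafka_conf_destroy(conf);
            return 0;
    }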

Also, to confirm whether the cgrp is using an invalid broker IP, correlate these two log messages from:

  1. rd_kafka_cgrp_handle_FindCoordinator():

         rd_rkb_dbg(rkb, CGRP, "CGRPCOORD",
                    "Group \"%.*s\" coordinator is %s:%i id %"PRId32,
                    RD_KAFKAP_STR_PR(rkcg->rkcg_group_id),
                    mdb.host, mdb.port, mdb.id);

  2. rd_kafka_transport_connect():

         rd_rkb_dbg(rkb, BROKER, "CONNECT", "Connecting to %s (%s) "
                    "with socket %i",
                    rd_sockaddr2str(sinx, RD_SOCKADDR2STR_F_FAMILY |
                                    RD_SOCKADDR2STR_F_PORT),
                    rd_kafka_secproto_names[rkb->rkb_proto], s);
uglycow commented 3 years ago
  1. We just released a version with broker.address.ttl = 0.
  2. In rd_kafka_broker_connect, the nodename is copied; rkb->rkb_nodename might be updated after the nodename has been copied:

    static int rd_kafka_broker_connect (rd_kafka_broker_t *rkb) {
            const rd_sockaddr_inx_t *sinx;
            char errstr[512];
            char nodename[RD_KAFKA_NODENAME_SIZE];
            rd_bool_t reset_cached_addr = rd_false;

            rd_rkb_dbg(rkb, BROKER, "CONNECT",
                       "broker in state %s connecting",
                       rd_kafka_broker_state_names[rkb->rkb_state]);

            rd_atomic32_add(&rkb->rkb_c.connects, 1);

            rd_kafka_broker_lock(rkb);

            rd_strlcpy(nodename, rkb->rkb_nodename, sizeof(nodename));  // <-- here the nodename is copied
  3. I think I made a mistake about the version.
     I was checking the code of the master branch at a commit (screenshot attached) with
     #define RD_KAFKA_VERSION  0x010601ff

     which shows that its version is 1.6.1.

But I should have checked the code on branch v1.6.1, and you are right: there is no fix in v1.6.1.

So we are actually using a version without the fix.

Anyway, we will try running with broker.address.ttl = 0 for the moment.
Please kindly let me know if you reach any conclusion on this.

uglycow commented 3 years ago

@ajbarb just to update you: setting broker.address.ttl to 0 does solve the problem; no stuck consumers anymore.