This is a bit of an edge case but we've run into it pretty consistently with our setup. The logical steps are as follows:
Given two consumers that have successfully connected to a broker and started heartbeats:
1. The next heartbeat is currently scheduled.
2. connect() is called outside the heartbeat loop (e.g. because the socket closed).
3. The next heartbeat fires and gets a rebalance error because of the in-progress reconnect.
4. Another reconnect is scheduled because of the heartbeat error.
5. The first connect finishes.
6. The heartbeat interval is cleared and restarted.
7. The next heartbeat succeeds on the latest generation id.
8. The reconnect scheduled by the earlier heartbeat failure occurs (outside the context of the current heartbeat loop, i.e. with the old generation id).
GOTO 3.
Basically, the problem seems to be kicked off by connect() getting called by some mechanism other than a heartbeat failure (in this case a socket close event, which triggers a reconnect). Since that path does not cancel the heartbeat interval, a scheduled heartbeat can fire while the connection (rebalance) is in progress. That heartbeat receives error code 27 (REBALANCE_IN_PROGRESS) and triggers a rebalance, scheduling another connection one second in the future. Assuming the first connect() call finishes in time, it starts a new heartbeat loop but does not clear the already scheduled reconnect. One second later the reconnect occurs, but the latest heartbeat loop is still running and receives error code 27 on its next request, triggering yet another reconnect, and so on.
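The interplay reduces to two timers feeding each other: a heartbeat interval that is restarted on connect, and a reconnect timeout that nothing ever clears. The sketch below is not kafka-node's internals; the class, timings, and `rebalancing` flag are hypothetical stand-ins, but it reproduces the shape of the loop described above.

```js
const HEARTBEAT_MS = 1000;
const CONNECT_MS = 1500;         // pretend the join/sync round trip outlives one heartbeat
const RECONNECT_DELAY_MS = 1000;

class FakeGroupMember {
  constructor() {
    this.heartbeatTimer = null;
    this.reconnectTimer = null;  // never cleared on a successful connect -- the bug
    this.rebalancing = false;
    this.generationId = 0;
  }

  connect() {
    this.rebalancing = true;
    setTimeout(() => {           // join/sync finished
      this.generationId += 1;
      this.rebalancing = false;
      this.startHeartbeats();    // interval is cleared and restarted (step 6)...
      // ...but a reconnect scheduled by an earlier failed heartbeat stays pending
    }, CONNECT_MS);
  }

  startHeartbeats() {
    clearInterval(this.heartbeatTimer);
    this.heartbeatTimer = setInterval(() => this.sendHeartbeat(), HEARTBEAT_MS);
  }

  sendHeartbeat() {
    if (this.rebalancing) {
      // the broker would answer with error code 27 here
      console.log('heartbeat failed during rebalance, scheduling reconnect');
      this.reconnectTimer = setTimeout(() => this.connect(), RECONNECT_DELAY_MS);
    } else {
      console.log(`heartbeat ok on generation ${this.generationId}`);
    }
  }
}

const member = new FakeGroupMember();
member.startHeartbeats();
member.connect(); // a socket close outside the heartbeat loop kicks the cycle off
```

Run it and the heartbeat failures and reconnects alternate indefinitely, even though every individual connect succeeds.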
To simulate this problem, I added some code to the consumerGroup that calls connect() a few times one second apart. This is enough to throw it into a loop when running with two consumers against my local Kafka.
https://github.com/taplytics/kafka-node/commit/8fd6b927985f78985976892fc0145e11ac6b2d20
Just set process.env.FAKE_CONNECT=1 for one consumer and not the other.
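For reference, the simulation hook is roughly of this shape (hypothetical names and placement; the actual change is in the commit linked above):

```js
// Somewhere after the ConsumerGroup's initial connect:
if (process.env.FAKE_CONNECT) {
  let remaining = 3;                 // a few extra connects, one second apart
  const timer = setInterval(() => {
    if (remaining-- <= 0) return clearInterval(timer);
    consumerGroup.connect();         // mimics a socket-close-triggered reconnect
  }, 1000);
}
```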