This is a bit of an edge case but we've run into it pretty consistently with our setup. The logical steps are as follows:
Given two consumers that have successfully connected to a broker and started heartbeats:
1. The next heartbeat is currently scheduled.
2. connect() is called outside the heartbeat loop (e.g. because the socket closed).
3. The next heartbeat fires and gets a rebalance error because of the in-progress reconnect.
4. Another reconnect is scheduled because of the heartbeat error.
5. The first connect finishes.
6. The heartbeat interval is cleared and restarted.
7. The next heartbeat succeeds on the latest generation id.
8. The reconnect scheduled by the earlier heartbeat failure occurs (outside the context of the current heartbeat loop, i.e. with the old generation id).
GOTO 3.
Basically, the problem seems to be kicked off by connect() getting called by some mechanism other than a heartbeat failure (in this case a socket close event, which triggers a reconnect). Since that path does not cancel the heartbeat interval, a scheduled heartbeat can fire while the connection (rebalance) is in progress. That heartbeat receives error code 27 (REBALANCE_IN_PROGRESS) and triggers a rebalance, scheduling another connection one second in the future. Assuming the first connect() call finishes in time, it starts a new heartbeat loop but does not clear the already scheduled reconnect. One second later the reconnect occurs, but the latest heartbeat loop is still running and receives error code 27 on its next request, triggering yet another reconnect, and so on.
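The interplay reduces to two timers feeding each other: a heartbeat interval that is restarted on connect, and a reconnect timeout that nothing ever clears. The sketch below is not kafka-node's internals; the class, timings, and `rebalancing` flag are hypothetical stand-ins, but it reproduces the shape of the loop described above.

```js
const HEARTBEAT_MS = 1000;
const CONNECT_MS = 1500;         // pretend the join/sync round trip outlives one heartbeat
const RECONNECT_DELAY_MS = 1000;

class FakeGroupMember {
  constructor() {
    this.heartbeatTimer = null;
    this.reconnectTimer = null;  // never cleared on a successful connect -- the bug
    this.rebalancing = false;
    this.generationId = 0;
  }

  connect() {
    this.rebalancing = true;
    setTimeout(() => {           // join/sync finished
      this.generationId += 1;
      this.rebalancing = false;
      this.startHeartbeats();    // interval is cleared and restarted (step 6)...
      // ...but a reconnect scheduled by an earlier failed heartbeat stays pending
    }, CONNECT_MS);
  }

  startHeartbeats() {
    clearInterval(this.heartbeatTimer);
    this.heartbeatTimer = setInterval(() => this.sendHeartbeat(), HEARTBEAT_MS);
  }

  sendHeartbeat() {
    if (this.rebalancing) {
      // the broker would answer with error code 27 here
      console.log('heartbeat failed during rebalance, scheduling reconnect');
      this.reconnectTimer = setTimeout(() => this.connect(), RECONNECT_DELAY_MS);
    } else {
      console.log(`heartbeat ok on generation ${this.generationId}`);
    }
  }
}

const member = new FakeGroupMember();
member.startHeartbeats();
member.connect(); // a socket close outside the heartbeat loop kicks the cycle off
```

Run it and the heartbeat failures and reconnects alternate indefinitely, even though every individual connect succeeds.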
To simulate this problem, I added some code to the consumerGroup that calls connect() a few times one second apart. This is enough to throw it into a loop when running with two consumers against my local Kafka.
https://github.com/taplytics/kafka-node/commit/8fd6b927985f78985976892fc0145e11ac6b2d20
Just set process.env.FAKE_CONNECT=1 for one consumer and not the other.
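For reference, the simulation hook is roughly of this shape (hypothetical names and placement; the actual change is in the commit linked above):

```js
// Somewhere after the ConsumerGroup's initial connect:
if (process.env.FAKE_CONNECT) {
  let remaining = 3;                 // a few extra connects, one second apart
  const timer = setInterval(() => {
    if (remaining-- <= 0) return clearInterval(timer);
    consumerGroup.connect();         // mimics a socket-close-triggered reconnect
  }, 1000);
}
```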