na-ka-na opened this issue 1 year ago
Our consumer loop is as follows:
```python
from typing import Callable, List

from confluent_kafka import Consumer, KafkaException, Message

self.consumer = Consumer(self.conf, logger=self.logger)
self.consumer.subscribe(
    topics=self.topics,
    on_assign=self._print_topic_subscription_fn("Assigned"),
    on_revoke=self._print_topic_subscription_fn("Revoked"),
    on_lost=self._print_topic_subscription_fn("Lost"),
)

def _consume_loop(self, func: Callable[[List[Message]], None], batch_size: int = 1):
    while True:
        messages = self.consumer.consume(num_messages=batch_size, timeout=2.0)
        if not messages:
            continue
        errored_messages = []
        for message in messages:
            if message.error():
                errored_messages.append(message)
                self.logger.error(
                    f"Client {self._client_id} got message with error, message={message}, "
                    f"error={message.error()}"
                )
        if errored_messages:
            raise KafkaException(errored_messages[0].error())
        if self.semantics == ConsumerSemantics.AT_MOST_ONCE:
            self.consumer.commit()
        with timeout(seconds=self.max_batch_processing_seconds):
            func(messages)
        if self.semantics == ConsumerSemantics.AT_LEAST_ONCE:
            self.consumer.commit()
```
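The `timeout` context manager used in the loop is not shown in the snippet and is not part of confluent-kafka. A minimal sketch of what it could look like, assuming a Unix `SIGALRM`-based implementation (name and behavior are guesses from usage):

```python
import signal
from contextlib import contextmanager

@contextmanager
def timeout(seconds: int):
    """Raise TimeoutError if the wrapped block runs longer than `seconds`.

    Hypothetical reconstruction of the helper used in the consume loop.
    Unix-only (relies on SIGALRM) and only valid on the main thread.
    """
    def _handler(signum, frame):
        raise TimeoutError(f"batch processing exceeded {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

Note that with at-least-once semantics, a `TimeoutError` raised here aborts the loop before the commit, so the batch will be redelivered after restart.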
I managed to work around the issue: I forced the consumer to consume one message from each stuck partition, and now the consumers recognize all partitions.
This is weird; I'm not sure what the underlying issue is, but there is surely some problem.
How exactly did you "force the consumer to consume one message from stuck partitions" so that the consumers recognized all partitions? Thank you!
We encountered the same issue and suspect it is a bug on the client side, but there is no firm evidence yet.
@na-ka-na or @SmartXingZhou , any update about this problem?
I'm having the same issue, but with the C# library v1.8.2.
Description
We had 100 partitions and 100 consumers and things worked as expected. But last night I increased the number of partitions to 1000 and the number of consumers to 200.
As expected, each of the 200 consumers was assigned 5 partitions. I confirmed this from the consumer logs and from `./kafka-consumer-groups.sh --describe`.
Partitions numbered 0-99 were consumed fully, as expected. But only about a fifth of the partitions numbered 100-999 were consumed.
Each partition was either fully consumed or had not a single message consumed from it. It is as if the consumer didn't even "connect" or bother to consume from those partitions. It is really bizarre! E.g. see below:
How to reproduce
Don't know
Checklist
Please provide the following information:
- confluent-kafka-python and librdkafka version (`confluent_kafka.version()` and `confluent_kafka.libversion()`): ('1.8.2', 17302016) and ('1.8.2', 17302271) respectively
- Apache Kafka broker version: 2.8.1 (Commit:839b886f9b732b15)
- Client configuration: `{...}`
- Operating system: Linux 5.4.219-126.411.amzn2.x86_64
- Provide client logs (with `'debug': '..'` as necessary)