braedon / prometheus-kafka-consumer-group-exporter

Prometheus Kafka Consumer Group Exporter

Timeout must not be negative #40

Open gfelixc opened 4 years ago

gfelixc commented 4 years ago

The container sometimes fails with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/prometheus-kafka-consumer-group-exporter", line 11, in <module>
    load_entry_point('prometheus-kafka-consumer-group-exporter', 'console_scripts', 'prometheus-kafka-consumer-group-exporter')()
  File "/usr/src/app/prometheus_kafka_consumer_group_exporter/__init__.py", line 165, in main
    for message in consumer:
  File "/usr/local/lib/python3.8/site-packages/kafka/consumer/group.py", line 1181, in __next__
    return self.next_v2()
  File "/usr/local/lib/python3.8/site-packages/kafka/consumer/group.py", line 1189, in next_v2
    return next(self._iterator)
  File "/usr/local/lib/python3.8/site-packages/kafka/consumer/group.py", line 1106, in _message_generator_v2
    record_map = self.poll(timeout_ms=timeout_ms, update_offsets=False)
  File "/usr/local/lib/python3.8/site-packages/kafka/consumer/group.py", line 635, in poll
    assert timeout_ms >= 0, 'Timeout must not be negative'
AssertionError: Timeout must not be negative

Any ideas?

braedon commented 4 years ago

Hi @gfelixc, that's not something I've seen, but it looks like it might be a bug in the kafka-python library.

In KafkaConsumer, next_v2() sets _consumer_timeout to some time in the future (based on consumer_timeout_ms), and then calls next() on _message_generator_v2() while _consumer_timeout hasn't been reached. _message_generator_v2() then subtracts the current time (time.time()) from _consumer_timeout to get the timeout_ms to pass to poll().

If too much time elapses between checking whether _consumer_timeout has been reached and calculating timeout_ms, the result could end up negative. It seems like _message_generator_v2() should check for this, and use 0 if it calculates a negative timeout_ms.
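To illustrate the suspected race and the suggested clamp, here is a minimal sketch. It is not kafka-python's actual code: `remaining_timeout_ms` is a hypothetical helper, and the timestamps are made-up values chosen so the deadline passes between the check and the calculation.

```python
import time


def remaining_timeout_ms(deadline, now=None):
    """Hypothetical helper: milliseconds until `deadline`, clamped at 0.

    `deadline` is an absolute time.time() value, like _consumer_timeout.
    """
    if now is None:
        now = time.time()
    return max(0.0, 1000 * (deadline - now))


# Simulated race (made-up timestamps): the deadline check passes, but the
# clock moves past the deadline before the timeout is calculated.
deadline = 100.0
check_time = 99.999     # "deadline not reached yet" is true here
compute_time = 100.002  # 3 ms later, when timeout_ms is calculated

assert check_time < deadline

# Without a clamp the value is negative, which would trip poll()'s assertion.
unclamped = 1000 * (deadline - compute_time)

# With the clamp the value is 0.0, which poll() accepts.
clamped = remaining_timeout_ms(deadline, compute_time)
```

With a clamp like this, a late calculation would turn into a non-blocking poll rather than an AssertionError.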

Assuming my quick analysis is correct, would you be able to raise an issue (or PR) with kafka-python to get this fixed?

braedon commented 4 years ago

(Just to check - you haven't changed consumer_timeout_ms from the default 500ms, have you?)

gfelixc commented 4 years ago

The image has been deployed as is, with no config changes. I'll raise this with kafka-python as you suggest, and I'll let you know once it's fixed. Thanks a lot. Would you mind keeping this ticket open until a kafka-python release with the fix is available?

braedon commented 4 years ago

No worries, I'll keep this open. If you could link to the kafka-python issue once it's created, that'd be great.