braedon / prometheus-kafka-consumer-group-exporter

Prometheus Kafka Consumer Group Exporter
MIT License

Error while running the exporter #38

Closed chhetripradeep closed 3 years ago

chhetripradeep commented 4 years ago

Thank you for writing this awesome exporter. I recently tried upgrading the exporter from 0.2 to 0.5.5, along with kafka-python to 1.4.7. Although most of the metrics were populated correctly, I got the following error: https://gist.githubusercontent.com/chhetripradeep/d294a90a1ee9871ea41a85cc5fed5566/raw/8c1e1a9957ffab6d5085b3f70c5ad525d1a54eb6/gistfile1.txt

I am running Kafka v1.0.0 clusters. Thank you for the help.

braedon commented 4 years ago

Hi @chhetripradeep, Sorry to hear you're having issues.

The error log you linked is a timeout when fetching low-water marks. Do you get similar errors for other things, e.g. high-water marks? Are all fetches failing (i.e. every 10 seconds with the default config), or just occasionally?

I've tried that combination of exporter/kafka-python/kafka locally, and it seems to work fine, with no load at least. It's possibly a legitimate timeout fetching from your Kafka cluster. Are your brokers under significant load? Do you have a large number of topics? Have you changed any timeouts using the --consumer-config flag?

chhetripradeep commented 4 years ago

Hi @braedon , thank you for the quick reply.

> Do you get similar errors for other things, e.g. high-water marks?

I don't see the errors while collecting high-water marks or any other metrics.

> Are all fetches failing (i.e. every 10 seconds with the default config), or just occasionally?

It's failing to fetch low-water mark metrics for a few partitions of a topic, e.g.: https://gist.githubusercontent.com/chhetripradeep/92303ddfba789c797b5de36c8fcac319/raw/064e4ed89483aa73d33f5a2bb80a35ddd66b2526/gistfile1.txt

> Are your brokers under significant load?

Probably this is the reason.

> Do you have a large number of topics?

I have around 20 topics, each with between 16 and 150 partitions.

> Have you changed any timeouts using the --consumer-config flag?

Let me try changing this timeout flag and see how things go.

Thank you for the help Braedon. I will update the thread with the changes.

chhetripradeep commented 4 years ago

Hi @braedon, I tried updating the timeout to 20 seconds, but it still fails with the same error. I noticed it only happens in the cluster with a large number of topic partitions.

braedon commented 4 years ago

@chhetripradeep Which timeout did you set - there's a few haha.

Digging into the code a bit more, it looks like the timeout is occurring when sending (b'x') on the "wakeup" socket, not on a connection to Kafka. I guess the send buffer must be filling up somehow when processing the large response from Kafka.

The setting to control this timeout is wakeup_timeout_ms, which you can set in the config file with the key wakeup.timeout.ms. Can you try setting it to 20 seconds - it's 3 seconds by default.
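To make the suspected failure mode concrete, here is a minimal sketch of a send on a local socket pair timing out once its buffer is full. It is not kafka-python's code: it only assumes, as described above, that the wakeup channel is a local socket written to with a send timeout. The helper names `fill_until_timeout` and `drain` are hypothetical.

```python
import socket


def fill_until_timeout(writer, timeout_s=0.2, chunk=b"x"):
    """Send `chunk` repeatedly until a send times out; return bytes sent."""
    writer.settimeout(timeout_s)
    sent = 0
    while True:
        try:
            sent += writer.send(chunk)
        except socket.timeout:
            # Nobody is reading the other end, so the buffer stayed full
            # for the whole timeout window.
            return sent


def drain(reader):
    """Read everything currently buffered on `reader`; return bytes read."""
    reader.setblocking(False)
    drained = 0
    while True:
        try:
            data = reader.recv(65536)
        except BlockingIOError:
            return drained
        if not data:
            return drained
        drained += len(data)
```

If this is what is happening, a longer timeout only helps when something drains the read side in the meantime. Once `drain` has run, the next `send` succeeds immediately.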

chhetripradeep commented 4 years ago

Hi @braedon, haha, I had set consumer.timeout.ms. Nice catch. Let me try wakeup.timeout.ms instead and see if the error stops. Thanks a lot for your continuous help :)

chhetripradeep commented 4 years ago

Hi @braedon, it looks like there is no consumer config variable named wakeup_timeout_ms. I think that config variable is defined only by the Python Kafka client.

braedon commented 4 years ago

Ugh, I was hoping we'd be able to set it via the config file anyway - all those settings get passed to the KafkaConsumer, and through to the underlying KafkaClient - but it looks like KafkaConsumer validates the settings, and doesn't know about KafkaClient's wakeup_timeout_ms.

Looks like there isn't any way to pass settings KafkaConsumer doesn't know about to the underlying KafkaClient, so we'll need to construct a client ourselves manually.

I've thrown together a quick branch - https://github.com/braedon/prometheus-kafka-consumer-group-exporter/pull/39 - can you try testing that code, or do you need a pypi/docker release?
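The validation problem described in this comment can be illustrated with a small sketch. `OuterConsumer` and `InnerClient` are hypothetical stand-ins, not kafka-python's classes, and every default except the 3-second wakeup timeout mentioned earlier in the thread is made up for illustration. `to_kwarg` assumes config-file keys map to keyword arguments by swapping dots for underscores.

```python
def to_kwarg(key):
    """Map a config-file key like 'wakeup.timeout.ms' to a kwarg name."""
    return key.replace(".", "_")


class InnerClient:
    # Hypothetical stand-in for the low-level client; it knows the wakeup timeout.
    DEFAULTS = {"wakeup_timeout_ms": 3000, "request_timeout_ms": 30000}

    def __init__(self, **configs):
        self.config = dict(self.DEFAULTS)
        # Take only the keys this client understands.
        self.config.update(
            {k: v for k, v in configs.items() if k in self.DEFAULTS}
        )


class OuterConsumer:
    # Hypothetical stand-in for the consumer; it does NOT list wakeup_timeout_ms.
    DEFAULTS = {"consumer_timeout_ms": None, "request_timeout_ms": 30000}

    def __init__(self, **configs):
        unknown = set(configs) - set(self.DEFAULTS)
        if unknown:
            # Unknown keys are rejected here, so they never reach the client.
            raise ValueError("Unrecognized configs: %s" % sorted(unknown))
        self.config = dict(self.DEFAULTS, **configs)
        self.client = InnerClient(**self.config)
```

In this sketch, `OuterConsumer(**{to_kwarg("wakeup.timeout.ms"): 20000})` raises `ValueError`, while `InnerClient(wakeup_timeout_ms=20000)` accepts the setting, which is why constructing the client directly is the way around the check.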

chhetripradeep commented 4 years ago

Sorry for the late response, @braedon. I was on vacation. I will try the branch and let you know. Thanks for the continuous support. It is much appreciated.

chhetripradeep commented 4 years ago

Hi @braedon, I just tried your fix, but it looks like those timeouts are still happening - https://gist.githubusercontent.com/chhetripradeep/70246f8e17289b319747d3db4aa4f31d/raw/d5e504adabf8bed82d41c6ca6211cfae7c2d30c6/gistfile1.txt

braedon commented 4 years ago

Darn, I wonder if the client can't consume anything from the send buffer once it gets full, preventing it from clearing, even with a long timeout... Not sure.

I think I'll have to try and replicate it locally, and do some debugging.

chhetripradeep commented 4 years ago

Hello @braedon, can you think of any workaround for this issue? If you could give me some pointers on where to look, that would be great.

braedon commented 4 years ago

Hi @chhetripradeep, I haven't been able to replicate the issue you're seeing locally. Even with 25k partitions on a single node, and significant read and write load, the exporter is still able to run with no issues. There are a bunch of factors that make my tests different to your case - exporter on the same computer, single Kafka node, etc. - but I'm rather limited in what I can set up for testing currently.

As a possible workaround, you could try increasing the refresh interval for topic/high-water/low-water info, using the --topic-interval/--high-water-interval/--low-water-interval options, to see if spacing requests out more helps.

Beyond that, you could try debugging the kafka-python code yourself - particularly looking at what's happening with the wakeup socket. e.g. is it actually timing out because it's filling up with too many wakeup requests (if so, what exactly is generating them), or is something else causing the timeout.

You could try and replicate it with a simplified script that makes the same calls to kafka-python, but eliminates the rest of the exporter. If you can replicate it there, you could try raising an issue with kafka-python.
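A possible starting point for such a script is sketched below. It uses kafka-python's public `KafkaConsumer` methods (`topics`, `partitions_for_topic`, `beginning_offsets`), which may not be the same lower-level calls the exporter makes, so treat it as an approximation. The function name `time_low_water_fetches` and the `localhost:9092` address are assumptions for illustration.

```python
import time


def time_low_water_fetches(consumer, make_partition, rounds=3):
    """Fetch beginning offsets for every partition, timing each round.

    `consumer` needs topics(), partitions_for_topic() and beginning_offsets();
    `make_partition(topic, partition)` builds the partition key.
    """
    timings = []
    for _ in range(rounds):
        partitions = [
            make_partition(topic, partition)
            for topic in sorted(consumer.topics())
            # partitions_for_topic may return None for an unknown topic.
            for partition in sorted(consumer.partitions_for_topic(topic) or ())
        ]
        start = time.monotonic()
        offsets = consumer.beginning_offsets(partitions)
        timings.append((len(offsets), time.monotonic() - start))
    return timings


def main(bootstrap_servers="localhost:9092"):
    # Imported lazily so the helper above can be used without kafka-python.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        for count, seconds in time_low_water_fetches(consumer, TopicPartition):
            print("fetched %d low-water marks in %.3fs" % (count, seconds))
    finally:
        consumer.close()
```

Calling `main()` against the affected cluster and watching for the same timeout would show whether the problem reproduces outside the exporter.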

braedon commented 3 years ago

Closing as I couldn't replicate the issue, and it's been quite a while. If anyone else is running into the same issue, please re-open.