danielqsj / kafka_exporter

Kafka exporter for Prometheus
Apache License 2.0

kafka_exporter works intermittently (`/metrics` return 504 Gateway Time-out) #336

Closed · DavSanchez-DPT closed this issue 1 year ago

DavSanchez-DPT commented 2 years ago

Hi,

We currently have a kafka_exporter instance connected to one of our Kafka development clusters. It has a lot of scrape samples (6 brokers with ~2500 partitions each, ~2000 consumer groups), and sometimes querying the `/metrics` endpoint directly with curl returns a 504 after a minute. Our Prometheus instance sometimes shows the context deadline exceeded error and considers the kafka_exporter target to be down. At other intervals it is able to output all the metrics correctly; it keeps working and then timing out intermittently.

$ time curl -X GET http://kafka-exporter.url/metrics
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>

real    1m0.047s
user    0m0.024s
sys     0m0.008s

This is the output of scrape_samples_scraped for the kafka_exporter instance over the last 12 hours:

[graph: scrape_samples_scraped for this target over the last 12 hours]

I assume that directly reducing the number of scrape samples our Kafka cluster produces (currently close to 100k, though another kafka_exporter instance for a cluster with 80k samples works correctly) would help, but since the endpoint already has issues when queried directly, do you have any advice on where we should act? Perhaps there is a configurable item we could tweak so the exporter takes less time to retrieve the metrics, such as topic.workers? We have dedicated plenty of resources to the instance where kafka_exporter is running, so it's not related to resources as in issue #305, unless there are implicit limits on resource usage that we are not aware of.
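
For illustration, something along these lines is what I had in mind, assuming these flags exist in the kafka_exporter version we run (please correct me if not); the broker addresses and filter regexes below are just placeholders, not our real topology:

# Check `kafka_exporter --help` for the exact flags and defaults of your version.
kafka_exporter \
  --kafka.server=broker-1:9092 \
  --kafka.server=broker-2:9092 \
  --topic.filter='^(app|infra)\..*' \
  --group.filter='^(app|infra)-.*' \
  --topic.workers=200 \
  --web.listen-address=:9308

The idea would be to raise the number of concurrent topic workers and to filter out topics and consumer groups we do not actually monitor, so each scrape has fewer offsets to fetch.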

kafka_exporter shows no errors in its logs; it looks like it has just started and is listening on the configured port. As said, the issue is that it only works intermittently.

Thank you very much for your help!

DavSanchez-DPT commented 1 year ago

We finally solved this by reducing the number of scrape samples (restricting offsets.retention.minutes in Kafka so that inactive consumer group information is deleted earlier) and by increasing the Prometheus scrape timeout so the exporter is able to gather and expose the metrics in time.
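
For reference, the Prometheus side of the fix looks roughly like this (job name, target and values are illustrative, not our exact configuration); note that scrape_timeout cannot exceed scrape_interval:

# prometheus.yml
scrape_configs:
  - job_name: kafka-exporter
    scrape_interval: 2m
    scrape_timeout: 90s
    static_configs:
      - targets: ['kafka-exporter.url:9308']

On the Kafka side the change was lowering offsets.retention.minutes in the broker configuration (the default is 10080 minutes, i.e. 7 days, in recent Kafka versions), so the offsets of inactive consumer groups expire sooner and the exporter has fewer groups to enumerate.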