Closed: DavSanchez-DPT closed this issue 1 year ago
We finally solved this by reducing the number of scrape samples (restricting `offsets.retention.minutes` in Kafka so that inactive consumer group information is deleted earlier) and by increasing the Prometheus scrape timeout so the exporter has enough time to gather and expose the metrics.
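For reference, the two changes can be sketched roughly as follows (the exact values are illustrative, not necessarily the ones we ended up using; the target host is a placeholder):

```properties
# Kafka broker (server.properties): expire offsets of inactive
# consumer groups sooner, so the exporter has fewer groups to scrape
offsets.retention.minutes=1440
```

```yaml
# Prometheus (prometheus.yml): give the exporter more time per scrape
scrape_configs:
  - job_name: kafka_exporter
    scrape_interval: 2m
    scrape_timeout: 110s   # must stay below scrape_interval
    static_configs:
      - targets: ['kafka-exporter-host:9308']
```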
Hi,
We currently have a `kafka_exporter` instance connected to one of our Kafka development clusters. There are a lot of scrape items (6 brokers with ~2,500 partitions each and ~2,000 consumer groups), and querying the `/metrics` endpoint directly (with `curl`) sometimes returns a 504 after about a minute. Our Prometheus instance sometimes shows a `context deadline exceeded` error and considers the `kafka_exporter` target to be down. At other intervals it outputs all the metrics correctly; it keeps working and then times out again intermittently.

This is the output of
`scrape_samples_scraped` for the `kafka_exporter` instance over the last 12 hours:

I assume that directly reducing the number of scrape items our Kafka clusters produce would help (currently close to 100k, though another `kafka_exporter` instance for a cluster with ~80k samples works correctly), but since the endpoint already has issues when queried directly, do you have any advice on where we should act? Perhaps there is a configurable item we could tweak so the node takes less time to retrieve the metrics, such as `topic.workers`
? We have plenty of resources dedicated to the instance where `kafka_exporter` runs, so it does not seem resource-related as in issue #305, unless there are implicit limits on resource usage that we are not aware of. `kafka_exporter` shows no errors in its logs; it seems to just start and listen on the configured port. As said, the issue is that it only works intermittently.

Thank you very much for your help!
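In case it helps, this is roughly how we are measuring the endpoint and invoking the exporter (host names and values below are placeholders, and flag names may differ between `kafka_exporter` versions, so check `kafka_exporter --help` on your build):

```
# Time the /metrics endpoint directly; this is where we see 504s
# after about a minute
time curl -s -o /dev/null -w '%{http_code}\n' --max-time 120 \
  http://kafka-exporter-host:9308/metrics

# Exporter invocation; --topic.workers is the knob mentioned above
kafka_exporter \
  --kafka.server=broker-1:9092 \
  --topic.workers=100
```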