danielqsj / kafka_exporter

Kafka exporter for Prometheus
Apache License 2.0

Estimated lag duration #229

Open OuesFa opened 3 years ago

OuesFa commented 3 years ago

Using some metrics already provided by the exporter, I'm trying to estimate the lag duration, i.e. how much time the consumer needs to process all the late events.

Could someone review my formula?

sum(kafka_consumergroup_lag{consumergroup="my-consumer",topic="my-topic"}) by (consumergroup, topic) / sum(delta(kafka_consumergroup_current_offset{consumergroup="my-consumer",topic="my-topic"}[1m])/60) by (consumergroup, topic)

I use it in a Grafana panel with seconds as the unit of the left Y axis.
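For readers who want to check the arithmetic behind this formula, here is a minimal sketch in Python. The function name `estimate_lag_seconds` and the sample numbers are hypothetical, and it assumes the offset delta over the window is already known (PromQL's `delta()` also extrapolates to the window boundaries, which this sketch ignores).

```python
from typing import Optional


def estimate_lag_seconds(
    total_lag: float, offset_delta: float, window_seconds: float = 60.0
) -> Optional[float]:
    """Seconds needed to consume `total_lag` messages at the observed consume rate.

    total_lag    ~ sum(kafka_consumergroup_lag)
    offset_delta ~ sum(delta(kafka_consumergroup_current_offset[1m]))
    """
    consume_rate = offset_delta / window_seconds  # messages per second
    if consume_rate <= 0:
        # No forward progress in the window. PromQL would return +Inf
        # (or NaN for 0/0) when the rate is exactly zero.
        return None
    return total_lag / consume_rate


# Hypothetical numbers: 12,000 messages behind, 3,000 messages consumed in 60 s
# gives a rate of 50 msg/s, so the estimate is 12000 / 50 seconds.
print(estimate_lag_seconds(total_lag=12_000, offset_delta=3_000))  # 240.0
```

The result is in seconds because the rate is expressed per second, which matches the unit chosen for the panel.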

Thanks a lot 🙏

christidis commented 3 years ago

Did you see this post? https://github.com/danielqsj/kafka_exporter/issues/32#issuecomment-410127012

And if you need this by consumergroup and topic, it should look like this:

(sum(kafka_consumergroup_lag{instance="$instance",topic=~"$topic"}) by (consumergroup, topic)) / (-1 * sum(delta(kafka_consumergroup_lag{instance="$instance",topic=~"$topic"}[15m]) < 0) by (consumergroup, topic)) * 15

OuesFa commented 3 years ago

Thanks so much @christidis. No, I hadn't seen the post. I don't really understand the `< 0` part or the multiplication by 15. I'm a PromQL newbie :)

christidis commented 3 years ago

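The `< 0` is a fail-safe, I guess: it keeps only the series whose lag decreased over the window, so there is no division by zero. The multiplication by 15 is there because the delta is taken over a 15m window: the lag divided by the lag drained in 15 minutes gives a number of 15-minute windows, and multiplying by 15 converts that to minutes. CC @wulfuric, who originally posted the queries.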

Also note that in the latest exporter version you can use kafka_consumergroup_lag_sum instead of sum(kafka_consumergroup_lag).
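To make the `< 0` filter and the `* 15` factor concrete, here is a rough Python sketch of the same arithmetic. The function name `estimate_catch_up_minutes`, the per-partition dictionaries, and the sample values are all hypothetical. It assumes one lag sample per partition at the start and at the end of the window, whereas PromQL's `delta()` also extrapolates.

```python
from typing import Dict, Optional


def estimate_catch_up_minutes(
    lag_start: Dict[int, float],
    lag_end: Dict[int, float],
    window_minutes: float = 15.0,
) -> Optional[float]:
    """Minutes to drain the current lag at the rate it shrank during the window."""
    # Numerator: current lag summed over all partitions.
    total_lag = sum(lag_end.values())
    # delta(lag[15m]) per partition.
    deltas = [lag_end[p] - lag_start[p] for p in lag_end if p in lag_start]
    # "< 0" keeps only partitions whose lag decreased; "-1 *" makes the sum positive.
    drained = -sum(d for d in deltas if d < 0)
    if drained == 0:
        # Nothing passed the filter: the PromQL query returns no data here.
        return None
    # total_lag / drained is a number of windows; times 15 gives minutes.
    return total_lag / drained * window_minutes


# Hypothetical lag per partition, 15 minutes apart.
lag_start = {0: 900.0, 1: 600.0, 2: 250.0}
lag_end = {0: 600.0, 1: 300.0, 2: 300.0}
# Deltas are -300, -300 and +50, so 600 messages were drained; current lag is 1200.
print(estimate_catch_up_minutes(lag_start, lag_end))  # 30.0
```

Note that partition 2, whose lag grew, still counts in the numerator but is dropped from the denominator, exactly as in the query.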

OuesFa commented 3 years ago

Thanks @christidis. Hmm, I don't really understand @wulfuric's query. What I would like to have is an estimate of the lag as a duration, like this:

[Screenshot 2021-06-11 at 13 13 08]

The first panel is based on the kafka_consumergroup_lag metric:

sum(kafka_consumergroup_lag{consumergroup=~"$consumergroup",topic=~"$topic"}) by (consumergroup, topic)

The second panel is based on my query. It seems to work correctly, but I would like to know if it's accurate:

sum(kafka_consumergroup_lag{consumergroup=~"$consumergroup",topic=~"$topic"}) by (consumergroup, topic) / sum(delta(kafka_consumergroup_current_offset{consumergroup=~"$consumergroup",topic=~"[[topic]]"}[1m])/60) by (consumergroup, topic)

When I try @wulfuric's query, it gives me this:

[Screenshot 2021-06-11 at 13 23 12]

I can't really interpret the result, and I can't explain the query.
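One possible reason the two panels look so different: the two queries answer slightly different questions. Dividing by the change in current_offset uses the consume rate alone, while dividing by the decrease in lag uses the net rate (consumption minus production), since lag is the log end offset minus the consumer offset. The sketch below is only an illustration with hypothetical names and steady, made-up rates.

```python
from typing import Optional


def backlog_seconds(lag: float, consume_rate: float) -> float:
    """Time to process the current backlog if no new messages arrived."""
    return lag / consume_rate


def net_catch_up_seconds(
    lag: float, consume_rate: float, produce_rate: float
) -> Optional[float]:
    """Time until lag reaches zero while producers keep writing."""
    net_rate = consume_rate - produce_rate  # how fast the lag actually shrinks
    if net_rate <= 0:
        return None  # lag is flat or growing, so the consumer never catches up
    return lag / net_rate


# Hypothetical steady state: 6,000 messages behind, consuming 100 msg/s,
# producers writing 80 msg/s.
print(backlog_seconds(6_000, 100))           # 60.0
print(net_catch_up_seconds(6_000, 100, 80))  # 300.0
```

Under these assumptions both numbers are valid; which one counts as accurate depends on whether "lag duration" means the age of the backlog at the current processing speed or the time until the consumer is fully caught up.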

sherifkayad commented 1 year ago

@OuesFa how did you solve that eventually? Did you manage to find out if your query was correct / accurate?

ryandutton commented 1 month ago

It would be great if a metric similar to kafka_consumergroup_group_lag_seconds could be exported.