danielqsj / kafka_exporter

Kafka exporter for Prometheus
Apache License 2.0

Unable to get any kafka_consumergroup metrics from Kafka exporter #409

Closed Mallikarjunradware closed 4 months ago

Mallikarjunradware commented 1 year ago

We are running the Kafka exporter and prometheus-to-sd containers in a single pod on GKE. Until 13th Sep the exporter was working fine, then it suddenly stopped exporting consumer-related metrics (consumer lag, consumer members, etc.).

Results without passing the argument " --group.filter='.+' ":

image

Results after adding the argument " --group.filter='.+' "; the consumer metrics are still missing:

image
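
To make the effect of the --group.filter value concrete, here is a minimal sketch of regex-based group filtering. It assumes the filter is compiled as an unanchored Go regular expression and matched against each group name; matchGroups and the sample group names are hypothetical and not taken from the exporter. It also shows one thing worth ruling out: if the arguments are passed without a shell (for example as a container args list), the single quotes may become part of the pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchGroups returns the group names accepted by the given filter pattern.
// Assumption: the filter is an unanchored Go regular expression.
func matchGroups(pattern string, groups []string) []string {
	re := regexp.MustCompile(pattern)
	var kept []string
	for _, g := range groups {
		if re.MatchString(g) {
			kept = append(kept, g)
		}
	}
	return kept
}

func main() {
	// Hypothetical group names, for illustration only.
	groups := []string{"test-group", "billing", ""}

	// The pattern as intended: any non-empty group name matches.
	fmt.Println(matchGroups(`.+`, groups))

	// The pattern with literal quotes: only names that themselves contain
	// quoted text would match, so ordinary group names are filtered out.
	fmt.Println(matchGroups(`'.+'`, groups))
}
```

If the second case applies, dropping the quotes from the argument would be the thing to try; this is a possibility to check, not a confirmed cause of the issue.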

Please find the exporter details.

Please note, same exporter is working on other 3 kafka clusters and getting lag details.

So far I have taken below actions:

Please help if anyone has come across the same issue and was able to overcome it.

hellorill commented 8 months ago

We have the same problem. Everything worked correctly until the partitions of most topics were rebalanced. It looks like the exporter does not process all of the responses correctly.

https://github.com/danielqsj/kafka_exporter/blob/b66d284be28b53fe37ca472029fefa4a521d9f6e/kafka_exporter.go#L593-L595

If you look at the values of the group.GroupId/group.GroupMembers/group variables, you can see that they may differ between responses. Some of these values in our case:

...
Group id: test-group
Group members: map[]
Full group: &{0 kafka server: Request was for a consumer group that is not coordinated by this broker 16 test-group    map[] 0}
...
Group id: test-group
Group members: map[]
Full group: &{0 kafka server: Request was for a consumer group that is not coordinated by this broker 16 test-group    map[] 0}
...
Group id: test-group
Group members: map[<some valid data>]
Full group: &{0 kafka server: Not an error, why are you printing me? <some valid data>}
...

Most likely this leads to the following errors:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup" value:"test-group"} gauge:{value:8}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup" value:"test-group"} gauge:{value:0}} was collected before with the same name and label values

hellorill commented 8 months ago

Looking in more detail: the GroupDescription structure has Err/ErrorCode fields, but the exporter does not check them for errors in the Kafka response. As a result, the exporter always treats the response as valid, which sometimes leads to collisions in the metric.
https://github.com/danielqsj/kafka_exporter/blob/b66d284be28b53fe37ca472029fefa4a521d9f6e/kafka_exporter.go#L571-L595
https://github.com/danielqsj/kafka_exporter/blob/b66d284be28b53fe37ca472029fefa4a521d9f6e/vendor/github.com/Shopify/sarama/describe_groups_response.go#L78-L81

hellorill commented 8 months ago

Actually, adding the following check resolves the metric error.

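...
        for _, group := range describeGroups.Groups {
            // Skip group descriptions that carry a Kafka error (for example,
            // when this broker is not the coordinator for the group).
            if group.Err != sarama.ErrNoError {
                continue
            }

            offsetFetchRequest := sarama.OffsetFetchRequest{ConsumerGroup: group.GroupId, Version: 1}
            if e.offsetShowAll {
                for topic, partitions := range offset {
...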
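
The effect of that check can be reproduced without a Kafka cluster. The following is a minimal, self-contained sketch, assuming that several brokers each return a description for the same group and that one sample is emitted per description. The types groupDescription and sample and the functions collectMembers and duplicateGroups are hypothetical stand-ins, not the exporter's or sarama's actual API; the values mirror the log output shown earlier in this thread (error code 16 with no members, then a valid response with 8 members).

```go
package main

import "fmt"

// groupDescription is a hypothetical stand-in for one group entry in a
// DescribeGroups response; only the fields needed here are modeled.
type groupDescription struct {
	ErrCode int16 // 0 means no error; 16 is the "not coordinated by this broker" code from the logs
	GroupID string
	Members int
}

// sample is one gauge value labeled by consumer group.
type sample struct {
	Group string
	Value int
}

// collectMembers emits one sample per group description across all broker
// responses. When skipErrors is true, descriptions with a non-zero error
// code are ignored, which is the behavior the check above adds.
func collectMembers(responses [][]groupDescription, skipErrors bool) []sample {
	var out []sample
	for _, groups := range responses {
		for _, g := range groups {
			if skipErrors && g.ErrCode != 0 {
				continue
			}
			out = append(out, sample{Group: g.GroupID, Value: g.Members})
		}
	}
	return out
}

// duplicateGroups returns each group label that appears more than once,
// which is the condition reported as "was collected before with the same
// name and label values".
func duplicateGroups(samples []sample) []string {
	seen := map[string]int{}
	var dups []string
	for _, s := range samples {
		seen[s.Group]++
		if seen[s.Group] == 2 {
			dups = append(dups, s.Group)
		}
	}
	return dups
}

func main() {
	// Three brokers answer for the same group; only the last is its coordinator.
	responses := [][]groupDescription{
		{{ErrCode: 16, GroupID: "test-group", Members: 0}},
		{{ErrCode: 16, GroupID: "test-group", Members: 0}},
		{{ErrCode: 0, GroupID: "test-group", Members: 8}},
	}

	fmt.Println(duplicateGroups(collectMembers(responses, false))) // [test-group]
	fmt.Println(duplicateGroups(collectMembers(responses, true)))  // []
}
```

With the error check enabled, only the coordinator's response contributes a sample, so the group label appears once and carries the real member count.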

danielqsj commented 4 months ago

Closed by https://github.com/danielqsj/kafka_exporter/pull/441

PedroOrona commented 3 weeks ago

We had the same problem, and even after upgrading the exporter to version 1.8.0 (which incorporates the fixes from PR #441) we still did not get the kafka_consumergroup metrics. The temporary fix was to restart the Kafka cluster.

Can someone help identify the real problem here?