danielqsj / kafka_exporter

Kafka exporter for Prometheus
Apache License 2.0
2.09k stars 602 forks source link

Fix Error Handling in Member Assignment to Ensure Seamless Processing #427

Open sciclon2 opened 3 months ago

sciclon2 commented 3 months ago

Summary

This pull request fixes an issue with error handling in member assignment, ensuring that data processing remains uninterrupted and error-free.

The code would fail and skip the pending metrics. For example, if the Confluent Schema Registry created a group named "schema-registry" but used the protocolType "sr", it would cause a silent failure, breaking the loop of groups to scrape the metrics.

danielqsj commented 3 months ago

According to https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-GroupMembershipAPI, the protocol type consume is not mandatory, right ?

danielqsj commented 3 months ago

Consumer groups: Below we define the embedded protocol used by consumer groups. We recommend all consumer implementations follow this format so that tooling will work correctly across all clients.

ProtocolType => "consumer" --
sciclon2 commented 3 months ago

Sure! Here's an improved version of your message:

Hey @danielqsj, you're absolutely right. I realized that I forgot to close this PR as I identified the root cause of the issue. Essentially, the problem lies in the handling of errors when we encounter them in assignment, err := member.GetMemberAssignment(). Currently when it returns an error we are prematurely returning instead of continuing the loop , which leads to potential issues such as skipping metrics randomly.

It's worth noting that errors can indeed arise in this context, as seen with the Confluent Schema Registry scenario I mentioned earlier in this PR summary. To address this, we should replace "return" with "continue" here:

if err != nil {
    klog.Warningf("Cannot get GetMemberAssignment of group member %v: %v", member, err)
    return // It should continue here instead
}

Making this adjustment will ensure that the process continues seamlessly even in the presence of errors, preventing any unexpected disruptions to the flow of metrics and associated functionalities.

We can see the error of sr-1-* members

Screenshot 2024-03-16 at 07 48 51