dabealu / zookeeper-exporter

zookeeper exporter for prometheus
MIT License

Wrong zk_server_leader value for network partitioned node #19

Open dmazhar-cogniance opened 4 years ago

dmazhar-cogniance commented 4 years ago

Hi! Thanks for the exporter :) I have found something that looks like a bug. If a zk node is network-partitioned from the quorum, it responds to the mntr command with the line "This ZooKeeper instance is not currently serving requests". This response is handled at https://github.com/dabealu/zookeeper-exporter/blob/1f66c108f74e75f448d61823a841b08421634778/main.go#L60, and the zk_server_leader metric for this host is set to 1, so the node is reported as a leader when it is not. I assume this special handling was added for the case where ZooKeeper is configured so that the leader node does not serve client requests. It looks like there is another edge case to handle here, but I am not sure how to distinguish a partitioned node from a leader node that does not serve client requests :(
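To make the edge case concrete, here is a minimal sketch in Go of one way the "not serving" reply could be kept separate from a confirmed leader. It is not the exporter's actual code: the helper name leaderValue and its (value, known) return shape are hypothetical, and it assumes the usual mntr output of one tab-separated key/value pair per line, including a zk_server_state line.

```go
package main

import (
	"fmt"
	"strings"
)

// notServing is the reply quoted in this issue.
const notServing = "This ZooKeeper instance is not currently serving requests"

// leaderValue is a hypothetical helper, not part of zookeeper-exporter.
// It returns the value for zk_server_leader and whether that value is
// actually known from the mntr output.
func leaderValue(mntr string) (value int, known bool) {
	lines := strings.Split(strings.TrimSpace(mntr), "\n")

	// A node that is not serving requests gives no role information,
	// so report "unknown" instead of assuming it is the leader.
	if strings.TrimSpace(lines[0]) == notServing {
		return 0, false
	}

	for _, line := range lines {
		fields := strings.Fields(line) // mntr separates key and value with a tab
		if len(fields) == 2 && fields[0] == "zk_server_state" {
			if fields[1] == "leader" {
				return 1, true
			}
			return 0, true
		}
	}
	return 0, false // no zk_server_state line found
}

func main() {
	v, ok := leaderValue(notServing + "\n")
	fmt.Println(v, ok)

	v, ok = leaderValue("zk_version\t3.4.13\nzk_server_state\tleader\n")
	fmt.Println(v, ok)
}
```

With this shape, a caller could skip emitting zk_server_leader when known is false. It still does not tell a partitioned node apart from a leader that does not serve client requests; it only avoids asserting leadership without evidence.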

dabealu commented 4 years ago

Hi @dmazhar-cogniance, thanks for reporting. I will try to figure out how to handle this case.

dmazhar-cogniance commented 4 years ago

As a workaround, I use this query for the wrong-leader-count alert: sum(zk_server_leader * on(zk_host, <other needed labels...>) zk_version) by (<needed labels>) != 1. It fires when the number of leaders in the ensemble is not equal to 1. It does not produce a false positive for the network-partitioned node, because that node has no zk_version metric. Maybe this will help anyone who runs into the same issue.
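The reason the query ignores the partitioned node can be sketched in Go by imitating the one-to-one match on zk_host: a host contributes to the sum only if it has both series. This is an illustration, not PromQL itself. The function name leaderCount, the host names, and the sample values are made up, and it assumes zk_version is exposed with the value 1.

```go
package main

import "fmt"

// leaderCount imitates sum(zk_server_leader * on(zk_host) zk_version):
// hosts missing from either map are dropped, as in a one-to-one vector match.
func leaderCount(leader, version map[string]float64) float64 {
	var sum float64
	for host, l := range leader {
		v, ok := version[host]
		if !ok {
			continue // no zk_version series for this host, so it is excluded
		}
		sum += l * v
	}
	return sum
}

func main() {
	// zk3 is partitioned: it is wrongly reported as leader
	// and exposes no zk_version series.
	leader := map[string]float64{"zk1": 1, "zk2": 0, "zk3": 1}
	version := map[string]float64{"zk1": 1, "zk2": 1}

	count := leaderCount(leader, version)
	fmt.Println(count != 1) // prints "false": the alert condition is not met
}
```

Without the multiplication, the plain sum of zk_server_leader in this example would be 2 and the alert would fire.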