Open dmazhar-cogniance opened 4 years ago
Hi @dmazhar-cogniance, thanks for reporting. I will try to figure out how to handle this case.
As a workaround, I have used this query for the wrong-leader-count alert: `sum(zk_server_leader * on(zk_host, <other needed labels...>) zk_version) by (<needed labels>) != 1`. It fires if the number of leaders in the ensemble is not equal to 1, and it does not produce a false positive for the network-partitioned node, because that node has no `zk_version` metric. Maybe this will be helpful to anyone who faces the same issue.
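To illustrate why the join in this query drops the partitioned node, here is a minimal Go sketch that mimics the matching on in-memory data. The `sample` type, the `countLeaders` helper, and the host names are hypothetical, and the sketch assumes `zk_version` has the value 1 whenever it is present; it is not how Prometheus evaluates the expression internally.

```go
package main

import "fmt"

// sample is a hypothetical, simplified view of one host's scraped metrics.
type sample struct {
	host       string
	leader     float64 // value of zk_server_leader for this host
	hasVersion bool    // whether a zk_version series exists for this host
}

// countLeaders mimics sum(zk_server_leader * on(zk_host) zk_version).
// A host without a zk_version series has no match on the right-hand side,
// so it drops out of the product and does not contribute to the sum.
// Assumption: zk_version has the value 1 when it is present.
func countLeaders(samples []sample) float64 {
	total := 0.0
	for _, s := range samples {
		if !s.hasVersion {
			continue // no matching zk_version series: excluded from the join
		}
		total += s.leader * 1
	}
	return total
}

func main() {
	ensemble := []sample{
		{"zk-1", 1, true},  // real leader
		{"zk-2", 0, true},  // follower
		{"zk-3", 1, false}, // partitioned node wrongly reported as leader
	}
	fmt.Println(countLeaders(ensemble)) // prints 1, so "!= 1" does not fire
}
```

Without the join, a plain sum over `zk_server_leader` would give 2 for the same data and the alert would fire.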
Hi! Thanks for the exporter :) I have found something that looks like a bug. If a ZooKeeper node is network-partitioned from the quorum, it responds to the `mntr` command with the line `This ZooKeeper instance is not currently serving requests`. This response is processed at https://github.com/dabealu/zookeeper-exporter/blob/1f66c108f74e75f448d61823a841b08421634778/main.go#L60, and the `zk_server_leader` metric for this host is set to 1, so the node is considered a leader when it is not. I assume this special handling was added for the case where ZooKeeper is configured so that the leader node does not serve requests. It looks like there is another edge case to handle, but I am not sure how to distinguish a partitioned node from a leader node that does not serve user requests :(
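One possible direction, sketched below in Go, is to treat the not-serving response as "state unknown" instead of mapping it to leader. This is only an illustration and not the exporter's actual code: `isNotServing` and `leaderFromMntr` are hypothetical helpers, and the sketch assumes a serving node reports its role in a `zk_server_state` line of the `mntr` output.

```go
package main

import (
	"fmt"
	"strings"
)

// notServingLine is the response quoted in the report above.
const notServingLine = "This ZooKeeper instance is not currently serving requests"

// isNotServing reports whether an mntr response is just the not-serving line.
func isNotServing(resp string) bool {
	return strings.TrimSpace(resp) == notServingLine
}

// leaderFromMntr is a hypothetical helper. It returns the leader flag and
// whether the state could be determined at all. A not-serving response
// yields known == false, so a caller can skip the metric for that host.
func leaderFromMntr(resp string) (leader float64, known bool) {
	if isNotServing(resp) {
		return 0, false
	}
	for _, line := range strings.Split(resp, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "zk_server_state" {
			if fields[1] == "leader" {
				return 1, true
			}
			return 0, true
		}
	}
	return 0, false // no zk_server_state line found
}

func main() {
	leader, known := leaderFromMntr(notServingLine + "\n")
	fmt.Println(leader, known)
}
```

This does not resolve the open question: a leader that is configured not to serve requests would also end up as "unknown" here, so telling the two cases apart would still need some other signal.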