nishant-dash opened this issue 1 month ago
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5626.
This message was autogenerated
Actually, I've been going over the absent function from Prometheus, and it will essentially only fire when the inner query returns a vector of size 0. So if you do something very generic, it won't help; you need to target specific units with the absent function.
For example: say you have 3 zookeepers and 1 went down for a few minutes at 1pm.
absent(up{juju_application="zookeeper",juju_model="kafka-n"})
won't fire, because at 1pm 2/3 units were up, so the vector returned by the up clause inside absent has size 2 (instead of 3), and the alert won't fire since it's not size 0.
However, this will fire at 1pm, since you go from a vector of size 1 to a vector of size 0:
absent(up{juju_application="zookeeper",juju_model="kafka-n",juju_unit="zookeeper/0"})
Above, by vector I mean the list of metrics.
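For instance, a per-unit rule would look roughly like this (an untested sketch; the model and unit names are just placeholders, and you would need one such rule per unit):

- alert: ZooKeeperUnitMetricsAbsent
  # absent() returns 1 only when the inner selector matches nothing,
  # so it has to target a single unit to be useful here.
  expr: absent(up{juju_application="zookeeper",juju_model="kafka-n",juju_unit="zookeeper/0"})
  for: 5m
  labels:
    severity: warning
  annotations:
    title: ZooKeeper unit metrics absent
    description: "No up metric has been scraped for zookeeper/0 for 5 minutes."

The downside is obvious: a single generic absent() over the whole application never fires while any unit is still being scraped, so you end up maintaining one rule per unit.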
So I'll recap what I understand, and hopefully you can let me know what is being requested here so we can get it resolved.
I believe you are communicating several separate issues:
1. The current rule only checks up == 0, so a completely missing up metric would not trigger it
2. The up alert is flappy and firing too often
3. There is no alert covering the ensemble losing members or quorum
I believe that this Issue pertains to points 1 and 2, but your second comment above pertains to point 3. Is this an accurate summary?
For point 1: Prometheus automatically adds the up metric when scraping from each target, so if there is no up metric at all, then afaik you don't have Prometheus, so I'm not sure absent helps here. I'm not a Prom master so I'm happy to be corrected, but I would have expected up{} == 0 to be fairly standard. Could you detail what behavior you would expect to see that isn't part of the current implementation?
For point 2: upstream ZK updated the docs to set the period to 1m, so we can definitely do that - have opened a quick PR which will be used for the other parts of this Issue also.
For point 3: As you mentioned in the above comment, it's tricky for us to implement a blanket alert, as ZooKeeper can (in theory) have an arbitrary number of server members within the ensemble, so to have an appropriate alert we would need to be aware of the total expected quorum members 'ahead' of time, in order to alert when fewer than (N+1)/2 members are active. If you have a fixed number N of servers deployed, may I suggest implementing your own custom alerts with your desired behavior specific to your deployments? For example, something similar to one of these, where Q is your minimum quorum size:
- alert: ZooKeeperEnsembleAtMinimumServerMembership
  expr: sum(up{juju_application="zookeeper",juju_model="kafka-n"}) == Q
  for: 1m
  labels:
    severity: critical
  annotations:
    title: ZooKeeper ensemble at minimum server membership
    description: "ZooKeeper ensemble at minimum server membership.\n VALUE = {{ $value }}"

- alert: ZooKeeperEnsembleBroken
  expr: sum(up{juju_application="zookeeper",juju_model="kafka-n"}) < Q
  for: 1m
  labels:
    severity: critical
  annotations:
    title: ZooKeeper ensemble broken
    description: "ZooKeeper ensemble broken.\n VALUE = {{ $value }}"
@nishant-dash - What we probably CAN do, come to think of it, is something similar to some of the following:

- sum(up) < zookeeper_QuorumSize as a WARNING - Alert that a ZooKeeper server that WAS part of the ensemble is no longer reachable. Dead unit.
- sum(up) <= (zookeeper_QuorumSize + 1) / 2 as a CRITICAL - Alert that we are about to, or already have, lost quorum.
- zookeeper_QuorumSize < sum(up) as a WARNING - Alert that a ZooKeeper server that IS reachable with an active service is not part of the ensemble. Not joining successfully.

We'd need to validate these, and we can discuss the alert levels and thresholds later, but would something like this address your requirements?
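Spelled out as rules, those could look roughly like the following (an untested sketch - it assumes zookeeper_QuorumSize is actually exported as a gauge carrying the same juju topology labels, and it reuses the placeholder model name from the examples above):

- alert: ZooKeeperEnsembleMemberUnreachable
  # Fewer reachable servers than the ensemble expects - a dead unit.
  expr: sum(up{juju_application="zookeeper",juju_model="kafka-n"}) < max(zookeeper_QuorumSize{juju_application="zookeeper",juju_model="kafka-n"})
  for: 1m
  labels:
    severity: warning
  annotations:
    title: ZooKeeper ensemble member unreachable
    description: "A server that was part of the ensemble is no longer reachable.\n VALUE = {{ $value }}"

- alert: ZooKeeperQuorumAtRisk
  # At or below the minimum number of servers needed to keep quorum.
  expr: sum(up{juju_application="zookeeper",juju_model="kafka-n"}) <= (max(zookeeper_QuorumSize{juju_application="zookeeper",juju_model="kafka-n"}) + 1) / 2
  for: 1m
  labels:
    severity: critical
  annotations:
    title: ZooKeeper quorum at risk or lost
    description: "The ensemble is about to lose, or has already lost, quorum.\n VALUE = {{ $value }}"

- alert: ZooKeeperServerNotJoined
  # More reachable servers than ensemble members - a unit is up but not joined.
  expr: max(zookeeper_QuorumSize{juju_application="zookeeper",juju_model="kafka-n"}) < sum(up{juju_application="zookeeper",juju_model="kafka-n"})
  for: 1m
  labels:
    severity: warning
  annotations:
    title: ZooKeeper server not joined to ensemble
    description: "A reachable server with an active service is not part of the ensemble.\n VALUE = {{ $value }}"

Both sides of each comparison are wrapped in sum()/max() so they end up with no labels and can be compared without explicit vector matching.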
@marcoppenheimer IMO, sum(up) < zookeeper_QuorumSize should be CRITICAL. In all cluster-related alerts I have seen across various other products, it's always a critical alert on reduced cluster size, even if the cluster is functional.
And as such, I personally don't check WARNING actively, so having it set to WARNING will result in me missing it from an alerting standpoint.
Although, how is zookeeper_QuorumSize set? What if a node gets removed and zookeeper_QuorumSize is reduced?
For example, if zookeeper_QuorumSize is 3 and one unit gets removed somehow so it's set to 2, the metric from the removed unit is absent so up won't find it, and you get 2 < 2, which won't fire.
@nishant-dash Quorum size is determined by the configuration file, by the servers configuration block. It stays the same if no unit is added to or removed from the cluster.
Upon my testing, the absent function is not really useful, given that it requires the query to result in an empty set in order to output 1. This is only possible if we target a specific unit/instance or lose all the units/instances. Instead, I found the following discussions:
https://stackoverflow.com/questions/55162188/prometheus-how-up-metrics-works
https://stackoverflow.com/questions/53191746/prometheus-absent-function
https://stackoverflow.com/questions/64049067/how-can-i-find-missing-metrics-in-prometheus-that-have-a-certain-label
However, this problem should be triaged to the monitoring team and a common solution should be devised, rather than trying to solve it for ZooKeeper only. I already talked with @Abuelodelanada, but could not find time to create the issue.
Thank you @acsgn for the research and for looking into this so carefully! I'd say that the modifications proposed in the linked PR are still beneficial and general, although they do not fully address the issue here. So once the PR is merged I'll close this issue, as it is superseded by the issue to be opened for the monitoring stack.
https://github.com/canonical/zookeeper-operator/blob/576dad9beabf44396fd0adb79a28eeb0586888ed/src/alert_rules/prometheus/zookeeper_metrics.rules#L8
only has the up clause, but what if the metric is missing entirely due to any possible problem?
Perhaps something like this might be better?
In fact, we could probably take the up metric == 0 over some time > 0m, since this alert flaps and can get noisy (and have a separate alert rule that checks the avg over time to detect flappiness).
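Something along these lines is what I have in mind (just a sketch, not tested against the charm's actual rule file; the names, durations and thresholds are only illustrative):

- alert: ZooKeeperScrapeTargetDown
  # Only fire once the target has been down continuously for a while,
  # instead of on a single failed scrape.
  expr: up{juju_application="zookeeper"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    title: ZooKeeper scrape target down
    description: "No successful scrape from {{ $labels.juju_unit }} for 5 minutes."

- alert: ZooKeeperScrapeTargetFlapping
  # Catch flappiness: the target was down for part of the window,
  # but never long enough to trip the alert above.
  expr: avg_over_time(up{juju_application="zookeeper"}[15m]) < 0.9
  labels:
    severity: warning
  annotations:
    title: ZooKeeper scrape target flapping
    description: "{{ $labels.juju_unit }} was unreachable for part of the last 15 minutes."

The idea is that the first rule replaces the instant up == 0 check with a sustained one, and the second still surfaces short, repeated drops without paging anyone.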