canonical / zookeeper-operator

Source for Zookeeper VM Charm
Apache License 2.0

`zookeeper missing` alert rule doesn't handle absence of metric #164

Open · nishant-dash opened this issue 2 weeks ago

nishant-dash commented 2 weeks ago

https://github.com/canonical/zookeeper-operator/blob/576dad9beabf44396fd0adb79a28eeb0586888ed/src/alert_rules/prometheus/zookeeper_metrics.rules#L8

The rule only has the up == 0 clause, but what if the metric is missing entirely due to some other problem?

Perhaps something like this might be better?

- alert: ZooKeeper Missing
  expr: up{juju_charm!=".*"} == 0 or (absent_over_time(up{juju_charm!=".*"}[15m]))
  for: 0m

In fact, we could probably require the up metric to be == 0 over some time window > 0m, since if this alert flaps it can get noisy (and have a separate alert rule that checks the average over time to detect flappiness).
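For illustration only, a separate flappiness rule could average up over a longer window; the alert name, window, threshold, and severity below are placeholders rather than anything taken from the charm:

- alert: ZooKeeper target flapping
  expr: avg_over_time(up{juju_charm!=".*"}[1h]) < 0.9
  for: 0m
  labels:
    severity: warning
  annotations:
    title: ZooKeeper target flapping
    description: "Target has been down for more than 10% of the last hour.\n VALUE = {{ $value }}"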

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5626.

This message was autogenerated

nishant-dash commented 2 weeks ago

Actually, I've been going over the absent function in Prometheus, and it essentially only fires on a vector of size 0. So if you do something very generic, it won't help; you need to target specific units with absent.

For example, say you have 3 ZooKeeper units and one went down for a few minutes at 1pm.

absent(up{juju_application="zookeeper",juju_model="kafka-n"})

won't fire, because at 1pm 2 of the 3 units were still up, so the vector returned by the up selector inside absent has size 2 (instead of 3), and absent won't fire since the vector is not of size 0.

However, the following will fire at 1pm, since the vector for that specific unit goes from size 1 to size 0:

absent(up{juju_application="zookeeper",juju_model="kafka-n",juju_unit="zookeeper/0"})

Above, by vector I mean the list of time series returned by the selector.
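To make this concrete, a per-unit approach would need one rule per expected unit. A minimal sketch, assuming the three hypothetical units zookeeper/0 through zookeeper/2 in the kafka-n model from the example above (the for: duration is also a placeholder):

- alert: ZooKeeper unit 0 metrics absent
  expr: absent(up{juju_application="zookeeper",juju_model="kafka-n",juju_unit="zookeeper/0"})
  for: 5m
- alert: ZooKeeper unit 1 metrics absent
  expr: absent(up{juju_application="zookeeper",juju_model="kafka-n",juju_unit="zookeeper/1"})
  for: 5m
- alert: ZooKeeper unit 2 metrics absent
  expr: absent(up{juju_application="zookeeper",juju_model="kafka-n",juju_unit="zookeeper/2"})
  for: 5m

This obviously does not scale automatically as units are added or removed, which is part of why a generic solution is tricky, as discussed below.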

marcoppenheimer commented 1 week ago

So I'll replay what I understand, and hopefully you can let me know what is being requested here so we can get it resolved.

I believe you are communicating several separate issues:

  1. You are finding that you are not getting alerts when you believe a ZooKeeper unit to be down
  2. You are finding that the up alert is flappy and firing too often
  3. There is no metric for alerting when the ZooKeeper ensemble (aka all clustered Units in the Application) is close to losing its quorum

I believe that this Issue pertains to points 1 and 2, while your second comment above pertains to point 3. Is this an accurate summary?

For point 1: Prometheus automatically adds the up metric when scraping each target, so if there is no up metric at all, then afaik you don't have Prometheus, and I'm not sure absent helps here. I'm not a Prom master, so I'm happy to be corrected, but I would have expected up{} == 0 to be fairly standard. Could you detail what behavior you would expect to see that isn't part of the current implementation?

For point 2: upstream ZooKeeper updated the docs to set the period to 1m, so we can definitely do that. I have opened a quick PR, which will also be used for the other parts of this Issue.
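Purely as an illustration, combining the standard up == 0 check from point 1 with a non-zero for: duration for point 2 could look like the sketch below; the 1m value mirrors the period mentioned above but is otherwise a placeholder, and is not necessarily what the PR does:

- alert: ZooKeeper Missing
  expr: up{juju_charm!=".*"} == 0
  for: 1m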

For point 3: As you mentioned in the comment above, it's tricky for us to implement a blanket rule, as ZooKeeper can (in theory) have an arbitrary number of server members within the ensemble. To have an appropriate alert, we would need to know the total expected number of quorum members ahead of time, in order to alert when fewer than (N+1)/2 members are active. If you have a fixed number N of servers deployed, may I suggest implementing your own custom alerts, with your desired behavior, specific to your deployments? For example, something similar to one of these, where Q is your minimum quorum size:

- alert: ZooKeeper ensemble at minimum server membership
  expr: sum(up{juju_application="zookeeper",juju_model="kafka-n"}) == Q
  for: 1m
  labels:
    severity: critical
  annotations:
    title: ZooKeeper ensemble at minimum server membership
    description: "ZooKeeper ensemble at minimum server membership.\n VALUE = {{ $value }}"
- alert: ZooKeeper ensemble broken
  expr: sum(up{juju_application="zookeeper",juju_model="kafka-n"}) < Q
  for: 1m
  labels:
    severity: critical
  annotations:
    title: ZooKeeper ensemble broken
    description: "ZooKeeper ensemble broken.\n VALUE = {{ $value }}"
marcoppenheimer commented 1 week ago

@nishant-dash - Come to think of it, what we probably CAN do is something similar to the following:

We'd need to validate these, and we can discuss the alert levels and thresholds later, but would something like this address your requirements?
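The rules referenced here are not reproduced in this thread; judging from the follow-up comments, they presumably compare the number of live units against ZooKeeper's reported quorum size. A rough, unvalidated sketch of that idea, where the min() aggregation, the labels, and the warning severity are placeholders and zookeeper_QuorumSize is the metric name used later in this discussion:

- alert: ZooKeeper ensemble below quorum size
  expr: sum(up{juju_application="zookeeper"}) < min(zookeeper_QuorumSize{juju_application="zookeeper"})
  for: 1m
  labels:
    severity: warning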

nishant-dash commented 1 week ago

@marcoppenheimer IMO, sum(up) < zookeeper_QuorumSize should be CRITICAL. In every cluster-related alert I have seen across other products, reduced cluster size is always a critical alert, even if the cluster is still functional. And I personally don't actively check WARNING, so having this set to WARNING would mean I miss it from an alerting standpoint.

nishant-dash commented 1 week ago

Although, how is zookeeper_QuorumSize set? What if a node gets removed and zookeeper_QuorumSize is reduced? For example, if zookeeper_QuorumSize is 3 and one unit gets removed somehow, so it's set to 2, then the metric from that unit is absent and up won't find it, and you get 2 < 2, which won't fire.

acsgn commented 2 days ago

@nishant-dash Quorum size is determined by the configuration file, via the servers configuration block. It stays the same as long as no unit is added to or removed from the cluster. In my testing, the absent function is not really useful, given that it requires the query to return an empty set in order to output 1; this is only possible if we target a specific unit/instance or lose all of the units/instances. I instead found the following discussions:

https://stackoverflow.com/questions/55162188/prometheus-how-up-metrics-works
https://stackoverflow.com/questions/53191746/prometheus-absent-function
https://stackoverflow.com/questions/64049067/how-can-i-find-missing-metrics-in-prometheus-that-have-a-certain-label

However, this problem should be triaged to the monitoring team and a common solution devised, rather than trying to solve it for ZooKeeper only. I already talked with @Abuelodelanada, but could not find time to create the issue.
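As a further illustration of the missing-metric angle (a generic PromQL pattern, not something taken from the charm or necessarily from the links above): series that were present recently but have since disappeared can be detected by comparing the selector against an offset copy of itself. The 1h offset and 5m duration are placeholders:

- alert: ZooKeeper unit stopped reporting
  expr: up{juju_application="zookeeper"} offset 1h unless up{juju_application="zookeeper"}
  for: 5m

Note this only catches a series vanishing entirely (e.g. the target dropping out of the scrape configuration); a unit that is still scraped but down keeps an up series with value 0 and is covered by the up == 0 rule instead.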

deusebio commented 2 days ago

Thank you @acsgn for the research and for looking into this so carefully! I'd say that the modifications proposed in the linked PR are still rather beneficial and general, although they do not fully address the issue here. So once the PR is merged I'll close this issue as no longer relevant here, superseded by the issue to be opened against the monitoring stack.