ClusterLabs / ha_cluster_exporter

Prometheus exporter for Pacemaker based Linux HA clusters
Apache License 2.0

Exporter output when pacemaker is down #177

Closed oxedions closed 3 years ago

oxedions commented 3 years ago

Hi !

First, thanks a lot for this HA exporter. We are now using it as a base exporter for our HA. 😊

I have a question about the fact that your service file starts the exporter with a dependency on pacemaker.

https://github.com/ClusterLabs/ha_cluster_exporter/blob/master/ha_cluster_exporter.service

This raises 2 questions on my side:

  1. What happens when pacemaker is down? What does the exporter output then? I am asking because we use this exporter to grab metrics, but we also want to fire alerts when something is failing or down.

  2. Is this dependency on pacemaker needed in the service file? For example, I might want to start the exporter before starting the HA cluster. By default, the HA cluster is not enabled on my cluster (so it does not start at boot), since I prefer it to stay down after a crash rather than restart automatically without investigation. This dependency prevents me from doing that.

With my best regards

Ox

MalloZup commented 3 years ago

Hi @oxedions, nice question and valid remarks :+1:

1) By design, if pacemaker, or the service behind any other collector, is down, ha_cluster_exporter will not gather and expose the corresponding metrics.

See https://github.com/ClusterLabs/ha_cluster_exporter/blob/8b8f9ce126d013aca5e8ea1d4568b3db64d69db8/ha_cluster_exporter.go#L122

We raise an error only if 0 collectors are registered. (@stefanotorresi maybe we could consider not raising an error when we have 0 collectors and just do nothing)
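To illustrate the rule described above, here is a minimal sketch in Go. It is not the exporter's actual code: the `collector` type, the `available` flag, `registerCollectors` and `errNoCollectors` are hypothetical names made up for this example. It only shows the idea that an unavailable component is skipped, and that an error is raised only when nothing at all could be registered.

```go
package main

import (
	"errors"
	"fmt"
)

// collector is a hypothetical stand-in for one of the exporter's collectors.
type collector struct {
	name      string
	available bool // assumption: whether the underlying component could be found
}

// errNoCollectors models the "fail if nothing was registered" rule.
var errNoCollectors = errors.New("no collector could be registered")

// registerCollectors keeps the collectors whose component is available and
// returns an error only when none of them is.
func registerCollectors(candidates []collector) ([]string, error) {
	var registered []string
	for _, c := range candidates {
		if !c.available {
			// A missing component is skipped, not treated as fatal.
			continue
		}
		registered = append(registered, c.name)
	}
	if len(registered) == 0 {
		return nil, errNoCollectors
	}
	return registered, nil
}

func main() {
	// pacemaker is down, corosync is up: registration still succeeds.
	names, err := registerCollectors([]collector{
		{name: "pacemaker", available: false},
		{name: "corosync", available: true},
	})
	fmt.Println(names, err)
}
```

With this shape, a node where pacemaker is stopped but another component is present still gets a running exporter; only the case with no usable component at all ends in an error.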

2) Regarding point 2: basically, if you have other components running before the pacemaker collector is, you should be able to do it.

Also, if node 1 goes down for some reason, the exporter on the 2nd node can currently catch this.

Let me know if it helps.

Dario

diegoakechi commented 3 years ago

@oxedions The exporter itself reports a metric exposing whether each collector could collect its data or not: https://github.com/ClusterLabs/ha_cluster_exporter/blob/master/doc/metrics.md#ha_cluster_scrape_success. Also, if Prometheus cannot scrape a target, for example because the exporter is not reachable, it reports this through a metric called up (see: https://prometheus.io/docs/concepts/jobs_instances/). In both cases, you can create alerts based on these metrics.
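As a rough illustration of reading that metric, the Go sketch below extracts ha_cluster_scrape_success samples from the exporter's text output. It assumes the metric carries a `collector` label, as in the linked metrics documentation; the function name `scrapeSuccess` and the sample input are hypothetical. This is a simplified line matcher, not a full parser of the Prometheus text format (escaped quotes or braces inside label values are not handled).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// scrapeSuccess returns, per "collector" label value, whether the
// ha_cluster_scrape_success sample equals 1.
func scrapeSuccess(exposition string) map[string]bool {
	const metric = "ha_cluster_scrape_success{"
	const key = `collector="`
	result := make(map[string]bool)

	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metric) {
			continue // comments, blank lines and other metrics
		}
		end := strings.Index(line, "}")
		if end < 0 {
			continue
		}
		labels := line[len(metric):end]
		start := strings.Index(labels, key)
		if start < 0 {
			continue
		}
		rest := labels[start+len(key):]
		stop := strings.Index(rest, `"`)
		if stop < 0 {
			continue
		}
		// The first field after the labels is the sample value.
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		result[rest[:stop]] = value == 1
	}
	return result
}

func main() {
	// Hypothetical exporter output with pacemaker stopped.
	sample := `# TYPE ha_cluster_scrape_success gauge
ha_cluster_scrape_success{collector="corosync"} 1
ha_cluster_scrape_success{collector="pacemaker"} 0
`
	fmt.Println(scrapeSuccess(sample))
}
```

In practice the alerting would normally be done in Prometheus itself on these two metrics; the sketch is only meant to show what the exporter output looks like from a consumer's point of view.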

stefanotorresi commented 3 years ago

> (@stefanotorresi maybe we could consider not raising an error when we have 0 collectors and just do nothing)

This is how it was before, and we found that behaviour very unpredictable. If there are no services to introspect, the exporter fails outright so that the up metric reports the failure; there is no point in keeping the exporter running without exporting anything.

oxedions commented 3 years ago

Dear @MalloZup , Dear @diegoakechi , Dear @MalloZup ,

So I can monitor ha_cluster_scrape_success; this is the value I was looking for.

By default, I start the ha_exporter at boot but not corosync/pacemaker, which is why I forked your service file. I do that because, in our configuration, the exporter is expected to run at all times, even if nothing is exported. Having nothing exported (i.e. ha_cluster_scrape_success = 0 while the ha_cluster exporter target is up) is also an interesting state for us: it means the HA cluster is down for some reason but the exporter is still alive, so there is no need to worry about the exporter, only to check HA.
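To make the combination of the two signals concrete, here is a small Go sketch of how I read them. The `state` type, its three values and the `classify` function are hypothetical names for this example only; the inputs are the value of Prometheus's up metric for the exporter target and the value of ha_cluster_scrape_success for one collector.

```go
package main

import "fmt"

// state is a hypothetical label for what the two metrics tell us together.
type state string

const (
	exporterDown state = "exporter unreachable: check the exporter"
	clusterDown  state = "exporter alive, scrape failed: check HA"
	healthy      state = "exporter alive, scrape succeeded"
)

// classify combines the up metric of the exporter target with
// ha_cluster_scrape_success for a given collector.
func classify(up, scrapeSuccess float64) state {
	if up != 1 {
		// Prometheus could not scrape the exporter at all.
		return exporterDown
	}
	if scrapeSuccess != 1 {
		// The exporter answered, but the collector could not gather data.
		return clusterDown
	}
	return healthy
}

func main() {
	// The situation described above: exporter running, pacemaker stopped.
	fmt.Println(classify(1, 0))
}
```

The order of the checks matters: when the target is down, ha_cluster_scrape_success is simply not available, so up has to be looked at first.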

Many thanks for these answers 😊 And many thanks for the exporter and the dashboard.

btravouillon commented 3 years ago

Well, in fact we do start the ha_exporter in a clone resource now.

stefanotorresi commented 3 years ago

I guess I can close this then. :v:

oxedions commented 3 years ago

Yes, thanks a lot ! 😊