Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

[dev.icinga.com #11900] Question: Possibility to add a calculated metric to perfdata of an existing check? #4260

Closed icinga-migration closed 8 years ago

icinga-migration commented 8 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11900

Created by wwagner on 2016-06-06 08:41:54 +00:00

Assignee: (none) Status: Rejected (closed on 2016-06-13 07:03:21 +00:00) Target Version: (none) Last Update: 2016-06-13 07:03:21 +00:00 (in Redmine)

Backport?: Not yet backported
Include in Changelog: 1

Hi, I am looking for a way to add a calculated metric to the existing checks.

Background: In our environment, we have very different server setups and WARN/CRIT constraints. We are thinking about using normalized time series so that we can compare and aggregate values from different setups. For instance, an OK range of 0..100% between 0 and WARN would represent the projected, normal range, and an additional 100..150% would be the warn range between WARN and CRIT. This would let us interpret values without deeper background knowledge, display summarized load levels, etc. We would generate an additional perfdata value and send it to the TSDB...

Is it possible to intercept and post process the perf_data before the data is sent to the TSDB?

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-06-07 20:06:50 +00:00

No, Icinga 2 does not allow you to manipulate the performance data values sent by the plugin. Wouldn't it be more reasonable to add such a metric calculation directly to the plugin, where the values are present?

icinga-migration commented 8 years ago

Updated by wwagner on 2016-06-08 07:04:05 +00:00

Yes, indeed. The issue I have with changing the plugins is that I would have to change every plugin we use to produce the normalized result. I would end up with x copies of the same code, probably in different languages and inside someone else's code. This is probably too much effort and not feasible in our environment.

It would be so "nice" to be able to do this post-processing in icinga2, right before the data is stored in the TSDB. One spot, one piece of code in one language, independent of the plugins (nagios checks)...

Having additional normalized metrics that represent the projected load (0..WARN) between 0% and 100% and an acceptable overload (WARN..CRIT), for instance between 100% and 150%, is quite interesting if you want to summarize a load level for different monitoring purposes. If you start thinking about the new possibilities, you can find many attractive uses, all from a small change in the right spot.
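The mapping described above can be sketched as a piecewise-linear function. This is only an illustration of the idea, not something Icinga 2 provides: the function name `normalize` and the parameters `warn_pct` and `crit_pct` are hypothetical, and the sketch assumes plain upper-bound thresholds with 0 < WARN < CRIT (no Nagios range syntax, higher values are worse).

```python
def normalize(value, warn, crit, warn_pct=100.0, crit_pct=150.0):
    """Map a raw metric onto a threshold-relative percentage scale.

    0..warn    maps linearly to 0..warn_pct
    warn..crit maps linearly to warn_pct..crit_pct
    Values above crit keep the slope of the warn..crit segment
    (an assumption; the thread does not say what should happen there).
    """
    if not 0 < warn < crit:
        raise ValueError("expected 0 < warn < crit")
    if value <= warn:
        return value / warn * warn_pct
    return warn_pct + (value - warn) / (crit - warn) * (crit_pct - warn_pct)


# Two hosts with different thresholds end up on the same scale.
small_host = normalize(4, warn=8, crit=16)    # half of its WARN threshold
large_host = normalize(16, warn=32, crit=48)  # half of its WARN threshold
```

Both hosts come out at 50% here although their raw load values differ by a factor of four, which is what makes the normalized series comparable across setups.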

icinga-migration commented 8 years ago

Updated by wwagner on 2016-06-08 11:29:40 +00:00

I wonder whether this feature would be best placed in the PerformanceDataWriter(s), controlled by two parameters that define which percentage the value has at the WARN and CRIT thresholds. The disadvantage would be that icinga2 and its plugins could not work with the additional perfdata value. But it would be in the TSDB and could be displayed in embedded graphs and used in dashboards.

icinga-migration commented 8 years ago

Updated by gbeutner on 2016-06-09 10:52:47 +00:00

How about something like this:

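// Proposal only: a handler that receives the checkable and its parsed
// perfdata values before they are passed on to the metric writers.
globals.postprocess_perfdata_http = function(checkable, perfdata) {
  for (pd in perfdata) {
    if (pd.label == "time") {
      pd.value /= 7 // example only: rescale the "time" metric in place
    }
  }
}

object CheckCommand "http" {
  // perfdata_handler is the attribute proposed in this comment
  perfdata_handler = postprocess_perfdata_http
}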

The perfdata_handler function is then called whenever Icinga needs to do something with the performance data values, e.g. for Graphite/OpenTSDB. The only exception would be the PerfdataWriter module - because external addons (in particular PNP4Nagios) expect perfdata values to be unmodified - for example PNP4Nagios chokes on perfdata where keys have been re-ordered.

icinga-migration commented 8 years ago

Updated by gbeutner on 2016-06-09 11:16:00 +00:00

Minor problem: Setting the perfdata_handler attribute for ITL commands is somewhat awkward.

icinga-migration commented 8 years ago

Updated by gbeutner on 2016-06-09 11:46:52 +00:00

feature/perfdata-handler-11900 has a preliminary patch for this.

icinga-migration commented 8 years ago

Updated by wwagner on 2016-06-10 13:05:58 +00:00

Hm. This code example will only work in the future, after a new release, right? I could not get it to work for some reason.

We are not sure yet about the final TSDB, but we think it will be either OpenTSDB or InfluxDB (after InfluxDBWriter is released). As far as I know, in this setup there are separate metrics for the metric value, for the WARN and CRIT thresholds, and even for MIN and MAX if they are delivered in the check results (I took a quick look at opentsdbwriter.cpp). To me, the code example above looks like it replaces the existing value with a calculated one. This is not what I have in mind. I imagine an additional metric, identified for instance as ".norm".
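To make the difference between modifying a value and adding a metric concrete, here is a minimal sketch that leaves every original perfdata entry untouched and appends a `<label>.norm` entry next to it. It is an illustration under assumptions, not an Icinga 2 API: the function name `add_norm_metrics` is hypothetical, labels are assumed to be unquoted and free of spaces, and thresholds are assumed to be plain numbers with 0 < WARN < CRIT. Entries that do not meet these assumptions are passed through without a `.norm` companion.

```python
import re

# Leading number of a perfdata value; whatever follows is the unit (s, %, B, ...).
_VALUE = re.compile(r"-?\d+(?:\.\d+)?")


def _plain_number(text):
    """Return text as a float, or None for empty fields and range thresholds."""
    try:
        return float(text)
    except ValueError:
        return None


def add_norm_metrics(perfdata, warn_pct=100.0, crit_pct=150.0):
    """Append a '<label>.norm' entry after each entry that has WARN and CRIT."""
    tokens = []
    for token in perfdata.split():
        tokens.append(token)  # the original metric is never modified
        label, sep, rest = token.partition("=")
        fields = rest.split(";")  # value[UOM];warn;crit;min;max
        match = _VALUE.match(fields[0])
        warn = _plain_number(fields[1]) if len(fields) > 1 else None
        crit = _plain_number(fields[2]) if len(fields) > 2 else None
        if not sep or match is None or warn is None or crit is None:
            continue
        if not 0 < warn < crit:
            continue
        value = float(match.group())
        if value <= warn:
            norm = value / warn * warn_pct
        else:
            norm = warn_pct + (value - warn) / (crit - warn) * (crit_pct - warn_pct)
        tokens.append(f"{label}.norm={norm:.1f}%;{warn_pct:g};{crit_pct:g}")
    return " ".join(tokens)


result = add_norm_metrics("load1=4;8;16;0 load5=12;8;16;0 users=3")
```

With this input, `load1` and `load5` each gain a `.norm` companion whose own thresholds are always 100 and 150, while `users`, which carries no thresholds, is passed through unchanged.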

I see a lot of useful purposes. We have many servers, most of them with many CPUs and very different features. With this value, we get an "OK_range", the range of normal operation. But we also get a "load" value: 10% is at the lower end and 90% at the upper end of normal operation, independently of which metric we are looking at.

In Grafana we can summarize the time series of the load of all processors in one machine, of the processors of all machines in a load-balanced application setup, etc... We can simply calculate the max CPU usage... We can even summarize different metrics... And we have trends... In our test setup we have integrated Grafana graphs into icingaweb2. One major flaw with this is the inability to control the WARN and CRIT thresholds of generic dashboards from icingaweb2 (because of a missing feature in Grafana). A norm value would help there too, because WARN is, for instance, always at 100% and CRIT at 150%... At the moment you see the history of the value, but to understand it, you have to read the values for CRIT and WARN.

To complete the picture: I plan to include many application checks and application internal time series data into our monitoring...

Sorry for the long text, but I want to make sure you understand me correctly.

Thank you! Wolfram

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-06-10 13:13:00 +00:00

I do understand the requirement, but adding additional metrics to existing check plugin results should be discussed further with more opinions and a wider audience. It also requires a clear-cut design (the ability to add such a metric, configuration, possibilities for calculation, and such).

One of the things that bugs me: if the performance data is not working, or is somehow malformed, where would you start debugging? The plugin script provides the default metrics, and in addition to that you would need to dig into the configuration settings that manipulate the performance metrics.

After all, I still see the plugin as being responsible for providing its metrics, and we shouldn't fiddle with that before carefully evaluating the pros and cons of such an addition. Right now it seems overly complex in the current design.

icinga-migration commented 8 years ago

Updated by gbeutner on 2016-06-13 07:03:21 +00:00

Very well then, closed.