Icinga / icinga-powershell-framework

This PowerShell module will allow to fetch data from Windows hosts and use them for inventory and monitoring solutions. Together with the Icinga Web 2 module, a detailed overview of your Windows infrastructure will be drawn.
MIT License
75 stars 33 forks source link

show and consider only the average cpu load #218

Closed bieba closed 3 years ago

bieba commented 3 years ago

Is it possible to consider and show only the average cpu load? AFAIK the check considers always the current and the average value.

Perfdata image Our custom variables: image

We are using the performance cache.

AFAIK the old NSClient++ returned always averaged values and we would like to have this way of monitoring.

LordHepipud commented 3 years ago

Hello,

yes you can do so. You can filter for -Core '_Total'. This will only include the average for the check afterwards.

bieba commented 3 years ago

Hi,

I only want the average. How can I achive that?

[SERVER]: PS C:\Users\USERNAME\Documents> Invoke-IcingaCheckCPU -Critical 90 -Warning 80 -Core '_Total'
[OK] Check package "CPU Load"
| 'core_total_15'=18.351853%;80;90;0;100 'core_total'=16.59005%;80;90;0;100
0
LordHepipud commented 3 years ago

You are getting the total core value only right now. The core_total_15 value is a performance metric coming from the background service daemon. This is the 15m average you configured with Register-IcingaServiceCheck.

To get rid of the 15m average, you can either stop the icingapowershell service, remove the Start-IcingaServiceCheckDaemon with

Unregister-IcingaBackgroundDaemon 'Start-IcingaServiceCheckDaemon'

or unregister Invoke-IcingaCheckCPU from being executed in the background:

Show-IcingaRegisteredServiceChecks;

Unregister-IcingaServiceCheck -ServiceId '<id of the service you want to unregister>'

Afterwards restart the service

Restart-Service icingapowershell

You cannot disable this performance metric from being added automatically, as this is the whole point of the feature. The monitoring only applies to the actual value of _Total, which is in your case 16.59%. The other value is just for writing into performance metrics tools, like Grafana or InfluxDB.

bieba commented 3 years ago

Thanks for explanation, but you got me wrong. I only want the 15m average 18.35% (like the NSClient++ works), I want to get rid of the current value 16.59%. The service is flapping as the load is going up and down but the baseline is good and no hardware amendments have to be made.

LordHepipud commented 3 years ago

Now I understand your issue! Thanks for the explanation. This is currently not possible with the current design on how plugins and the background daemon work. So basically you do not want to monitor the current state of the CPU, but the 15m average.

I will evaluate on how we can implement this in a generic way for the entire solution then.

bieba commented 3 years ago

Exactly. AFAIK NSClient++ measures only the averages for cpu monitoring. I think for network i/o and cpu usage this is an important use case. Thanks in advance for having a look into it.

bieba commented 3 years ago

This is a big issue. We have several servers which flap between OK and CRITICAL soft state because of the spikes while the check is executed. Especially the cpu monitoring creates massive false positives. Framework cache is enabled, without the problem is much bigger.

image

log1-c commented 3 years ago

Just a quick thought: How about adding multiple thresholds for the different average values, like the load checks for linux systems have?

bieba commented 3 years ago

Good thought. Would also be nice, NSClient++ also supports that.

LordHepipud commented 3 years ago

Yes, this idea in general is great. I would however want to provide a generic solution for this, because in case you are having this requirement for multiple plugins, I do not want to add +n arguments.

This is also true if you add custom averages, as you could want the average for 30 minutes - which would require another argument then.

How is the general approach to this? Would it be fine to have one -Warning and one -Critical and define with a special value if you want to compare just the raw value or the averages if possible?

For example something like this:

Invoke-IcingaCheckCPU -Warning '80?a5m' -Critical '90?a10m'

Not sure about the syntax yet, but this could provide maximum flexibility. In case there are no averages available for the provided values, we could either fall back to the the raw value or - more correctly - print an UNKNOWN

bieba commented 3 years ago

The old NSClient++ uses only one warning/critical threshold for all timespans.

Linux plugins would have a separator like ",". For the intervals 1m, 5m, 15m it would be something like this:

Invoke-IcingaCheckCPU -TimeSpan '1m,5m,15m' -Warning '80,70,60' -Critical '90,80,70'

For the 1m interval the warning threshold would be 80% and the critical threshold 90%. 5m --> 70% warn /80% crit 15m --> 60% warn /70% crit

LordHepipud commented 3 years ago

I don't really want to add an array based assignment for every single service check and threshold possible. In my opinion this will ruin the entire experience because then you have to ensure the order on how you add your values is properly.

My suggestion by adding a new format to the Nagios/Icinga Plugin guidelines: Plugin Guidelines

would make much more sense and keeps the handling straight forward, instead of adding another, different method for this.

bieba commented 3 years ago

If you have another way of accomplishing the desired behaviour, it would be fine.

log1-c commented 3 years ago

I don't understand your suggestion, so atm I'm in favor of the proposal of @bieba in his previous post. This is, at least for people monitoring linux systems, a well known way to implement checks.

The -Timespan would be an optional parameter, and only if this is set you'd have to check that -Warning and -Critical have the needed number of values. Also this additional check option would only be necessary for the CPU check imo.

LordHepipud commented 3 years ago

And here is my problem, because there is also feature request #220 open, which runs into the same problem. Therefor having the -Timespan argument is not suitable and I would love to have a generic approach for this. As the Network Interface plugins provides a huge number of threshold values, this would not work.

Therefor I doubt that this is only applying to the CPU plugin but might also for all kind of different plugins. Therefor I would love to have a more generic approach for this instead of hardcoded plugin behavior.

I do agree that this is done on the Linux world and it would make an easier transition. However, we this doesn't mean to look into different solutions to improve the situation and make it more generic and accessible to every plugin someone develops. Because in the end, the less a developer has to are about the implementation, the better it is for everyone.

What about this idea:

Invoke-IcingaCheckCPU -Warning '80?5' -Critical '90?5'
LordHepipud commented 3 years ago

Another idea:

Invoke-IcingaCheckCPU -Warning '80~5' -Critical '90~5'
bieba commented 3 years ago

What does this mean? Only 5min check intervall? How can I check three intervals like 1min, 5min, 15min?

Maybe another idea: Use only one interval and the user has to create 3 checks/services...? In this case a simple interval argument in combination with the existing warning/critical would make sense.

Edit: But personally I would prefer the linux way, e.g. three intervals combined in one call.

LordHepipud commented 3 years ago

It will take the stored average value of 5 minutes and compare it to the threshold 80/90. If it exceeds, it will cause warning/critical. The combination with other default plugin threshold logic should also still apply:

Invoke-IcingaCheckCPU -Warning '80:~5' -Critical '90:~5'
Invoke-IcingaCheckCPU -Warning '70:80~5' -Critical '90:100~5'
Invoke-IcingaCheckCPU -Warning '`@0:80~5' -Critical '`@0:90~5'

The problem I currently see is that we focus way to much on the CPU plugin itself. There are other plugins that I have been asked to implement a similar feature, so the generic approach would be best while also ensuring that the entire plugin logic on how thresholds are handled can be applied for everything.

I'm not sure if adding more complexity for allowing multiple values inside the thresholds is beneficial. Just one example:

Invoke-IcingaCheckCPU -Warning '70:80~1,70:80~5,70:80~10' -Critical '90:100~1,90:100~5,90:100~10'
log1-c commented 3 years ago

Is it possible to consider and show only the average cpu load? AFAIK the check considers always the current and the average value.

Perfdata image Our custom variables: image

We are using the performance cache.

AFAIK the old NSClient++ returned always averaged values and we would like to have this way of monitoring.

One thing that just came to my mind: Wouldn't it be possible to simply only include the average values by using -Core '_total_15' ? For having multiple average values it then would be necessary to be able to use the -Core parameter multiple times or as an array

LordHepipud commented 3 years ago

In general this would be possible, but I do not see the benefit to implement this. This would require to entirely independently rework this one plugin, to resolve a wider feature request.

Allowing more cores to be included for the -Core argument would be an option, but is this actually a use case? From my experience so far, most people only care about the _Total core instead of anything else.

bieba commented 3 years ago

It will take the stored average value of 5 minutes and compare it to the threshold 80/90. If it exceeds, it will cause warning/critical. The combination with other default plugin threshold logic should also still apply:

Invoke-IcingaCheckCPU -Warning '80:~5' -Critical '90:~5'
Invoke-IcingaCheckCPU -Warning '70:80~5' -Critical '90:100~5'
Invoke-IcingaCheckCPU -Warning '`@0:80~5' -Critical '`@0:90~5'

The problem I currently see is that we focus way to much on the CPU plugin itself. There are other plugins that I have been asked to implement a similar feature, so the generic approach would be best while also ensuring that the entire plugin logic on how thresholds are handled can be applied for everything.

I'm not sure if adding more complexity for allowing multiple values inside the thresholds is beneficial. Just one example:

Invoke-IcingaCheckCPU -Warning '70:80~1,70:80~5,70:80~10' -Critical '90:100~1,90:100~5,90:100~10'

I don't like this way, as you are mixing values for two inputs (e.g. "70:80~1"). One the one hand for the warning/critical threshold ("70:80") and on the other hand for the time interval ("1").

I also don't get the point what the difference between

Invoke-IcingaCheckCPU -TimeSpan '1m,5m,15m' -Warning '80,70,60' -Critical '90,80,70'

and

Invoke-IcingaCheckCPU -Warning '70:80~1,70:80~5,70:80~10' -Critical '90:100~1,90:100~5,90:100~10'

should be? In your examples you also want to use multiple values per argument.

LordHepipud commented 3 years ago

I dont want to mix multiple values at all - it was mentioned you want it. I do not want to add multiple thresholds in one argument. The example was just to demonstrate that it looks weird.

I still think the best approach is

Invoke-IcingaCheckCPU -Warning '70:80~1' -Critical '80:90~1'

And yes, I am mixing things together, because this will give the most flexibility.

From my point of view, I do not want to ensure that if a plugin has five different warning and critical thresholds, to end up like this:

Invoke-IcingaCheckNetworkInterface -PacketReceivedDiscardedWarn '100,150,200' -PacketReceivedDiscardedCrit '200,250,300' PacketReceivedDiscardedTimeSpan '1m,5m,15m' -PacketOutboundDiscardedWarn '150,200,300' -PacketOutboundDiscardedCrit '300,400,500' -PacketOutboundDiscardedTimeSpan '1m,5m,15m' ...

Now add the other 10 attributes for Warning and Critical there. You can't tell me, this is a good solution at all.

You even opened a feature request (#220) for measuring the average network i/o.

Now do we want to modify every single plugin manually to enable this feature and add way more complexity for the development process and even the integration into the monitoring by adding plenty of new arguments?

I don't understand, why a generic approach sees so much negative feedback. An approach, which could be used by any plugin and any data point fetched, without having to touch one line of code.

Yes, the idea with the ~ might not be the best and we could maybe think of another symbol for it.

But it's not less stupid than

@30:40

or

~:30

Just my two cents to this.

LordHepipud commented 3 years ago

I had some time now to think about this request and the generic approach for other plugins and had some additional discussions internally.

The prefered solution that will be added with v1.5.0 will be this:

icinga { Invoke-IcingaCheckCPU -Warning 20 -Critical 40 -Core _Total -ThresholdInterval 15m }

[WARNING] CPU Load: [WARNING] Core Total (29,14817700%)
\_ [WARNING] Core Total: 29,14817700% is greater than threshold 20% (15m avg.)
| 'core_total_1'=31.545677%;;;0;100 'core_total_15'=29.148177%;20;40;0;100 'core_total_5'=28.827410%;;;0;100 'core_total_20'=30.032942%;;;0;100 'core_total_3'=27.731669%;;;0;100 'core_total'=33.87817%;;;0;100

We add a generic argument ThresholdInterval which is available for every single check plugin, but does not require any developer action. It will be statically added during runtime and in case set we will use the provided avg. value from the stored cache.

We will not add the possibility for multiple intervals and check thresholds, as this has several problems:

By the end of the week a new branch will appear which will inherit the entire re-work and integration of this feature for public testing.

LordHepipud commented 3 years ago

The PR has been created and is linked to this issue. Plugin configurations are already compiled and available on the plugin repository with the new feature for the ThresholdInterval. Please feel free to test it and give us feedback.

bieba commented 3 years ago

I am fine with the new way.

But as a user I have a problem with some missing feedback, e.g. If you query with a threshold which is not registered by the background service you won't receive any notification about that. I don't know whats going on, if you use a "wrong" (not registered) threshold.

I downloaded both repos.

Register-IcingaBackgroundDaemon -Command 'Start-IcingaServiceCheckDaemon';
Register-IcingaServiceCheck -CheckCommand 'Invoke-IcingaCheckCPU' -Interval 30 -TimeIndexes 1;
Restart-IcingaService 'icingapowershell';

Invoke-IcingaCheckCPU -ThresholdInterval 15m

Output

 Invoke-IcingaCheckCPU
[OK] CPU Load:
| 'core_4'=2.731483%;;;0;100 'core_total'=3.33096%;;;0;100 'core_6'=2.562467%;;;0;100 'core_5'=2.076547%;;;0;100 'core_7'=5.351226%;;;0;100 'core_0'=5.372353%;;;0;100 'core_2'=4.041354%;;;0;100 'core_1'=2.646975%;;;0;100 'core_3'=1.865278%;;;0;100

The threshold unit doesn't seem to play any role.

Maybe I made a mistake?

bieba commented 3 years ago

In the meantime I used the command from the docs and modified the registered background check to use the intervals 1, 3, 5, 15.

LordHepipud commented 3 years ago

If the threshold interval is not present, the output should look like this:

icinga { Invoke-IcingaCheckCPU -Warning 20 -Critical 40 -Core _Total -ThresholdInterval 13m }

[UNKNOWN] CPU Load: [UNKNOWN] Core Total (33,9108400%)
\_ [UNKNOWN] Core Total: The provided time interval "13m" which translates to "13m" in your "-ThresholdInterval" argument does not exist
| 'core_total_1'=31.545677%;;;0;100 'core_total_15'=29.148177%;;;0;100 'core_total_5'=28.827410%;;;0;100 'core_total_20'=30.032942%;;;0;100 'core_total_3'=27.731669%;;;0;100 'core_total'=33.91084%;20;;0;100

In your example it could happen, because there is no -Warning or -Critical assigned. The Threshold argument will simply apply to monitoring itself, but not modify the output if no thresholds are set.

Thats something we could maybe add that always the Nm value is returned instead.

bieba commented 3 years ago

I think we have another issue which came up as I was trying to test the new feature/PR. The folder "cache" didn't has appropriate permissions as the background service (for me its system) wasn't able to write to disk. So the background service wasn't able to push the background settings to disk, Link.

But another thing I am thinking of: The outputed performance data still contains the other time intervals (all I have had registered within the background check). Maybe you @LordHepipud could amend it?

PS C:\Windows\system32> Invoke-IcingaCheckCPU -Core _Total -ThresholdInterval 1m -Warning 80 -Critical 90
[OK] CPU Load:
| 'core_total_5'=20.745036%;;;0;100 'core_total_1'=26.734055%;80;90;0;100 'core_total_15'=18.941266%;;;0;100 'core_total
_3'=24.371054%;;;0;100 'core_total'=24.52437%;;;0;100
0

The check seems to recognize the correct threshold value. Thumbs up!

PS C:\Windows\system32> Invoke-IcingaCheckCPU -Core _Total -ThresholdInterval 5m -Warning 21 -Critical 90
[WARNING] CPU Load: [WARNING] Core Total (21,55183400%)
\_ [WARNING] Core Total: 21,55183400% is greater than threshold 21% (5m avg.)
| 'core_total_5'=21.551834%;21;90;0;100 'core_total_1'=13.243994%;;;0;100 'core_total_15'=18.398668%;;;0;100 'core_total
_3'=20.267918%;;;0;100 'core_total'=45.84883%;;;0;100
1
LordHepipud commented 3 years ago

We discussed this internally and we had two factions:

I honestly don't mind them being collected, because it might still be helpful for performance graphs to have the current value including the other values added, especially when you have only one check for lets say 15m. Then you have the other values as well.

I'm not sure if users would actually add the same check multiple times to get all performance metrics out.

Another idea was to add another generic argument to the check, allowing to configure it. But honestly, this will just crowd the arguments way too much.

The only reasonable solution would be to deal with this by using a Framework configuration which has then to be applied for each plugin.

I'm open to suggestions for this.

bieba commented 3 years ago

I partially agree, as there are might some use cases where it helps you but: If I want to monitor the used free space of drive C:, I don't want to know how the other drives are used. And thats the same thing for the cpu load use case.

If you want to display all metrics, you have to set up a check without a threshold intervall and collect all the desired metrics. AFAIK this should work in the way the plugin is currently designed.

Another idea was to add another generic argument to the check, allowing to configure it. But honestly, this will just crowd the arguments way too much.

I agree.

LordHepipud commented 3 years ago

But this is a different topic. If you filter only for drive C, you only get drive C. If you only filter for core _Total, you only get core _Total. It's just the threshold metrics that are added in addition for the provided filtering.

Let's not focus on CPU, but look on all other plugins: Would you expect to get all other metrics removed from ANY other plugin, once you add -ThresholdInterval 15m for any other value?

bieba commented 3 years ago

Yes, it´s a bit different but simpler to follow.

But my opinion is still the same, I only want to get the metrics I queried for. If I want to get all performance counters I have to omit the -Threshold argument.

LordHepipud commented 3 years ago

Okay, so you will actually add a check to monitor your network interface for all thresholds, and for lets say 15m. You will also check your CPU either without the threshold, in addition to the interval or will add checks for 3m and 5m as well if you want those metrics.

Lets make an example:

To get this example done, you will have to deploy 3 checks for the CPU.

While the last two options are fine for me and as I want them, I doubt you or anyone else will run a third check just to collect metrics for any plugin at all.

Lets say you collect the 20m interval as well. Without the plain configuration of the plugin, you never have this data available until you do so.

Is this really something users would actually do on their environment?

bieba commented 3 years ago

In this example, yes. For me: I would only have one check (15m) .

I am also fine with including them in every check. But if you don't want the other metrics displayed you have to set up a separate filter in grafana. This is the drawback of this solution.

What do the others think? @log1-c

bieba commented 3 years ago

Another idea was to add another generic argument to the check, allowing to configure it. But honestly, this will just crowd the arguments way too much.

You also did this for -ThresholdInterval, maybe this is the compromise?

The only reasonable solution would be to deal with this by using a Framework configuration which has then to be applied for each plugin.

You know that this would also be fine in 95% of all cases, but the one... :-)

Btw: Thanks for your good job @LordHepipud.

LordHepipud commented 3 years ago

Thank you very much. The issue with "lets configure the plugin output for specific plugins differently" is, that this "one" exception is fine. And then many more will follow and noone understands them anymore when there are 20 configuration flags inside the framework 🤣

The release is planed next week as long as there are no critical issues left submitted by the community or following internal tests.

If we don't have an agreement on it now, we could leave it as it is and tune it with the next release even more by collecting more user feedback.

log1-c commented 3 years ago

Sorry for not being that active in here. We are currently facing other, more pressing issues with the framework, as it(or better the PS checks) is generating way more CPU load than anticipated. On some VMs (4 cores) the framework checks have CPU spikes of up to a 100% and thus generate an avg CPU load of 10-30%. Iirc there already is an issue for that. I'll try updating it with our findings/analysis, once I had a good overview.

As for the CPU check, I don't mind if the check collects the "not needed" perfdata. If someone does not want them visualized they could exclude them via Grafana or Graphite, if I am not mistaken.

LordHepipud commented 3 years ago

Yes you are correct. We will move forward with this approach then. Lets see how the feedback is from a wider range.

Thanks everyone!