NagiosEnterprises / ncpa

Nagios Cross-Platform Agent

CPU load average over time #180

Open kevinjm opened 9 years ago

kevinjm commented 9 years ago

Hi Everyone,

This is more of a feature request than an issue. Would it be possible to average the CPU usage over time similar to what NSClient++ can do?

For example, with NSClient++ we can set the following for a 10-minute average and only alert if the average is over 80 or 90%:

CPULOAD!-l 10,80,90

Could this functionality be added to NCPA?

Thank you!

-Kevin

darthVikes commented 9 years ago

I would like this as well.

jomann09 commented 8 years ago

Adding this to the 2.1.0 milestone since it would be a nice feature for finding out CPU usage on systems ... would have to implement it nicely though ~ not sure how yet.

PhilThurston commented 4 years ago

In its current state the CPU API endpoint for NCPA is problematic. It can give you the current state of the CPU, or even the average CPU utilization at that moment, but it does nothing to tell you about usage over time. The NRPE load average was a more meaningful metric because you could (to an extent) see CPU usage over a length of time, not just at that moment. Spikes of CPU usually aren't cause for concern, but prolonged CPU bottlenecking is, and NCPA in its current state doesn't seem able to track that properly.

We would argue that this feature should be fast-tracked, as it is probably one of the biggest roadblocks to full NCPA adoption.

ericloyd commented 4 years ago

Isn't this easily solvable by using retry intervals or notification delays? The quantization of measurements has nothing to do with Nagios, per se. You could argue that the same problem exists with disk usage or user count or any other metric that returns discrete values.

PhilThurston commented 4 years ago

Great point, but it doesn't hold in all use cases. Memory and disk, for example, are both resources that generally have consistent usage. Once you write the file, it is taking space. Once you are using the memory, it doesn't usually fluctuate too much. CPU usage, though, by its nature fluctuates very frequently, from moment to moment, depending on the program.

The interval checks that NCPA runs just get the CPU usage for that point in time. So if it checks 5 times in 10 minutes and there happens to be a spike to 80% CPU at each of those check points, it would report 80% every time and give the impression that your CPU is CONSTANTLY at 80% usage, when quite possibly it is only spiking and actual usage over that period is only 20%. An average of CPU over time gives a more accurate look at the actual usage of the CPU.

You could technically run checks every second and compute that "average" yourself, but then you're being highly inefficient with your resources, using vastly more on both the monitoring and host servers than you would with, say, a single simple NRPE call.
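
The sampling scenario described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the trace is made up (80% during spikes, 5% otherwise, 24-second spikes that happen to line up with each check), and none of the names come from NCPA.

```python
from statistics import mean

WINDOW = 600       # seconds observed (10 minutes)
CHECK_EVERY = 120  # one point-in-time check every 2 minutes -> 5 checks
SPIKE_LEN = 24     # hypothetical: each spike lasts 24 seconds


def usage_at(second: int) -> float:
    """Hypothetical CPU trace: 80% during a spike, 5% the rest of the time."""
    return 80.0 if second % CHECK_EVERY < SPIKE_LEN else 5.0


# One usage value per second for the whole window.
trace = [usage_at(s) for s in range(WINDOW)]

# What five point-in-time checks would see (each one lands on a spike).
point_in_time = [trace[s] for s in range(0, WINDOW, CHECK_EVERY)]

# What an average over the full window would report.
true_average = mean(trace)

print("point-in-time checks:", point_in_time)
print("average over window: ", true_average)
```

With this made-up trace every check reads 80% while the average over the 10 minutes is 20%.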

ericloyd commented 4 years ago

No matter how fast you measure, you are still subject to measurement quantization. So tell Nagios not to notify unless three checks in a row, separated by 5 minutes each, are at 80%.

Otherwise, you need to artificially gather measurements and create your own average but you're still subject to the same quant problem that plagues all discrete measurements of analog things. This is where Nyquist frequencies come into play in audio, for instance.

petrolej commented 4 years ago

Hi, I agree with @PhilThurston. This is not just a problem with CPU but, as mentioned here, with all discrete performance counters. I also used to use NSClient++, where you could get an average over a period of time. @ericloyd's note is also correct, but I think taking measurements every second is enough for general purposes, whereas every three minutes (for example) is, in my opinion, not enough.

When using NCPA, I created a simple never-ending loop in a PowerShell script on each Windows host, which every second records the value of every performance counter I need to average (CPU, disk IOPS, disk % idle time, network packets/sec...). Then I have other scripts in the NCPA plugin folder, and each script reads an average from the values saved over the last 3 minutes (the default Centreon period). So my Nagios checks do not actually use the built-in NCPA CPU check, but call the custom plugin script instead.
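
The idea of sampling every second and averaging the last 3 minutes could look roughly like the following. This is a Python sketch of the general technique, not the PowerShell scripts described above: `RollingAverage` is a hypothetical helper, it keeps samples in memory rather than in saved files, and the value passed to `add()` is assumed to come from whatever counter you sample (for example `psutil.cpu_percent()`).

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class RollingAverage:
    """Keep (timestamp, value) samples and average those inside a time window."""

    def __init__(self, window_seconds: float = 180.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the class can be tested without sleeping
        self._samples: Deque[Tuple[float, float]] = deque()

    def _evict(self, now: float) -> None:
        # Drop samples that are older than the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def add(self, value: float) -> None:
        """Record one sample, e.g. once per second from a sampling loop."""
        now = self._clock()
        self._samples.append((now, value))
        self._evict(now)

    def average(self) -> Optional[float]:
        """Average of the samples still inside the window, or None if empty."""
        self._evict(self._clock())
        if not self._samples:
            return None
        return sum(v for _, v in self._samples) / len(self._samples)
```

A sampling loop would call `add()` once per second, and the plugin invoked by the check would report `average()`. Across separate processes the samples would have to be persisted somewhere, as in the setup described above.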

BorisTrnka commented 1 year ago

hi, what about: https://psutil.readthedocs.io/en/latest/index.html?#psutil.getloadavg ?
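
For reference, `psutil.getloadavg()` returns the 1, 5 and 15 minute load averages as a tuple, and the psutil documentation suggests dividing by the CPU count to get a percentage-like figure. A minimal sketch of that conversion follows; `load_as_percent` is a hypothetical helper, and the demo at the bottom uses the standard library's `os.getloadavg()` (Unix only) so that it runs without psutil installed. Note that load average counts runnable processes, so it is related to, but not the same as, CPU utilization.

```python
import os
from typing import Sequence, Tuple


def load_as_percent(load: Sequence[float], cpu_count: int) -> Tuple[float, ...]:
    """Express load averages as a percentage of available CPUs.

    `load` is a (1 min, 5 min, 15 min) tuple such as the one returned by
    os.getloadavg() or psutil.getloadavg().
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(round(x / cpu_count * 100, 1) for x in load)


if __name__ == "__main__":
    # os.getloadavg() does not exist on Windows and may raise OSError elsewhere.
    if hasattr(os, "getloadavg"):
        try:
            one, five, fifteen = load_as_percent(os.getloadavg(), os.cpu_count() or 1)
            print(f"load as % of CPUs: 1m={one} 5m={five} 15m={fifteen}")
        except OSError:
            print("load average is not available on this system")
```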