NagiosEnterprises / ncpa

Nagios Cross-Platform Agent

Feature Request: Add option to allow the ability to specify the interval value for 'cpu/percent' node #802

Open MrPippin66 opened 3 years ago

MrPippin66 commented 3 years ago

NCPA = 2.3.1

This may become a feature request, but for check_ncpa cpu/percent reporting, it looks like the value reported is the long-term CPU consumption, not the immediate CPU consumption.

Essentially, using this option:

-M cpu/percent -q warning=90,critical=95,aggregate=avg

However, despite there not being a CPU consumption issue going on, this is alerting.

Upon looking into a specific instance, the problem is due to the reporting being based on the total system counters, and not the system as it currently is.

Is there any option to have the reporting be based on the current system consumption, and not on the total metrics of the system's counters?

If not, I'd like to request a feature to report the current consumption, since even after resolving a problem, this is still alerting (for a period of time, until the system counter averages go below the threshold).

MrPippin66 commented 3 years ago

After looking into this issue some more, it is actually giving a relative CPU consumption over a period of time.

But...it's hard coded to do this over a 0.5 second interval.

In psapi.py

cpu_percent = LazyNode('percent', method=lambda: (ps.cpu_percent(interval=0.5, percpu=True), '%'))
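For context, a blocking call like the one above reports the share of non-idle CPU time between two snapshots of the cumulative counters, taken `interval` seconds apart. The following is a simplified sketch of that calculation, not psutil's actual code: the `CpuTimes` type, the `busy_percent` helper, and all counter values are made up for illustration.

```python
from typing import NamedTuple


class CpuTimes(NamedTuple):
    """Cumulative CPU seconds since boot, simplified to two buckets (hypothetical type)."""
    busy: float
    idle: float


def busy_percent(before: CpuTimes, after: CpuTimes) -> float:
    """Percentage of non-idle time between two snapshots of the counters."""
    busy_delta = after.busy - before.busy
    idle_delta = after.idle - before.idle
    total_delta = busy_delta + idle_delta
    if total_delta <= 0:
        # No time elapsed between snapshots, so nothing to report.
        return 0.0
    return round(100.0 * busy_delta / total_delta, 1)


# Made-up snapshots taken 0.5 s apart on one core that was fully busy in between.
t0 = CpuTimes(busy=1000.0, idle=9000.0)
t1 = CpuTimes(busy=1000.5, idle=9000.0)

print(busy_percent(t0, t1))                   # 100.0 (the 0.5 s window only)
print(busy_percent(CpuTimes(0.0, 0.0), t1))   # 10.0  (average since boot)
```

With a window this short, a brief burst fills the whole sample, which is consistent with readings that jump between 0% and 100%.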

If I iterate this over a period of time, I can see this is varying, though in this server's specific case, it's mostly showing 100%, despite other tools showing greater variance.

Example:

Checking every second with "check_ncpa", I get:

--

OK: Percent was 100.00 % | 'percent'=100.00%;;;
OK: Percent was 100.00 % | 'percent'=100.00%;;;
OK: Percent was 0.00 % | 'percent'=0.00%;;;
OK: Percent was 45.10 % | 'percent'=45.10%;;;
OK: Percent was 100.00 % | 'percent'=100.00%;;;
OK: Percent was 100.00 % | 'percent'=100.00%;;;
OK: Percent was 100.00 % | 'percent'=100.00%;;;
OK: Percent was 100.00 % | 'percent'=100.00%;;;
OK: Percent was 16.70 % | 'percent'=16.70%;;;

--

But vmstat, at its lowest interval setting, shows a greater disparity.

(1 second interval)

--

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 7  0 1645312 280936 301824 2304828    0    0    16   168    5    1 47  5 48  0  0
 4  0 1645312 295412 301824 2304876    0    0     0    44 1883 1741 96  4  0  0  0
 0  0 1645312 295480 301828 2304876    0    0     0  1130 1289 2671  8  4 87  1  0
 0  0 1645312 295496 301828 2304880    0    0     0     0 1117 2375  3  3 94  0  0
11  0 1645312 295512 301828 2304880    0    0     0   108 1819 1857 92  2  6  0  0
 4  0 1645312 295480 301832 2304876    0    0     0    60 1846 4972 93  7  0  0  0
 5  0 1645312 295412 301832 2304904    0    0     0     2 1765 6881 93  7  0  0  0
 6  0 1645312 295436 301840 2304900    0    0     0    96 1842 2095 97  3  0  0  0
 3  0 1645312 295360 301852 2304900    0    0     0    76 1930 1819 97  3  0  0  0
 3  0 1645312 295400 301852 2304908    0    0     0    16 1849 1787 97  3  0  0  0

--

I'd like to request an option be added to the CPU monitoring (and this thus far seems specific to cpu/percent) to be able to specify the interval value.
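One way the requested option might look is sketched below. This is not NCPA code. The `interval` request parameter, the `MAX_INTERVAL` cap, and the `parse_interval`, `cpu_percent_value`, and `fake_sampler` names are all assumptions made for illustration. The sampler is passed in so the sketch runs without psutil installed.

```python
from typing import Callable, List, Mapping, Tuple

DEFAULT_INTERVAL = 0.5   # seconds; the value currently hard-coded in psapi.py
MAX_INTERVAL = 10.0      # hypothetical cap so one check cannot block the agent for long


def parse_interval(params: Mapping[str, str]) -> float:
    """Read a hypothetical 'interval' request parameter, falling back to the default."""
    raw = params.get("interval")
    if raw is None:
        return DEFAULT_INTERVAL
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return DEFAULT_INTERVAL
    if not value > 0:  # rejects zero, negatives, and NaN
        return DEFAULT_INTERVAL
    return min(value, MAX_INTERVAL)


def cpu_percent_value(
    params: Mapping[str, str],
    sampler: Callable[..., List[float]],
) -> Tuple[List[float], str]:
    """Call the sampler with the requested interval; returns (values, unit)."""
    return sampler(interval=parse_interval(params), percpu=True), "%"


def fake_sampler(interval: float, percpu: bool) -> List[float]:
    """Stand-in for ps.cpu_percent that returns canned per-CPU values."""
    return [42.0, 58.0]


values, unit = cpu_percent_value({"interval": "2"}, fake_sampler)
print(values, unit)
```

With psutil available, `ps.cpu_percent` could be passed as the sampler in place of `fake_sampler`, since it accepts `interval` and `percpu` keyword arguments. Because that call blocks for the whole interval, a longer sample period also means a slower check response, which is why the sketch caps the value.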

MrPippin66 commented 1 year ago

@sawolf This has been there for almost 2 years with no evaluation. It's highly desirable to have an option to include the sample period.

We have spiky loads that trigger this fairly often, and even requiring multiple checks before alerting still sees this trigger on what's really a false positive.

sawolf commented 1 year ago

> This has been there for almost 2 years with no evaluation

Yeah, that's not very cool of us.

> It's highly desirable to have an option to include the sample period.

Agreed, and that doesn't look like it would be hard to implement. As with your more recent feature requests, I can't guarantee that this will get done; that's mostly up to the current NCPA developers and the development manager.

MrPippin66 commented 1 year ago

Agreed, and understandable. I'm sure there's a very large backlog behind just getting V3 out the door.