Closed box293 closed 1 month ago
Take a look at http://man7.org/linux/man-pages/man3/getloadavg.3.html and http://man7.org/linux/man-pages/man1/uptime.1.html. The check_load plugin uses getloadavg and thus uses the same numbers: https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_load.c#L329
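For anyone who wants to see the same three numbers without reading the C source, here is a minimal Python sketch. It assumes a Unix-like system where Python's os.getloadavg() is available (it wraps the same getloadavg(3) call). The helper names read_load_averages and format_like_uptime are made up for illustration and are not part of any Nagios plugin.

```python
import os


def read_load_averages():
    """Return the (1-minute, 5-minute, 15-minute) load averages."""
    # os.getloadavg() wraps the C library's getloadavg(3) and raises
    # OSError if the load average cannot be obtained on this platform.
    return os.getloadavg()


def format_like_uptime(loads):
    """Format three load averages roughly the way 'uptime' prints them."""
    return "load average: " + ", ".join(f"{value:.2f}" for value in loads)
```

Calling format_like_uptime(read_load_averages()) should give numbers very close to what 'uptime' reports at the same moment.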
1.0 represents 100% of 1 CPU.
So .30 represents 30% of 1 CPU.
Using the -r option, you specify load in terms of 1 CPU, even on a multi-CPU system.
So with the following: command[check_load]=@pluginsdir@/check_load -r -w .15,.10,.05 -c .30,.25,.20
You'd get warnings if you have a 1 CPU system with 15% load at 1 minute, 10% load at 5 minutes, and 5% load at 15 minutes. If you ran "uptime" at the same time Nagios did a check, it would say .15,.10,.05.
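To make the threshold logic concrete, here is a rough sketch of how the three load averages could be compared against the -w and -c triples. This is not the plugin's actual code. It assumes a strict greater-than comparison and that exceeding any one of the three thresholds is enough to change the state; the function name check_load_state is hypothetical.

```python
def check_load_state(loads, warn, crit):
    """Classify (1, 5, 15 minute) loads against warning/critical triples.

    Assumption: a threshold is breached only when the load is strictly
    greater than it, and one breached threshold is enough.
    """
    state = "OK"
    for load, warn_limit, crit_limit in zip(loads, warn, crit):
        if load > crit_limit:
            # Critical wins outright, no need to look at the rest.
            return "CRITICAL"
        if load > warn_limit:
            state = "WARNING"
    return state


warn = (0.15, 0.10, 0.05)
crit = (0.30, 0.25, 0.20)
print(check_load_state((0.16, 0.05, 0.01), warn, crit))  # WARNING
```

With these sample thresholds, a 1 CPU system only has to be slightly busy for a minute to trigger a warning.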
If you had a 2 CPU system with 15% load on each CPU, you'd have a total load of 30% CPU usage, which in a tool like 'uptime' would show up as .3.
I'm looking at an 8 CPU system right now and 'uptime' says "load average: 1.38, 1.67, 1.83".
That means that over the last minute, 138% of 800% CPU was being used (or 1.38 of 8 in decimal notation, which is what uptime, getloadavg, and Nagios's check_load plugin use).
If you specified the following: command[check_load]=@pluginsdir@/check_load -r -w 15.0,10.0,5.0 -c 30.0,25.0,20.0
Your system would (hopefully) never warn, because you'd be saying you want warnings at 1500% usage of 1 CPU at 1 minute, 1000% usage at 5 minutes, and 500% usage at 15 minutes.
In terms of 'uptime', on a 1 CPU system it would warn when you saw something like this: load average: 15.0, 10.0, 5.0. (With -r on a multi-CPU system, the 'uptime' numbers would have to be that many times higher again.) If you run 'uptime' or 'top' on one of your systems now, you probably won't see that. Or if you do... you need to upgrade your system because that is a really, really high load.
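Since -r works in terms of 1 CPU, a per-CPU threshold corresponds to a raw 'uptime' value that grows with the number of CPUs. A small sketch of that arithmetic follows. The helper name raw_load_for_thresholds is made up, and it assumes -r simply divides the load by the CPU count.

```python
def raw_load_for_thresholds(per_cpu_thresholds, cpu_count):
    """Return the raw 'uptime' loads at which per-CPU thresholds are reached."""
    # Assumption: -r divides each load average by cpu_count, so the raw
    # value that reaches a threshold is threshold * cpu_count.
    return tuple(limit * cpu_count for limit in per_cpu_thresholds)


print(raw_load_for_thresholds((15.0, 10.0, 5.0), 1))  # (15.0, 10.0, 5.0)
print(raw_load_for_thresholds((15.0, 10.0, 5.0), 8))  # (120.0, 80.0, 40.0)
```

Either way, those are loads that a healthy system should essentially never reach.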
That said... the numbers in the sample "-r -w .15,.10,.05 -c .30,.25,.20" might be too low.
If I look at that 8 core system again: "load average: 1.38, 1.67, 1.83" is based on 8 cores. If we divide by 8... we'd get "0.17,.21,.23".
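The division above is easy to reproduce. This is just the arithmetic, written as a hypothetical helper (per_cpu_load is not a real plugin function), with the 8 core count taken from the example system:

```python
def per_cpu_load(loads, cpu_count):
    """Scale raw load averages down to a per-CPU figure."""
    return tuple(load / cpu_count for load in loads)


scaled = per_cpu_load((1.38, 1.67, 1.83), 8)
print(", ".join(f"{value:.2f}" for value in scaled))  # 0.17, 0.21, 0.23
```

On a live system you could pass os.getloadavg() and os.cpu_count() instead of the hard-coded values.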
It is a somewhat busy system but there is still tonnes of CPU power left, so I don't know. I'm not a full-time sysadmin; I'm a devops/jack-of-all-trades.
My original commit was made when I was testing my Nagios config and I realized that the sample "command[check_load]=@pluginsdir@/check_load -w 15,10,5 -c 30,25,20" was not doing what it looked like it was doing.
Depending on your version of 'top', you can press 1 on your keyboard to get an overview of all your CPUs at the same time. Use 'd' to change the delay to something approaching real time and you can get a good sense of where 'uptime'/'getloadavg' are getting their aggregate scores.
That is all excellent information, and it makes sense.
What is required is better documentation on thresholds for the different plugins, and including this information helps greatly. This is something I plan on publishing in the Nagios Support Knowledgebase in the near future.
As for the example thresholds, I'll leave that up to the devs to decide if we should leave them as they are.
Sounds good to me. I was just looking to see if John or I added any extra documentation at the time, but it doesn't look like it.
Cheers for planning on publishing details about the thresholds!
Here is the KB article I've created on this topic; it links back to here.
I agree with box293, they are too low. The old ones:
check_load -w 15,10,5 -c 30,25,20
were too high!
I think good values could be these:
check_load -r -w .8,.6,.5 -c .9,.7,.6
It's just a sample-config file, but having saner sample values does sound reasonable. Send a pull request?
In the nrpe/sample-config/nrpe.cfg.in file there is a check_load command:
Perhaps the thresholds are a little low. How about:
@minusdavid Do you have any comments? These were set as part of https://github.com/NagiosEnterprises/nrpe/commit/2c935fb2b22dde4014cbdfbb4e0ac367f71831c4