NagiosEnterprises / nrpe

NRPE Agent
GNU General Public License v2.0

nrpe.cfg.in thresholds for check_load seem too small #171

Closed: box293 closed 1 month ago

box293 commented 7 years ago

In the nrpe/sample-config/nrpe.cfg.in file there is a check_load command:

command[check_load]=@pluginsdir@/check_load -r -w .15,.10,.05 -c .30,.25,.20

Perhaps the thresholds are a little low. How about:

command[check_load]=@pluginsdir@/check_load -r -w 15.0,10.0,5.0 -c 30.0,25.0,20.0

@minusdavid Do you have any comments? These were set as part of https://github.com/NagiosEnterprises/nrpe/commit/2c935fb2b22dde4014cbdfbb4e0ac367f71831c4

minusdavid commented 7 years ago

Take a look at http://man7.org/linux/man-pages/man3/getloadavg.3.html and http://man7.org/linux/man-pages/man1/uptime.1.html. The check_load plugin uses getloadavg and thus uses the same numbers: https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_load.c#L329

1.0 represents 100% of 1 CPU.

So .30 represents 30% of 1 CPU.

Using the -r option, you specify load in terms of 1 CPU, even on a multi-CPU system.

So with the following: command[check_load]=@pluginsdir@/check_load -r -w .15,.10,.05 -c .30,.25,.20

You'd get warnings if you have a 1 CPU system with 15% load at 1 minute, 10% load at 5 minutes, and 5% load at 15 minutes. If you ran "uptime" at the same time Nagios did a check, it would say .15,.10,.05.

If you had a 2 CPU system with 15% load on each CPU, you'd have a total load of 30% CPU usage, which in a tool like 'uptime' would show up as .3. With -r, check_load divides that by the 2 CPUs and compares .15 against the thresholds.

I'm looking at an 8 CPU system right now and 'uptime' says "load average: 1.38, 1.67, 1.83".

That means that, averaged over the last minute, 138% of the available 800% CPU was being used (or 1.38 of 8 in decimal notation, which is what uptime, getloadavg and Nagios's check_load plugin use).

If you specified the following: command[check_load]=@pluginsdir@/check_load -r -w 15.0,10.0,5.0 -c 30.0,25.0,20.0

Your system would (hopefully) never warn, because you'd be saying you want warnings at 1500% usage of 1 CPU at 1 minute, 1000% usage at 5 minutes, and 500% usage at 15 minutes.

In terms of 'uptime' on a 1 CPU system, it would warn when you saw something like this: load average: 15.0, 10.0, 5.0. If you run 'uptime' or 'top' on one of your systems now, you probably won't see that. If you do, you need to upgrade your system, because that is a very high load.
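
To make the comparison above concrete, here is a minimal Python sketch of how the thresholds relate to the load figures. It is not the plugin's code: `evaluate_load` is a hypothetical helper. It assumes that a single average exceeding its threshold is enough to change the state, and that `-r` simply divides each average by the CPU count.

```python
def evaluate_load(load, warn, crit, ncpu=1, per_cpu=False):
    """Return "OK", "WARNING" or "CRITICAL" for three load averages.

    Hypothetical helper, not check_load itself. Assumes a state is raised
    as soon as any one of the 1/5/15-minute averages exceeds its threshold.
    """
    if per_cpu:
        # Mirrors the idea behind -r: scale the load down to a single CPU.
        load = [value / ncpu for value in load]
    if any(value > limit for value, limit in zip(load, crit)):
        return "CRITICAL"
    if any(value > limit for value, limit in zip(load, warn)):
        return "WARNING"
    return "OK"


warn = (15.0, 10.0, 5.0)
crit = (30.0, 25.0, 20.0)

# The 8 CPU 'uptime' figures quoted above: nowhere near the thresholds.
print(evaluate_load((1.38, 1.67, 1.83), warn, crit, ncpu=8, per_cpu=True))

# An invented, extremely loaded 8 CPU host: 121.0 / 8 is just over 15.0.
print(evaluate_load((121.0, 60.0, 30.0), warn, crit, ncpu=8, per_cpu=True))
```

Under these assumptions, the raw 1-minute load on an 8 CPU machine would have to pass 120 before the 15.0 warning threshold is reached.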

minusdavid commented 7 years ago

That said... the numbers in the sample "-r -w .15,.10,.05 -c .30,.25,.20" might be too low.

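If I look at that 8 core system again: "load average: 1.38, 1.67, 1.83" is based on 8 cores. If we divide by 8, we get "0.17, 0.21, 0.23". All three figures are above the sample warning thresholds, and the 15-minute figure is above the sample critical threshold of .20.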
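
The division above can be reproduced with a few lines of Python. This is only a sketch: `per_cpu_load` is a hypothetical helper name, and the live reading relies on `os.getloadavg()`, which is only available on Unix-like systems.

```python
import os


def per_cpu_load(load=None, ncpu=None):
    """Divide the 1/5/15-minute load averages by the CPU count."""
    if load is None:
        load = os.getloadavg()      # the same three numbers 'uptime' prints
    if ncpu is None:
        ncpu = os.cpu_count() or 1  # cpu_count() can return None
    return tuple(round(value / ncpu, 2) for value in load)


# The figures from the 8 core system discussed above.
print(per_cpu_load((1.38, 1.67, 1.83), ncpu=8))  # (0.17, 0.21, 0.23)

# Live figures for the current host, where the platform supports it.
if hasattr(os, "getloadavg"):
    print(per_cpu_load())
```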

It is a somewhat busy system but there is still tonnes of CPU power left, so I don't know. I'm not a full-time sysadmin; I'm a devops/jack-of-all-trades.

My original commit was made when I was testing my Nagios config and I realized that the sample "command[check_load]=@pluginsdir@/check_load -w 15,10,5 -c 30,25,20" was not doing what it looked like it was doing.

minusdavid commented 7 years ago

Depending on your version of 'top', you can press 1 on your keyboard to get an overview of all your CPUs at the same time. Use 'd' to change the delay to something approaching real time and you can get a good sense of where 'uptime'/'getloadavg' are getting their aggregate scores.

box293 commented 7 years ago

That is all excellent information, and it makes sense.

What is required is better documentation on thresholds for the different plugins; including this information would help greatly. This is something I plan on publishing in the Nagios Support Knowledgebase in the near future.

As for the example thresholds, I'll leave that up to the devs to decide if we should leave them as they are.

minusdavid commented 7 years ago

Sounds good to me. I was just looking to see if John or I added any extra documentation at the time, but it doesn't look like it.

Cheers for planning on publishing details about the thresholds!

box293 commented 6 years ago

Here is the KB article I've created on this topic. It links back to here.

https://support.nagios.com/kb/article.php?id=771

frayber commented 5 years ago

I agree with box293, they are too low. The old ones:

check_load -w 15,10,5 -c 30,25,20

were too high!

I think good values could be these:

check_load -r -w .8,.6,.5 -c .9,.7,.6
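
To get a feel for what these proposed values mean on a given machine, the per-CPU thresholds can be converted back into the raw figures that 'uptime' would show. The sketch below assumes that `-r` divides the load by the CPU count, so multiplying a threshold by the CPU count gives the raw equivalent. `raw_load_equivalent` is a hypothetical helper, and the CPU counts are just examples.

```python
def raw_load_equivalent(thresholds, ncpu):
    """Convert per-CPU (-r style) thresholds to the raw load 'uptime' would show."""
    return tuple(round(limit * ncpu, 2) for limit in thresholds)


warn = (0.8, 0.6, 0.5)  # proposed -w values
crit = (0.9, 0.7, 0.6)  # proposed -c values

for ncpu in (1, 8):
    print(ncpu, "CPU(s): warn at", raw_load_equivalent(warn, ncpu),
          "crit at", raw_load_equivalent(crit, ncpu))
```

On the 8 CPU system mentioned earlier in the thread, this puts the warning point at a raw load of 6.4, 4.8, 4.0, well above the 1.38, 1.67, 1.83 it was showing.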

minusdavid commented 5 years ago

It's just a sample-config file, but having saner sample values does sound reasonable. Send a pull request?