Linuxfabrik / monitoring-plugins

200+ check plugins for Icinga and other Nagios-compatible monitoring applications. Each plugin is a standalone command line tool (written in Python) that provides a specific type of check.
https://linuxfabrik.ch
The Unlicense

uptime: Use the plugin to warn about recent reboots #722

Closed edpstiffel closed 3 months ago

edpstiffel commented 8 months ago

Describe the solution you'd like

Hi there, first things first: the Linuxfabrik monitoring plugins are great. Thanks for your effort; we use them a lot.

One enhancement idea: sometimes you want to know when a machine has been rebooted, especially if the reboot was unexpected. The uptime plugin could cover this with a different scope: alert when the uptime is less than, for example, 5 minutes, which indicates that the machine was recently rebooted. You could extend the current uptime plugin with another parameter that specifies whether the alarm is raised for going over the limit or for going under it. It would also be an easy task to fork the uptime plugin and change the comparison operator. What do you think?

Patric

markuslf commented 8 months ago

Sounds good, we will add this capability to the plugin.

markuslf commented 8 months ago

This could be achieved either by using --warning=-5m (warn when less than 5 minutes have elapsed) and --warning=5M (warn when more than 5 months have elapsed), or by implementing Nagios ranges for --warning and --critical. Let's see.

edpstiffel commented 8 months ago

That sounds like a good solution. But I think I don't understand your comment: is the use of negative numbers with the mentioned meaning already possible or is that something that you're going to implement?

markuslf commented 8 months ago

This needs to be implemented. For now I am just not sure how (I was more thinking out loud, sorry for the confusion).

xeiss commented 8 months ago

Just a suggestion from a practical point of view: it would be nice if a minimum uptime and a maximum uptime could be checked in parallel. One example: a server should have an uptime of more than 10 minutes but less than 365 days. That way you catch both "unwanted server reboots" and "server needs a proactive restart (kernel updates) at least once a year".

So if you mean Nagios ranges, it would be, for example, --warning=10m:365d, or just --warning=10m: to warn below 10 minutes. I took that from the Nagios Plugins Development Guidelines.

Also, I would set the --critical default to 0 (deactivated). Too little or too much uptime isn't really critical in a normal use case, but someone may still want to configure it.

markuslf commented 3 months ago

I implemented Nagios ranges to get the desired behavior. --warning=10m:1Y indeed means "warn only if uptime is not between 10 minutes and 365 days". Don't forget to update the Linuxfabrik Python libraries as well.

BTW, very high uptimes indicate missing updates and reboots (if kernel live patching is not in place, or on a Windows server), so this is really a security issue - and that's why we warn about it by default. Alternatively, define a higher critical threshold.
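
For illustration only, here is a minimal Python sketch of how a Nagios-style range such as 10m:1Y could be evaluated against an uptime in seconds. It is not the plugin's or the Linuxfabrik library's actual code. The names to_seconds, parse_range and alerts, as well as the unit table (in particular "M" as 30 days and "Y" as 365 days), are assumptions made for this example. The range semantics follow the Nagios Plugin Development Guidelines: alert when the value is outside the range, or inside it if the range is prefixed with "@".

```python
import re

# Assumed unit table (seconds per unit). The suffix letters and the lengths of
# "M" and "Y" are illustrative assumptions, not taken from the plugin.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "D": 86400,
    "W": 7 * 86400,
    "M": 30 * 86400,
    "Y": 365 * 86400,
}


def to_seconds(token):
    """Convert a duration such as '10m' or '1Y' to seconds (bare number = seconds)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhdDWMY]?)", token)
    if match is None:
        raise ValueError(f"invalid duration: {token!r}")
    number, unit = match.groups()
    return float(number) * UNIT_SECONDS.get(unit, 1)


def parse_range(spec):
    """Parse a Nagios-style range into (start, end, inside)."""
    inside = spec.startswith("@")  # "@" inverts the check: alert INSIDE the range
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_token, end_token = spec.split(":", 1)
    else:
        start_token, end_token = "0", spec  # "10" means the range 0..10
    if start_token == "~":
        start = float("-inf")  # "~" stands for negative infinity
    else:
        start = to_seconds(start_token) if start_token else 0.0
    end = to_seconds(end_token) if end_token else float("inf")  # "10:" is open-ended
    if start > end:
        raise ValueError(f"start must not exceed end: {spec!r}")
    return start, end, inside


def alerts(value, spec):
    """Return True if value (in seconds) should raise an alert for this range."""
    start, end, inside = parse_range(spec)
    in_range = start <= value <= end
    return in_range if inside else not in_range


if __name__ == "__main__":
    for uptime in (120, 3600, 400 * 86400):
        print(uptime, alerts(uptime, "10m:1Y"))
```

Under these assumptions, an uptime of 120 seconds or of 400 days alerts for 10m:1Y, while an uptime of one hour does not. The open-ended form 10m: only catches recent reboots.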