NagiosEnterprises / nagioscore

Nagios Core
GNU General Public License v2.0

Service dependency does not work as Nagios Core schedules the checks in alphabetical order #780

Open gvfnix opened 4 years ago

gvfnix commented 4 years ago

I'd like Nagios Core 4.4.5 to suspend NRPE-based checks when the NRPE port is unreachable on a host. I also set soft_state_dependencies=1 in nagios.cfg so that dependencies use the latest check result. To test this, I created a simple configuration in Nagios:

define hostgroup {
    hostgroup_name  nagios-host
}

define host {
    host_name   host1
    address 10.20.30.40
    hostgroups +nagios-host
    check_period 24x7
    check_interval 1
    max_check_attempts 2
    contact_groups testing
    notification_interval 180
    notification_period never
}

define service {
    service_description nrpe_agent_running
    display_name NRPE agent running
    hostgroup_name nagios-host
    check_command check_tcp!5666
    max_check_attempts 2
    check_period 24x7
}

define service {
    service_description fstab_mounted
    display_name fstab is mounted
    hostgroup_name nagios-host
    check_command nrpe!check_mount
    max_check_attempts 2
    check_period 24x7
}

define servicedependency {
    hostgroup_name nagios-host
    service_description nrpe_agent_running
    dependent_service_description fstab_mounted
    inherits_parent 1
    execution_failure_criteria w,u,c,p
    notification_failure_criteria w,u,c,p
}

To apply this configuration, I stop Nagios, wipe out the retention.dat and objects.cache files in the /usr/local/nagios/var directory, and then start Nagios again.

After Nagios starts, both services are pending, but the fstab_mounted service check is scheduled before nrpe_agent_running: Selection_016

When Nagios executes the check for the fstab_mounted service, it puts the service in the CRITICAL state: Selection_017

Then the check for nrpe_agent_running gets executed and both services turn red: Selection_018

I noticed that if I just rename fstab_mounted to z_fstab_mounted, so that this service description sorts alphabetically after nrpe_agent_running, then everything works fine. Both services are grey: image

Then nrpe_agent_running turns red: image

And then Nagios keeps rescheduling checks for z_fstab_mounted without executing them: image

I believe that the service description should not influence the dependency feature in that manner.

dbray925 commented 6 months ago

Wow, 4 years later and just stumbled onto this same issue. Any updates?

ericloyd commented 6 months ago

Just make a check that does check_nrpe with no arguments. If it comes back successfully, then NRPE is working on that host. Then make your NRPE-based checks dependent on that check not being in a CRITICAL state, and you've just solved the problem.

ericloyd commented 6 months ago

Whoops. Wrong editor. Hit "Comment" accidentally. To continue...

Since Nagios doesn't know what technology you're using to check something, it can't just suspend checks when NRPE is not working. You have to do that by hand. You know. With dependencies. :-)
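A minimal sketch of that suggestion, reusing the nagios-host hostgroup and the fstab_mounted service from the original report. The command name check_nrpe_alive is hypothetical, the $USER1$ plugin path is the usual resource.cfg default and may differ on your system, and any directives not shown are assumed to come from your own templates. Running check_nrpe without a -c argument only asks the NRPE daemon to respond, so the check fails when the agent is unreachable. Note that this only illustrates the dependency setup; it does not by itself change the order in which the checks are scheduled.

```cfg
# Hypothetical command: check_nrpe with no -c argument just contacts the daemon.
define command {
    command_name    check_nrpe_alive
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$
}

# Master service: is the NRPE agent answering at all?
define service {
    service_description     check_nrpe
    hostgroup_name          nagios-host
    check_command           check_nrpe_alive
    max_check_attempts      2
    check_period            24x7
}

# Skip execution and notifications of the NRPE-based check
# while the master service is CRITICAL.
define servicedependency {
    hostgroup_name                  nagios-host
    service_description             check_nrpe
    dependent_service_description   fstab_mounted
    execution_failure_criteria      c
    notification_failure_criteria   c
}
```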

dbray925 commented 6 months ago

But what if the service_description is "check_nrpe" and the dependent_service_description is "aaa_check"? We'll always get alerted on "aaa_check" first, and then get an additional alert on "check_nrpe", because of the alphabetical issue @gvfnix discovered.

I just checked this on version 4.4.10, and can duplicate the issue. Meaning, this causes one alert, just the check_nrpe:

define servicedependency{
        host_name                       uniquehostnamehere
        service_description             check_nrpe
        dependent_service_description   aaa_check
        execution_failure_criteria      w,u,c,p
        notification_failure_criteria   w,u,c,p
        }

This causes two alerts, both aaa_check and check_nrpe:

define servicedependency{
        host_name                       uniquehostnamehere
        service_description             aaa_check
        dependent_service_description   check_nrpe
        execution_failure_criteria      w,u,c,p
        notification_failure_criteria   w,u,c,p
        }

ericloyd commented 6 months ago

So what solution would you like to see implemented?

dbray925 commented 6 months ago

Pretty much like @gvfnix said, "service description should not influence the dependency feature in that manner". How to go about that, I'm not sure. For now, if we want to use the dependency feature properly, we'll have to rename our service_description tags with "0001, 0002, etc." in front of them and go from there. It would be great if display_name worked with the CGI; then we wouldn't have the cosmetic issue this renaming will cause.
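As a rough illustration of that renaming workaround, here is @gvfnix's original configuration with numeric prefixes added so that the master service sorts before the dependent one. The underscore separator and the exact prefix format are assumptions, not something Nagios requires. The display_name values are kept as in the original, although the CGI would still show the prefixed service_description.

```cfg
# Master service: prefixed so it sorts first alphabetically.
define service {
    service_description     0001_nrpe_agent_running
    display_name            NRPE agent running
    hostgroup_name          nagios-host
    check_command           check_tcp!5666
    max_check_attempts      2
    check_period            24x7
}

# Dependent service: prefixed so it sorts after the master.
define service {
    service_description     0002_fstab_mounted
    display_name            fstab is mounted
    hostgroup_name          nagios-host
    check_command           nrpe!check_mount
    max_check_attempts      2
    check_period            24x7
}

# The dependency must reference the renamed descriptions.
define servicedependency {
    hostgroup_name                  nagios-host
    service_description             0001_nrpe_agent_running
    dependent_service_description   0002_fstab_mounted
    inherits_parent                 1
    execution_failure_criteria      w,u,c,p
    notification_failure_criteria   w,u,c,p
}
```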

ericloyd commented 6 months ago

So you want Nagios to process the service dependencies in the order in which they are listed in the dependency? What if you have multiple, separate dependencies; how would it know what to do, then? And this might be just for Nagios Core but Nagios XI needs to know how to create the Core config files to match the order you want when the time comes for that, so it's a sticky wicket.

dbray925 commented 6 months ago

Agreed, it gets a little tricky. The display_name fix would help with the overall issue in that case. It wouldn't matter what the service_description is at that point; the end user would still see the proper (descriptive) display_name in the web UI.