Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Spurious problem notifications after dependency recovery #8324

Open efuss opened 3 years ago

efuss commented 3 years ago

I had a hard time finding out why, sometimes, I would receive problem notifications which should have been suppressed by dependencies. False notifications are more than a nuisance; they may distract you from the root cause of a problem.

After some days of investigation, I found four areas of possible problems:

  1. While there is an implicit dependency of a service on its host, a recovering host typically starts responding to ICMP echo requests (aka pings) much earlier than services running on that host can be expected to be operational: On a server, pings are answered as soon as the kernel is running while most services will only run when multi-user mode has been reached. Similar behaviour may be observed on large printers, switches or access points. Even the ping service itself may take longer to recover than the host status because it imposes more stringent restrictions on packet loss or RTT. During the recovery period, services may be driven into a HARD state, causing notifications to be sent.

  2. Service Problem Notifications that have been suppressed while the host was down (or other dependencies were violated) may later be fired by a timer routine in case the service is still non-OK, dependencies are met and the service is considered not likely to be checked soon. I can't find a rationale for the method used to check the latter condition in Checkable::IsLikelyToBeCheckedSoon(), which reads

    auto threshold (GetCheckInterval() - 10);
    
    if (threshold > 60) {
        threshold = 60;
    } else if (threshold < 0) {
        threshold = 0;
    }
    
    return GetNextCheck() <= Utility::GetTime() + threshold;

    If I read this correctly, then, no matter what check_interval I configure, there is, directly after a dependency has recovered, a ten-second window between the last (unsuccessful, dependency NG) check and the next one that could be successful. If the timer happens to fire within that window, a Service Problem Notification will be sent, only to be contradicted some seconds later. Maybe I fail to understand the logic, but it appears to me that simply

    return GetNextCheck() <= Utility::GetTime() + 60;

    would do better.

  3. Icinga has no notion of the interval in which passive check results are expected to drop in. With freshness checks, check_interval will typically be configured much larger than the expected interval. Worse, as noted e.g. in #5022, enable_passive_checks defaults to true, so Icinga has no notion of whether passive checks are used at all for a given service. This means, for example, that in the situation mentioned above, IsLikelyToBeCheckedSoon may think it will take up to two minutes until the next check while in fact it will take less than ten seconds.

  4. While a dependency can stop active checks being scheduled, it doesn't prevent passive results being accepted. External daemons delivering passive check results may have no notion of a host being down or a dependency being violated. This may inadvertently drive services into a HARD state.
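
To make the arithmetic behind item 2 easier to follow, here is a minimal standalone model of the two threshold variants. It is not Icinga code: the function names `ClampedThreshold` and `UncoveredWindow` are hypothetical, and it assumes, as a simplification, that the next check is scheduled exactly check_interval seconds after the last one (no retry_interval, no scheduling jitter). Under that assumption it computes how long, directly after a check, the condition `GetNextCheck() <= Utility::GetTime() + threshold` stays false.

```cpp
// Standalone model of the quoted threshold logic; not Icinga code.
// All function names below are hypothetical.

// Threshold as in the quoted snippet: check_interval - 10, clamped to [0, 60].
constexpr double ClampedThreshold(double checkInterval)
{
    return checkInterval - 10.0 > 60.0 ? 60.0
         : checkInterval - 10.0 < 0.0  ? 0.0
         : checkInterval - 10.0;
}

// Seconds directly after a check during which
// "next_check <= now + threshold" is false, assuming (simplification)
// that the next check is due exactly checkInterval after the last one.
constexpr double UncoveredWindow(double checkInterval, double threshold)
{
    return checkInterval > threshold ? checkInterval - threshold : 0.0;
}

// check_interval = 30 s: the clamped threshold leaves a 10 s window,
// a fixed 60 s threshold leaves none.
static_assert(UncoveredWindow(30.0, ClampedThreshold(30.0)) == 10.0, "clamped: 10 s gap");
static_assert(UncoveredWindow(30.0, 60.0) == 0.0, "fixed 60 s: no gap");

// check_interval = 300 s: both variants leave the first 240 s uncovered.
static_assert(UncoveredWindow(300.0, ClampedThreshold(300.0)) == 240.0, "clamped: 240 s gap");
static_assert(UncoveredWindow(300.0, 60.0) == 240.0, "fixed 60 s: 240 s gap");
```

In this model, the clamped variant leaves a ten-second gap for any check_interval between 10 and 70 seconds, while a fixed 60-second threshold closes the gap for intervals of up to 60 seconds. For longer intervals the two variants behave the same.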

I didn't touch the deprecated livestatus module to add the new attributes. I did try to add them to both icingadb and db_ido and update the MySQL/PostgreSQL schemata, but didn't add a schema update SQL file.

Al2Klimov commented 3 years ago

Note: as the author wrote, he has already addressed the issues himself.