Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Child host becomes unreachable in redundant group when parent is unreachable #10014

Open jjuanino opened 8 months ago

jjuanino commented 8 months ago

Describe the bug

Dear community, unless I am misunderstanding something, redundancy groups do not work as expected. Consider the following setup:

   poc_grand_parent 
             |
             |
             | (classic dependency)
             |
             | 
       poc_parent_0                                     poc_parent_1
        \                                                /
          \        (redundant dependency)              /   
            \                                        / 
              \                                    /
                 \                               /
                    \                          /
                        \                 /
                             poc_child

The issue is as follows: when poc_grand_parent goes down, poc_parent_0 becomes unreachable (as expected), but so does poc_child, which is unexpected.

To Reproduce

Consider the following setup:

object Host "poc_grand_parent" { check_command = "dummy"; vars.dummy_state = 2; }
object Host "poc_parent_0" { check_command = "dummy"; vars.dummy_state = 0; }
object Host "poc_parent_1" { check_command = "dummy"; vars.dummy_state = 0; }
object Host "poc_child" { check_command = "dummy"; vars.dummy_state = 0;}

object Dependency "parent_0_to_grandparent" {
    child_host_name = "poc_parent_0"
    parent_host_name = "poc_grand_parent"
}

for (i in range(2)) {
    object Dependency "dep-" + i use (i) {
        child_host_name = "poc_child"
        parent_host_name = "poc_parent_" + i
        redundancy_group = "broken_red_deps"
    }
}

Expected behavior

The expected behavior is that the poc_child host remains reachable regardless of the state of poc_grand_parent, because poc_parent_1 is still up and reachable.

Screenshots

image

Your Environment

Include as many relevant details about the environment you experienced the problem in

Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later https://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Red Hat Enterprise Linux
  Platform version: 8.9 (Ootpa)
  Kernel: Linux
  Kernel version: 4.18.0-513.11.1.el8_9.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 8.5.0
  Build host: ol8-template.localdomain
  OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021

Application information:

  General paths:
    Config directory: /usr/local/icinga2/etc/icinga2
    Data directory: /usr/local/icinga2/var/lib/icinga2
    Log directory: /usr/local/icinga2/var/log/icinga2
    Cache directory: /usr/local/icinga2/var/cache/icinga2
    Spool directory: /usr/local/icinga2/var/spool/icinga2
    Run directory: /usr/local/icinga2/var/run/icinga2

  Old paths (deprecated):
    Installation root: /usr/local/icinga2
    Sysconf directory: /usr/local/icinga2/etc
    Run directory (base): /usr/local/icinga2/var/run
    Local state directory: /usr/local/icinga2/var

  Internal paths:
    Package data directory: /usr/local/icinga2/share/icinga2
    State path: /usr/local/icinga2/var/lib/icinga2/icinga2.state
    Modified attributes path: /usr/local/icinga2/var/lib/icinga2/modified-attributes.conf
    Objects path: /usr/local/icinga2/var/cache/icinga2/icinga2.debug
    Vars path: /usr/local/icinga2/var/cache/icinga2/icinga2.vars
    PID path: /usr/local/icinga2/var/run/icinga2/icinga2.pid


* Operating System and version:

$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"


* Enabled features (`icinga2 feature list`):

$ icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite ido-mysql ido-pgsql influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker icingadb influxdb mainlog notification



* Icinga Web 2 version and modules (System - About):
![image](https://github.com/Icinga/icinga2/assets/18243635/1b6cf6e0-a39b-4002-8dfe-cf7412d9a2ba)

Al2Klimov commented 7 months ago

Hello Jose!

Is it only Icinga Web that misreports the reachability, or does the Icinga 2 API report it incorrectly, too?

Best, A/K

jjuanino commented 7 months ago

Hi Alexander,

in the icinga2 console I get the following (output snipped):

<1> => get_host("poc_child")
{
    __name = "poc_child"
    check_attempt = 1.000000
    check_command = "dummy"
    check_interval = 300.000000
    last_check_result = {
        active = true
        command = "dummy"
        exit_status = 0.000000
        output = "Check was successful."
        previous_hard_state = 99.000000
        vars_after = {
            attempt = 1.000000
            reachable = false    ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
            state = 0.000000
            state_type = 1.000000
        }
        vars_before = {
            attempt = 1.000000
            reachable = true
            state = 0.000000
            state_type = 1.000000
        }
    }
    last_reachable = false   ◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄◄
}

Best regards
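
For anyone who wants to check the same attribute over the REST API rather than the console, here is a minimal Python sketch. It assumes the standard object query endpoint (`/v1/objects/hosts/<name>` with an `attrs` filter) and HTTP basic auth. The base URL, user name and password are placeholders, and the sample payload only mirrors the general `results[].attrs` response shape; it is not output captured from this setup.

```python
import base64
import urllib.parse
import urllib.request


def build_reachability_request(base_url, host_name, user, password):
    """Build (but do not send) a GET request for a host's last_reachable attribute."""
    quoted = urllib.parse.quote(host_name, safe="")
    url = f"{base_url}/v1/objects/hosts/{quoted}?attrs=last_reachable"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def extract_last_reachable(payload):
    """Pull last_reachable out of a decoded object query response."""
    return payload["results"][0]["attrs"]["last_reachable"]


# Placeholder connection details (assumptions, adjust to your environment).
request = build_reachability_request(
    "https://localhost:5665", "poc_child", "api-user", "api-password"
)

# Illustrative payload in the shape of an object query response.
sample_payload = {"results": [{"attrs": {"last_reachable": False}}]}
reachable = extract_last_reachable(sample_payload)
```

Sending the request with `urllib.request.urlopen` additionally needs a TLS context that trusts the Icinga 2 CA certificate, which is left out here.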

Al2Klimov commented 5 months ago

> The issue is as follows: when poc_grand_parent becomes down, poc_parent_0 becomes unreachable (as usual), but poc_child also, which is unexpected.

poc_child does indeed seem to misbehave, but only once it has itself been checked again after poc_grand_parent went down.

jjuanino commented 5 months ago

Yes, that is right, you have to check the services several times to reproduce the issue. The test presented is a bit contrived to show the behavior, but in the real world you get the issue in a more natural way. Regards.

nilmerg commented 2 months ago

I'm also able to reproduce this with just a single check now on the child.

nilmerg commented 1 month ago

I was reading the code for an unrelated reason and noticed the cause of this issue. Checkable::IsReachable checks whether any parent is unreachable before it considers any redundancy groups, so redundancy groups only take effect if all parents are reachable. In the setup above, poc_parent_0 is unreachable, so poc_child is reported as unreachable even though poc_parent_1 is up and reachable.
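
To make the ordering problem concrete, here is a small Python model of the behavior described above. It is a simplified sketch, not the Icinga 2 source: the `Host` and `Dep` classes and both function names are hypothetical, and a dependency is treated as available whenever its parent is up. The second function is only one possible group-aware ordering for contrast, not a proposed or actual fix.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    up: bool = True
    deps: list = field(default_factory=list)


@dataclass
class Dep:
    parent: Host
    group: str = ""  # empty string means "no redundancy group"


def _groups_satisfied(results):
    """results: list of (group, ok). Plain deps must all be ok; each group needs one ok."""
    group_ok = {}
    for group, ok in results:
        if not group:
            if not ok:
                return False
        else:
            group_ok[group] = group_ok.get(group, False) or ok
    return all(group_ok.values())


def is_reachable_described(host):
    """Ordering as described in this issue: parents first, groups second."""
    # Any unreachable parent makes the host unreachable, whatever its group.
    for dep in host.deps:
        if not is_reachable_described(dep.parent):
            return False
    # Redundancy groups are only evaluated if every parent was reachable.
    return _groups_satisfied([(d.group, d.parent.up) for d in host.deps])


def is_reachable_group_aware(host):
    """Hypothetical alternative: parent reachability counts per dependency."""
    results = [
        (d.group, d.parent.up and is_reachable_group_aware(d.parent))
        for d in host.deps
    ]
    return _groups_satisfied(results)


# Topology from the reproduction config.
grand_parent = Host("poc_grand_parent", up=False)
parent_0 = Host("poc_parent_0", deps=[Dep(grand_parent)])
parent_1 = Host("poc_parent_1")
child = Host(
    "poc_child",
    deps=[Dep(parent_0, "broken_red_deps"), Dep(parent_1, "broken_red_deps")],
)

print(is_reachable_described(parent_0))  # False: its only parent is down
print(is_reachable_described(child))     # False: the behavior reported here
print(is_reachable_group_aware(child))   # True: poc_parent_1 satisfies the group
```

In this model the only difference between the two functions is where the recursion into parents happens: outside the group evaluation (one unreachable parent is fatal) or inside it (an unreachable parent merely fails its own dependency).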