Icinga / icinga-core

Icinga 1.x, the old core (EOL 31.12.2018)
GNU General Public License v2.0

[dev.icinga.com #3489] Freshness check stuck due to is_being_freshened not being reset, using mod_gearman #1184

Closed icinga-migration closed 11 years ago

icinga-migration commented 11 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/3489

Created by buraks78 on 2012-12-12 01:49:03 +00:00

Assignee: (none)
Status: Rejected (closed on 2013-02-21 19:17:58 +00:00)
Target Version: (none)
Last Update: 2013-02-21 19:17:58 +00:00 (in Redmine)

Icinga Version: icinga-1.7.2
OS Version: CentOS release 6.3 (Final)

My freshness check executes only once. Once it runs, the is_being_freshened flag is never reset, so the freshness check never executes again. This only happens when I enable the mod_gearman broker module.

My setup is below:

[root@centos62 icinga]# rpm -qa |grep icinga
icinga-1.7.2-1.el6.rf.x86_64
icinga-gui-1.7.2-1.el6.rf.x86_64
icinga-doc-1.7.2-1.el6.rf.x86_64
icinga-api-1.5.1-1.el6.rf.x86_64

[root@centos62 icinga]# rpm -qa | grep gearman
libgearman-0.14-3.el6.2.x86_64
gearmand-0.25-1.x86_64
gearmand-server-0.25-1.x86_64
mod_gearman-1.3.0-1.el6.x86_64

[root@centos62 icinga]# cat /etc/redhat-release 
CentOS release 6.3 (Final)

Here is the configuration for the service:

define service{
        name                            generic-service         
        active_checks_enabled           1                      
        passive_checks_enabled          1                       
        parallelize_check               1                       
        obsess_over_service             1                      
        check_freshness                 0                       
        notifications_enabled           1                       
        event_handler_enabled           1                       
        flap_detection_enabled          1                      
        failure_prediction_enabled      1                      
        process_perf_data               1                       
        retain_status_information       1                     
        retain_nonstatus_information    1                      
        is_volatile                     0                      
        check_period                    24x7                  
        max_check_attempts              3                       
        normal_check_interval           15                      
        retry_check_interval            5                       
        notification_options            w,u,c,r               
        notification_interval           60                      
        notification_period             24x7                  
        register                        0               
        }

define service{
        name                    passive-service
        use                     generic-service
        active_checks_enabled           0                       
        max_check_attempts              1
        check_command                   check_dummy
        register                        0                   
        }

define service{
        name                            passive-perf-service
        use                             passive-service
        action_url                      /pnp4nagios/index.php/graph?host=$HOSTNAME$&srv=$SERVICEDESC$
        register                        0 
        }

define service{
        use                             passive-perf-service
        service_description             My Passive Check
        host_name                       fedora16.localdomain,centos58.localdomain
        servicegroups                   passive_group
        check_freshness                 1
        check_command                   check_mysql_replication
        check_interval                  10
        retry_interval                  5
        freshness_threshold             60
        max_check_attempts              5
        }

I have updated the source code and added some debugging code to checks.c. Here is the code added to handle_async_service_check_result():

if (queued_check_result->check_options & CHECK_OPTION_FRESHNESS_CHECK) {
    temp_service->is_being_freshened = FALSE;
    if (strcmp(temp_service->host_name, "fedora16.localdomain") == 0 && strcmp(temp_service->description, "My Passive Check") == 0) {
        /* note: check_options is an int bitmask; %s is the wrong conversion here and logs a zero value as "(null)" */
        log_debug_info(DEBUGL_CHECKS, 0, "fedora16.localdomain: CHECK_OPTION_FRESHNESS_CHECK matched: check_options: %s\n", queued_check_result->check_options);
    }
} else {
    if (strcmp(temp_service->host_name, "fedora16.localdomain") == 0 && strcmp(temp_service->description, "My Passive Check") == 0) {
        log_debug_info(DEBUGL_CHECKS, 0, "fedora16.localdomain: CHECK_OPTION_FRESHNESS_CHECK did not match: check_options: %s\n", queued_check_result->check_options);
    }
}

/* clear the execution flag if this was an active check */
if (queued_check_result->check_type == SERVICE_CHECK_ACTIVE) {
    temp_service->is_executing = FALSE;
    if (strcmp(temp_service->host_name, "fedora16.localdomain") == 0 && strcmp(temp_service->description, "My Passive Check") == 0) {
        log_debug_info(DEBUGL_CHECKS, 0, "fedora16.localdomain: My Passive Check service is considered active so is_executing is set to false\n");
    }
} else {
    if (strcmp(temp_service->host_name, "fedora16.localdomain") == 0 && strcmp(temp_service->description, "My Passive Check") == 0) {
        log_debug_info(DEBUGL_CHECKS, 0, "fedora16.localdomain: My Passive Check service is considered passive so is_executing is left alone.\n");
    }
}

And here is the code added to check_service_result_freshness():

/* skip services that are currently executing (problems here will be caught by orphaned service check) */
if (temp_service->is_executing == TRUE) {
    if (strcmp(temp_service->host_name, "fedora16.localdomain") == 0 && strcmp(temp_service->description, "My Passive Check") == 0) {
        log_debug_info(DEBUGL_CHECKS, 1, "fedora16.localdomain test is being executed.\n");
    }
    continue;
}

/* skip services that have both active and passive checks disabled */
if (temp_service->checks_enabled == FALSE && temp_service->accept_passive_service_checks == FALSE)
    continue;

/* skip services that are already being freshened */
if (temp_service->is_being_freshened == TRUE) {
    if (strcmp(temp_service->host_name, "fedora16.localdomain") == 0 && strcmp(temp_service->description, "My Passive Check") == 0) {
        log_debug_info(DEBUGL_CHECKS, 1, "fedora16.localdomain test is being freshened.\n");
    }
    continue;
}

The following output is logged when the freshness check is scheduled:

[1355270250.038105] [016.2] [pid=1996] Checking freshness of service 'My Passive Check' on host 'fedora16.localdomain'...
[1355270250.038211] [016.1] [pid=1996] Check results for service 'My Passive Check' on host 'fedora16.localdomain' are stale by 0d 23h 16m 14s (threshold=0d 0h 1m 0s).  Forcing an immediate check of the service...
[1355270250.038229] [016.1] [pid=1996] Scheduling service for fedora16.localdomain.
[1355270250.038252] [016.0] [pid=1996] Scheduling a forced, active check of service 'My Passive Check' on host 'fedora16.localdomain' @ Tue Dec 11 15:57:30 2012
[1355270250.038262] [016.2] [pid=1996] Scheduling new service check event for 'My Passive Check' on host 'fedora16.localdomain' @ Tue Dec 11 15:57:30 2012

The following output is logged for the first freshness check result processed by the handle_async_service_check_result() call:

[1355270741.091566] [016.1] [pid=1996] Handling check result for service 'My Passive Check' on host 'fedora16.localdomain'...
[1355270741.091585] [016.0] [pid=1996] ** Handling check result for service 'My Passive Check' on host 'fedora16.localdomain'...
[1355270741.091592] [016.1] [pid=1996] HOST: fedora16.localdomain, SERVICE: My Passive Check, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CHECK_MYSQL_REPLICATION CRITICAL: FRESHNESS CHECK: Replication is 1483583 seconds behind its master: centos58.localdomain | 'seconds behind'=1483583;60;3600\n
[1355270741.091605] [016.0] [pid=1996] fedora16.localdomain: CHECK_OPTION_FRESHNESS_CHECK did not match: check_options: (null)
[1355270741.091614] [016.0] [pid=1996] fedora16.localdomain: My Passive Check service is considered active so is_executing is set to false

As you can see above, queued_check_result->check_options comes back as 0 (logged as "(null)" because of the %s specifier), so the is_being_freshened flag is never reset. As a result, after the initial check runs, the freshness check is stuck in limbo: check_service_result_freshness() skips the service indefinitely. See the debugging output below:

[root@centos62 icinga]# cat icinga.debug | grep fedora
[1355272110.009044] [016.1] [pid=1996] fedora16.localdomain test is being freshened.
[1355272170.028035] [016.1] [pid=1996] fedora16.localdomain test is being freshened.
[1355272230.007756] [016.1] [pid=1996] fedora16.localdomain test is being freshened.
[1355272290.039679] [016.1] [pid=1996] fedora16.localdomain test is being freshened.
[1355272350.018998] [016.1] [pid=1996] fedora16.localdomain test is being freshened.
[1355272410.025875] [016.1] [pid=1996] fedora16.localdomain test is being freshened.

It looks like even though the check options are set properly inside check_service_result_freshness(), they get lost or overwritten once the event is scheduled:

[1355275229.033675] [001.0] [pid=2079] schedule_service_check()
[1355275229.033710] [016.0] [pid=2079] Scheduling a forced, active check of service 'My Passive Check' on host 'fedora16.localdomain' @ Tue Dec 11 17:20:29 2012
*[1355275229.033762] [016.2] [pid=2079] Scheduling new service check event for 'My Passive Check' on host 'fedora16.localdomain' @ Tue Dec 11 17:20:29 2012*
[1355275229.033778] [001.0] [pid=2079] reschedule_event()
[1355275229.033785] [001.0] [pid=2079] add_event()
[1355275229.037491] [001.0] [pid=2079] handle_timed_event() end
[1355275229.037524] [001.0] [pid=2079] reschedule_event()
[1355275229.037535] [001.0] [pid=2079] add_event()
icinga-migration commented 11 years ago

Updated by mfriedrich on 2012-12-12 08:20:37 +00:00

AFAIK mod_gearman fiddles with the core's internal objects in memory, and also copies some of the algorithms the core uses to perform such checks. My guess (of course without looking at the code, as I do not have time right now) is that this is a mod_gearman issue, because, as you mentioned, it works perfectly fine without the broker module. Did you send your report to the mod_gearman bug tracker as well?

icinga-migration commented 11 years ago

Updated by mfriedrich on 2012-12-12 08:24:51 +00:00

icinga-migration commented 11 years ago

Updated by buraks78 on 2012-12-12 22:04:52 +00:00

I was not expecting mod_gearman to modify core objects. However, I have submitted a bug report there as well. Here is the link for reference: https://github.com/sni/mod_gearman/issues/36

icinga-migration commented 11 years ago

Updated by mfriedrich on 2013-02-21 19:17:58 +00:00

Since the core works fine without mod_gearman, I am marking this invalid as a core issue. If mod_gearman fixes its buggy behaviour somehow, please discuss it over there.