Icinga / icinga-core

Icinga 1.x, the old core (EOL 31.12.2018)
GNU General Public License v2.0
45 stars 27 forks source link

Service hard state gets reset to soft state after host recovers from down state #1594

Closed mittma closed 6 years ago

mittma commented 7 years ago

We had a server reboot that took too long, which notified a service problem, but no recovery message was sent out after the service was running again.

According to our logs the service state was reset from hard to soft after the host recovered from its short down state (check_interval is 5, retry_interval is 5 and maxcheck attempts is 4):

[06:55:00] SERVICE ALERT: host123;XenApp Core Services;CRITICAL;SOFT;1;CRITICAL cpsvc is Stopped [07:00:03] SERVICE ALERT: host123;XenApp Core Services;CRITICAL;SOFT;2;CRITICAL cpsvc is Stopped [07:05:01] SERVICE ALERT: host123;XenApp Core Services;CRITICAL;SOFT;3;CRITICAL cpsvc is Stopped [07:10:01] SERVICE ALERT: host123;XenApp Core Services;CRITICAL;HARD;4;CRITICAL cpsvc is Stopped [07:10:01] SERVICE NOTIFICATION: team-incident;host123;XenApp Core Services;CRITICAL;notify-service-by-incident;CRITICAL cpsvc is Stopped [07:14:25] HOST ALERT: host123;DOWN;SOFT;1;CRITICAL: IPv4/host123.contoso.com CRITICAL [07:17:27] HOST ALERT: host123;UP;SOFT;2;OK: IPv4/host123.contoso.com OK [07:20:01] SERVICE ALERT: host123;XenApp Core Services;CRITICAL;SOFT;1;CRITICAL cpsvc is Stopped [07:25:01] SERVICE ALERT: host123;XenApp Core Services;CRITICAL;SOFT;2;CRITICAL cpsvc is Stopped [07:30:03] SERVICE ALERT: host123;XenApp Core Services;OK;SOFT;3;OK cpsvc is Running

No recovery notification was sent because it was a soft recovery - which is wrong!?

Google found an old bug (note 0000139) which seems to be right:

In checks.c the number of the current_attempt gets set to 1, if the host is down:

  if (route_result != HOST_UP) {

      log_debug_info(DEBUGL_CHECKS, 2, "Host is not UP, so we mark state changes if appropriate\n");

      ...SKIP...

      /* put service into a hard state without attempting check retries and don't send out notifications about it */
      temp_service->host_problem_at_last_check = TRUE;
      temp_service->state_type = HARD_STATE;
      temp_service->last_hard_state = temp_service->current_state;
      temp_service->current_attempt = 1;
  }

This causes the service state to be set to SOFT as soon as the host recovers:

  /* if we should retry the service check, do so (except it the host is down or unreachable!) */
  if (temp_service->current_attempt < temp_service->max_attempts) {

      /* the host is down or unreachable, so don't attempt to retry the service check */
      if (route_result != HOST_UP) {

      ...SKIP...

      }

      /* the host is up, so continue to retry the service check */
      else {

          log_debug_info(DEBUGL_CHECKS, 1, "Host is UP, so we'll retry the service check...\n");

          /* this is a soft state */
          if (temp_service->current_attempt < temp_service->max_attempts) {
              temp_service->state_type = SOFT_STATE;
          }

      ...SKIP...

      }

  ...SKIP...

  }

In my opinion the current_attempt shouldn't be set to 1 (but I'm not sure if that causes any problems in other parts of the code).

dnsmichi commented 7 years ago

Probably it will break other things, which is why we need proper tests for such changes.

dnsmichi commented 7 years ago

I did some research in the last couple of days. You're right, resetting the counter is pretty much wrong. It tries to fake a hard state to prevent notifications as such. Later on this leads into the SOFT recovery.

The Nagios code removed that code parts a while ago.

https://github.com/NagiosEnterprises/nagioscore/commit/7477b95ce211335d3ff9ced45595d903d0f95d31 http://tracker.nagios.org/view.php?id=128 (Patch at the end is not applied in Icinga). https://github.com/NagiosEnterprises/nagioscore/commit/38050baa4bd20574b148602a3a34c3b194b2a300 (patch is applied in Icinga)

90af9261c41525de9085b5ea658a265f456e9f85

Note: This if condition https://github.com/Icinga/icinga-core/blob/master/base/checks.c#L1645 is redundant to the if condition already enclosing it here https://github.com/Icinga/icinga-core/blob/master/base/checks.c#L1618 (dear god, that code would need cleanup)

I'd guess that removing the block with HARD_STATE until current_attempt = 1 (3 lines) does not hurt. It will leave the service it its current state. Can you try commenting those out and check whether it fixes your problem?