Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Passive checks with freshness interval switch immediately to HARD state #8383

Closed: milan-koudelka closed this issue 3 years ago

milan-koudelka commented 4 years ago

Describe the bug

We are using passive checks with a freshness interval. Unfortunately, when we deploy a new host with a passive check, the check switches to a HARD failed state almost immediately. A newly deployed service should instead be created with service.last_check set to the current time and wait until check_interval expires.

To Reproduce

  1. Create a host with a service in passive mode and active checks enabled for the freshness interval (e.g. 24 hours).
  2. (Re)start Icinga 2.
  3. The service will be in a pending state for a while. If you inspect service.last_check, it will be 1970-01-01 00:59:59 +0100 (see the console sketch below).
  4. After a minute the freshness check runs and the check switches to the predefined state, usually UNKNOWN.
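
For step 3, the last check time can be inspected in the Icinga 2 debug console (a sketch; the API credentials and the host/service names from the configuration below are assumptions):

# icinga2 console --connect 'https://root:icinga@localhost:5665/'
var s = get_service("stg3-connectors-vertica01.XXX", "vertica backup")
s.last_check                        // -1 for a never-checked service
DateTime(s.last_check).to_string()  // "1970-01-01 00:59:59 +0100" in a +01:00 timezone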

Expected behavior

When I deploy a new host, I don't expect to receive all passive check results within a minute. Passive checks usually run at longer intervals (e.g. a backup task). I expect the passive check state to be somehow initialized at creation time.

Your Environment

Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command ido-mysql mainlog notification

Additional context

template Service "passive-service" {
  check_command = "dummy"
  enable_passive_checks = 1
  enable_active_checks = 1
  check_interval = 30m

  # Disable notifications in case the host has notifications disabled as well
  if (host.enable_notifications) {
    enable_notifications = true
  } else {
    enable_notifications = false
  }

  /* Use a runtime function to retrieve the last check time and more details. */
  vars.dummy_text = {{
    var service = get_service(macro("$host.display_name$"), macro("$service.name$"))
    var lastCheck = DateTime(service.last_check).to_string()
    return "No check results received. Last result time: " + lastCheck + " " + service.check_attempt + " / " + service.state_type
  }}
  vars.dummy_state = 3
  vars.enable_pagerduty = true
}

object Host "stg3-connectors-vertica01.XXX" { import "develop-host" display_name = "stg3-connectors-vertica01.XXX" address = "XXX" vars.gdc_services = [ "vertica-backup" ] enable_notifications = true }

apply Service "vertica backup" { import "passive-service" display_name = "Vertica backup" check_interval = 29h max_check_attempts = 1 assign where "vertica-backup" in host.vars.gdc_services }

Al2Klimov commented 4 years ago

Hm... we could make "now" the default last check time. But passive checks are enabled by default, so we can't easily and reliably tell whether a service is actually checked passively. If we changed the default for all services with passive checks enabled, we'd also delay the active checks, so effectively all new services would be pending for, say, 5m.

@lippserd Could everyone live w/ the latter?

milan-koudelka commented 3 years ago

@Al2Klimov Do I understand you correctly that you would change the behavior so that the first check runs only after check_interval, regardless of whether it is active or passive?

Al2Klimov commented 3 years ago

Yes, that’s what I suggested.

milan-koudelka commented 3 years ago

I think it would even help Icinga's performance. When I deploy a lot of new hosts, all checks are queued immediately, even those I'd rather run only hourly or daily (e.g. certificate expiration). But I can understand that it would be painful to find out only after a day in production that your new host has an invalid certificate :-D

lippserd commented 3 years ago

> Hm... we could make "now" the default last check time. But passive checks are enabled by default, so we can't easily and reliably tell whether a service is actually checked passively. If we changed the default for all services with passive checks enabled, we'd also delay the active checks, so effectively all new services would be pending for, say, 5m.
>
> @lippserd Could everyone live w/ the latter?

I don't think this is an option due to the different check intervals. The first freshness check should only be triggered when the check interval has been exceeded. Maybe we need a creation_time attribute for that - we don't have it, right?

Al2Klimov commented 3 years ago

Right.

Do you consider such a workaround reasonable? https://community.icinga.com/t/downtime-for-new-hosts/3819/3?u=al2klimov

milan-koudelka commented 3 years ago

I read that thread about the creation time and applying a downtime to all hosts. It is a nice solution, but I'm not sure whether it will truly work at all, let alone for this case.

1/ We used a similar approach: after deploying a new host, we immediately set a downtime of 1 hour through the API for the host and all its services. It was faulty. Sometimes, because the request took a while to process, we tried to set a downtime whose start time was already in the past, and Icinga had problems with such settings. We switched to a solution where we disable notifications and re-enable them after one hour. Your solution is better: it is just a small piece of code, no API calls, and it can probably be tuned to set the downtime on all services as well. However, will Icinga be happy if the creation time is in the past? Will it work then? I'm also not sure whether this past downtime will be applied to older hosts that already exist in the configuration. That would mean a lot of downtime objects in Icinga 2, which can lead to worse performance.

2/ For this case, we would have to set a downtime for passive check services whose length is based on the freshness check interval. I'm not sure whether I can do that with code similar to the one you mentioned in the thread.

For context, I have copy-pasted the code I mentioned in this comment below.

object Host "example.com" {
  vars.created_at = 1234567890
}

apply Downtime "pre-prod" to Host {
  assign where true
  start_time = host.vars.created_at
  end_time = start_time + 1h * 24 * 7
}

Al2Klimov commented 3 years ago

  1. You would only miss the downtime start notification (#7896), and downtimes that are already too old vanish ASAP.
  2. You just... apply Downtime "pre-prod" to Service?

milan-koudelka commented 3 years ago

Ok, I could do something like this.

object Host "example.com" {
  vars.created_at = 1234567890
}

apply Downtime "pre-production-downtime" to Service {
  author = "icingaadmin"
  comment = "Scheduled downtime for new passive checks until the first freshness check interval expires"
  start_time = host.vars.created_at
  end_time = start_time + service.check_interval
  assign where true
}

Or maybe I don't need to care about created_at and check_interval at all. All new checks have service.last_check set to 1970-01-01, so I could probably disable active checks for these services completely until at least one check result has been reported. It could be dangerous if the check doesn't work from the beginning at all, though.

template Service "passive-service" {
  check_command = "passive"
  enable_passive_checks = 1

  # Disable notifications in case the host has notifications disabled as well
  if (host.enable_notifications) {
    enable_notifications = true
  } else {
    enable_notifications = false
  }

  /* Use a runtime function to retrieve the last check time and more details. */
  vars.dummy_text = {{
    var service = get_service(macro("$host.display_name$"), macro("$service.name$"))
    var lastCheck = DateTime(service.last_check).to_string()
    return "No check results received. Last result time: " + lastCheck
  }}
  vars.dummy_state = 3
  vars.enable_pagerduty = true

  if (vars.service.last_check < 1) {
    enable_active_checks = 1
  } else {
    enable_active_checks = 0
  }
  check_interval = 30m
}

milan-koudelka commented 3 years ago

Hm, no, the second option doesn't work: vars.service.last_check is not defined as I thought it would be.
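
For anyone hitting the same wall, the debug console shows where the attribute actually lives (a sketch; the connection string and the object names from the configuration above are assumptions):

# icinga2 console --connect 'https://root:icinga@localhost:5665/'
var s = get_service("stg3-connectors-vertica01.XXX", "vertica backup")
s.last_check    // last_check is an attribute of the service object itself
s.vars.service  // null: nothing is ever stored under vars.service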

Al2Klimov commented 3 years ago

Let us know once you’ve found a reasonable workaround.

Al2Klimov commented 3 years ago

And yet another suggestion – let the first freshness check be OK:

template Host "passive" {
  check_command = "dummy"

  var that = this
  vars.dummy_state = function() use(that) {
    return if (that.last_check_result) { 3 } else { 0 }
  }
}

milan-koudelka commented 3 years ago

@Al2Klimov This is super cool. That is probably what I was looking for. The first immediate dummy freshness check returns OK and then it works as usual. Thank you!
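
For reference, here is a minimal sketch merging this suggestion into the passive-service template from earlier in this issue (untested; attribute values are carried over from the configuration above):

template Service "passive-service" {
  check_command = "dummy"
  enable_passive_checks = 1
  enable_active_checks = 1
  check_interval = 30m

  var that = this
  /* The very first freshness check has no previous result and reports OK (0);
     every later freshness check reports UNKNOWN (3) for stale passive results. */
  vars.dummy_state = function() use(that) {
    return if (that.last_check_result) { 3 } else { 0 }
  }
  vars.dummy_text = "No check results received."
}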