Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.01k stars 576 forks source link

[dev.icinga.com #8905] Reloading icinga schedules (unwanted) passive checks #2848

Open icinga-migration opened 9 years ago

icinga-migration commented 9 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8905

Created by kruisdraad on 2015-03-30 12:29:24 +00:00

Assignee: (none) Status: New Target Version: (none) Last Update: 2016-11-09 14:59:55 +00:00 (in Redmine)

Icinga Version: r2.3.2-1
Backport?: Not yet backported
Include in Changelog: 1

At the moment of a icinga reload/restart it will attempt to schedule the initial check of all services. Thats fine for active checks, but passive checks report in on their own.

As the active check part is a fallback check (usually a dummy check with a WARN/CRIT value) and results in a hard state directly (3/3) you will get a LOT of alerts.

Ideally you want to 'define' those services as passive and use the normal scheduling after starting. e.g. if the check_interval is 1m, then the first check would be 1m after startup.

enable_passive_checks is not documented and might be related in the handling of this 'event'

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-04-17 15:49:23 +00:00

You are most likely encountering freshness checks afterall. Can you provide some logs and config examples showing an example in detail?

icinga-migration commented 9 years ago

Updated by kruisdraad on 2015-04-17 20:29:16 +00:00

exact copy/paste incl. all dependencies into icinga (disable defeault command set)

then passivly submit AGENT_NTP results using live status so it goes into a OK state. Then reload icinga, you will see it defirectly going into 3/3 fault hard state

passive code snippet:

template Host "generic-host" { max_check_attempts = 3 check_interval = 300s retry_interval = 300s check_command = "check_ping" vars.check_ping_rta_warn = "5000" vars.check_ping_rta_crit = "5000" vars.check_ping_loss_warn = "100%" vars.check_ping_loss_crit = "100%" vars.check_ping_packets = "1" }

template Host "passive-host" { max_check_attempts = 3 check_interval = 300s retry_interval = 300s check_command = "check_dummy" vars.check_dummy_state = "2" vars.check_dummy_text = "No AGENT alive received from host!" }

template Service "passive-service" { max_check_attempts = 3 check_interval = 300s retry_interval = 300s check_command = "check_dummy" vars.check_dummy_state = "2" vars.check_dummy_text = "No AGENT check results received from host!" }

object CheckCommand "check_ping" { import "plugin-check-command" command = [ CustomPluginDir + "/check_ping" ] arguments = { "-H" = "$check_address$" "-w" = "$check_ping_rta_warn$,$check_ping_loss_warn$" "-c" = "$check_ping_rta_crit$,$check_ping_loss_crit$" "-p" = "$check_ping_packets$" } # Plugin default variable values vars.check_ping_rta_warn = "200" vars.check_ping_rta_crit = "400" vars.check_ping_loss_warn = "25%" vars.check_ping_loss_crit = "50%" vars.check_ping_packets = "4" }

apply Service "AGENT_NTP" { import "passive-service" check_command = "check_dummy" assign where host.vars.check_agent_ntp_enabled true ignore where host.vars.check_agent_ntp_enabled false }

object CheckCommand "check_dummy" { import "plugin-check-command" command = [ CustomPluginDir + "/check_dummy" ] arguments = { "--dummy" = { value = "$check_dummy_state$", skip_key = true } "--exittext" = { value = "$check_dummy_text$", skip_key = true } } vars.check_dummy_state = "3" vars.check_dummy_text = "This is a dummy check and does not check anything!" }

object Host "xxx.local.lan" { import "generic-host" vars.os = "ubuntu" vars.platform = "cf_ascc" vars.check_address = "127.0.0.1" vars.check_address_type = "IPv4" vars.check_agent_ntp_enabled = true }

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-06-18 08:49:12 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-06-22 14:52:45 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-06-22 14:53:39 +00:00

There's a problem with freshness checking here - if active checks are disabled, freshness checks are not executed. Imho this is related to #7071 and requires a more sane and proper fix dealing with passive checks.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-04 15:50:08 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-04 16:52:54 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-09 10:53:59 +00:00

Requires a time estimate.

icinga-migration commented 8 years ago

Updated by Animux on 2016-05-13 16:16:52 +00:00

I also noticed that behavior. The problem is, that icinga2 reschedules the active checks after a reload. I tried disabling the rescheduling and it fixed this bug. I do not know the intention of this code in lib/icinga/checkable.cpp:Checkable::Start(...). Why should we reschedule a check, if it is less than 5mins from now?

The code in lib/icinga/checkable-check.cpp:Checkable::UpdateNextCheck(..) seems to try to distribute the check updates within the interval. But the calculated adj may as big as the interval and the check will be scheduled immediately. This may not be a problem for active checks, but it sure is a problem for passive checks, that should be scheduled at the arriving of the passive check results.

What should we do? We could simple remove the UpdateNextCheck call from Checkable::Start or we could do something like that:

diff --git a/lib/icinga/checkable.cpp b/lib/icinga/checkable.cpp
index 53db31d..f6bcfd2 100644
--- a/lib/icinga/checkable.cpp
+++ b/lib/icinga/checkable.cpp
@@ -44,7 +44,8 @@ void Checkable::Start(bool runtimeCreated)
 {
        double now = Utility::GetTime();

-       if (GetNextCheck() < now + 300)
+       CheckResult::Ptr cr = GetLastCheckResult();
+       if (cr != NULL && cr->GetActive() && GetNextCheck() < now + 300)
                UpdateNextCheck();

        ObjectImpl::Start(runtimeCreated);

to only reschedule the checks, if the last check result was an active check. This would fix at least the problem for "stale" passive checks on reload.

icinga-migration commented 7 years ago

Updated by mfriedrich on 2016-11-09 14:59:56 +00:00

Al2Klimov commented 5 years ago

Hello @kruisdraad,

did you create the issue? (Unfortunately we can only guess.)

If yes, is this problem still present in v2.10.2?

Best, AK

kruisdraad commented 5 years ago

@Al2Klimov yes i did make this, however i no longer operate the large monitoring setup where i can easily reproduce it. I cannot help you at this time.

Al2Klimov commented 4 years ago

We re-schedule active and freshness checks the same way – which is a bit wrong IMAO:

https://github.com/Icinga/icinga2/blob/7ed690610fedde3e503b35f01ba68b3090496ca5/lib/icinga/checkable.cpp#L66-L70

I.e. on start if the next check would happen during the next minute, re-schedule it to a random point of time during the next minute. E.g. if the next check would happen at now+59s, re-schedule it to now+1s.

@N-o-X Shall we re-schedule them not earlier than they're scheduled at the moment?

kyrias commented 4 years ago

I can confirm that it is still happening with 2.11.3, which makes #4200 even more annoying as my workaround for that was to trigger a restart whenever we create new services through the API, but that leads to constant CRITICAL services until they report in again.

kyrias commented 4 years ago

Assuming that the reason for doing this rescheduling thing is to try to prevent too many checks from being clumped together to the same time, I think something like the following would be more reasonable:

double next_check = GetNextCheck();
double delta = std::min(GetCheckInterval(), 60.0); 
delta *= (double)std::rand() / RAND_MAX; 
if (next_check < now) {
    SetNextCheck(now + delta);
} else if (next_check < now + 60) {
    SetNextCheck(next_check + delta)
}

That way we never schedule a check earlier than it was already scheduled to be run, and we somewhat prevent them from being clumped up together on restarts.

@Al2Klimov @N-o-X Would something like that be acceptable?

Al2Klimov commented 3 years ago

PING @N-o-X

Al2Klimov commented 2 years ago

That way we never schedule a check earlier than it was already scheduled to be run

Hm... why not. But let's get a second opinion...

kyrias commented 2 years ago

That way we never schedule a check earlier than it was already scheduled to be run

Hm... why not. But let's get a second opinion...

Because it's 1) never useful to do so and 2) actively breaks the whole concept of passive checks.

But I don't have to maintain an Icinga instance anymore so I no longer care about this.

julianbrost commented 2 years ago

That way we never schedule a check earlier than it was already scheduled to be run

Sounds like a good idea to me, doing so would only result in checking something more often than asked to. But I'd even go a step further: is there a reason to reschedule anything for which the next check is still in the future? I think it should be perfectly fine to keep any next check that's in the future as is and only reschedule checks that are due now so that they are spread out over some time instead of all running immediately.