Updated by mfriedrich on 2015-04-17 15:49:23 +00:00
You are most likely encountering freshness checks after all. Can you provide some logs and config examples showing an example in detail?
Updated by kruisdraad on 2015-04-17 20:29:16 +00:00
Exact copy/paste incl. all dependencies into Icinga (disable the default command set),
then passively submit AGENT_NTP results using Livestatus so it goes into an OK state. Then reload Icinga; you will see it directly going into a 3/3 hard fault state.
passive code snippet:
template Host "generic-host" { max_check_attempts = 3 check_interval = 300s retry_interval = 300s check_command = "check_ping" vars.check_ping_rta_warn = "5000" vars.check_ping_rta_crit = "5000" vars.check_ping_loss_warn = "100%" vars.check_ping_loss_crit = "100%" vars.check_ping_packets = "1" }
template Host "passive-host" { max_check_attempts = 3 check_interval = 300s retry_interval = 300s check_command = "check_dummy" vars.check_dummy_state = "2" vars.check_dummy_text = "No AGENT alive received from host!" }
template Service "passive-service" { max_check_attempts = 3 check_interval = 300s retry_interval = 300s check_command = "check_dummy" vars.check_dummy_state = "2" vars.check_dummy_text = "No AGENT check results received from host!" }
object CheckCommand "check_ping" { import "plugin-check-command" command = [ CustomPluginDir + "/check_ping" ] arguments = { "-H" = "$check_address$" "-w" = "$check_ping_rta_warn$,$check_ping_loss_warn$" "-c" = "$check_ping_rta_crit$,$check_ping_loss_crit$" "-p" = "$check_ping_packets$" } # Plugin default variable values vars.check_ping_rta_warn = "200" vars.check_ping_rta_crit = "400" vars.check_ping_loss_warn = "25%" vars.check_ping_loss_crit = "50%" vars.check_ping_packets = "4" }
apply Service "AGENT_NTP" { import "passive-service" check_command = "check_dummy" assign where host.vars.check_agent_ntp_enabled true ignore where host.vars.check_agent_ntp_enabled false }
object CheckCommand "check_dummy" { import "plugin-check-command" command = [ CustomPluginDir + "/check_dummy" ] arguments = { "--dummy" = { value = "$check_dummy_state$", skip_key = true } "--exittext" = { value = "$check_dummy_text$", skip_key = true } } vars.check_dummy_state = "3" vars.check_dummy_text = "This is a dummy check and does not check anything!" }
object Host "xxx.local.lan" { import "generic-host" vars.os = "ubuntu" vars.platform = "cf_ascc" vars.check_address = "127.0.0.1" vars.check_address_type = "IPv4" vars.check_agent_ntp_enabled = true }
Updated by mfriedrich on 2015-06-18 08:49:12 +00:00
Updated by mfriedrich on 2015-06-22 14:52:45 +00:00
Updated by mfriedrich on 2015-06-22 14:53:39 +00:00
There's a problem with freshness checking here - if active checks are disabled, freshness checks are not executed. Imho this is related to #7071 and requires a more sane and proper fix dealing with passive checks.
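A hedged sketch of the kind of coupling described above (an assumption about the cause, not the actual Icinga 2 freshness code): if the freshness pass skips every object whose active checks are disabled, purely passive objects are never examined and therefore never go stale.

#include <vector>

struct MonitoredObject {
    bool enableActiveChecks;
    double lastResultTime;
};

// Hedged sketch, not the real implementation: a freshness pass that skips
// objects with active checks disabled never looks at purely passive
// objects, so their results are never considered stale.
void FreshnessPass(std::vector<MonitoredObject>& objects, double now, double threshold)
{
    for (auto& obj : objects) {
        if (!obj.enableActiveChecks)
            continue; // passive-only objects are silently skipped here

        if (now - obj.lastResultTime > threshold) {
            // ... this is where the configured fallback check would be triggered ...
        }
    }
}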
Updated by mfriedrich on 2016-03-04 15:50:08 +00:00
Updated by mfriedrich on 2016-03-04 16:52:54 +00:00
Updated by mfriedrich on 2016-03-09 10:53:59 +00:00
Requires a time estimate.
Updated by Animux on 2016-05-13 16:16:52 +00:00
I also noticed that behavior. The problem is that icinga2 reschedules the active checks after a reload. I tried disabling the rescheduling and it fixed this bug. I do not know the intention of this code in lib/icinga/checkable.cpp:Checkable::Start(...). Why should we reschedule a check if it is less than 5 minutes from now?
The code in lib/icinga/checkable-check.cpp:Checkable::UpdateNextCheck(..) seems to try to distribute the check updates within the interval. But the calculated adj may be as big as the interval, and then the check will be scheduled immediately. This may not be a problem for active checks, but it certainly is a problem for passive checks, which should be scheduled when the passive check results arrive.
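To make this concrete, here is a minimal standalone sketch (my own illustration of the splay logic described above, not the actual UpdateNextCheck() implementation): when the random adjustment comes out close to the full interval, the next check lands at roughly "now" and is effectively scheduled immediately.

#include <cstdlib>
#include <iostream>

// Hedged sketch of interval-based splaying, not Icinga 2's real code.
// Returns the next check time for a given "now" and check interval.
double SplayNextCheck(double now, double interval)
{
    double adj = interval * (double(std::rand()) / RAND_MAX); // may approach "interval"
    return now + interval - adj; // adj ~ interval => next check ~ now
}

int main()
{
    for (int i = 0; i < 5; i++)
        std::cout << "next check in " << SplayNextCheck(0.0, 300.0) << "s\n";
}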
What should we do? We could simply remove the UpdateNextCheck call from Checkable::Start, or we could do something like this:
diff --git a/lib/icinga/checkable.cpp b/lib/icinga/checkable.cpp
index 53db31d..f6bcfd2 100644
--- a/lib/icinga/checkable.cpp
+++ b/lib/icinga/checkable.cpp
@@ -44,7 +44,8 @@ void Checkable::Start(bool runtimeCreated)
 {
     double now = Utility::GetTime();

-    if (GetNextCheck() < now + 300)
+    CheckResult::Ptr cr = GetLastCheckResult();
+    if (cr != NULL && cr->GetActive() && GetNextCheck() < now + 300)
         UpdateNextCheck();

     ObjectImpl<Checkable>::Start(runtimeCreated);
to only reschedule the checks if the last check result was an active check. This would at least fix the problem of "stale" passive checks on reload.
Updated by mfriedrich on 2016-11-09 14:59:56 +00:00
Hello @kruisdraad,
did you create the issue? (Unfortunately we can only guess.)
If yes, is this problem still present in v2.10.2?
Best, AK
@Al2Klimov Yes, I did make this; however, I no longer operate the large monitoring setup where I could easily reproduce it. I cannot help you at this time.
We re-schedule active and freshness checks the same way – which is a bit wrong IMAO:
I.e., on start, if the next check would happen during the next minute, re-schedule it to a random point in time during the next minute. E.g., if the next check would happen at now+59s, it may be re-scheduled to now+1s.
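For illustration, a minimal standalone sketch of the behaviour just described (my own approximation, not the actual Icinga 2 code path): anything due within the next minute is moved to a random point within the next minute, so now+59s can indeed become now+1s.

#include <cstdlib>

// Hedged sketch, not the real implementation: on start, reschedule any
// check that is due within the next minute to a random point within the
// next minute.
double RescheduleOnStart(double now, double nextCheck)
{
    if (nextCheck < now + 60) {
        double splay = 60.0 * (double(std::rand()) / RAND_MAX);
        return now + splay; // can be earlier than nextCheck, e.g. now+1s
    }

    return nextCheck; // otherwise keep the existing schedule
}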
@N-o-X Shall we re-schedule them not earlier than they're scheduled at the moment?
I can confirm that this is still happening with 2.11.3, which makes #4200 even more annoying: my workaround for that was to trigger a restart whenever we create new services through the API, but that leads to constant CRITICAL services until they report in again.
Assuming that the reason for this rescheduling is to prevent too many checks from being clumped together at the same time, I think something like the following would be more reasonable:
double now = Utility::GetTime();
double next_check = GetNextCheck();

// Splay by a random fraction of the check interval, capped at 60 seconds.
double delta = std::min(GetCheckInterval(), 60.0);
delta *= (double)std::rand() / RAND_MAX;

if (next_check < now) {
    // Overdue check: run it soon, but spread it out a little.
    SetNextCheck(now + delta);
} else if (next_check < now + 60) {
    // Due within the next minute: only ever push it back, never pull it forward.
    SetNextCheck(next_check + delta);
}
That way we never schedule a check earlier than it was already scheduled to be run, and we somewhat prevent them from being clumped up together on restarts.
@Al2Klimov @N-o-X Would something like that be acceptable?
PING @N-o-X
That way we never schedule a check earlier than it was already scheduled to be run
Hm... why not. But let's get a second opinion...
Because it's 1) never useful to do so and 2) actively breaks the whole concept of passive checks.
But I don't have to maintain an Icinga instance anymore so I no longer care about this.
That way we never schedule a check earlier than it was already scheduled to be run
Sounds like a good idea to me; scheduling a check earlier than planned would only result in checking something more often than asked for. But I'd go even a step further: is there a reason to reschedule anything for which the next check is still in the future? I think it should be perfectly fine to keep any next check that's in the future as is, and only reschedule checks that are due now so that they are spread out over some time instead of all running immediately.
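A minimal standalone sketch of that idea (my own illustration, assuming a 60-second splay window; not an actual Icinga 2 patch): leave anything scheduled for the future untouched and only spread out checks that are already due.

#include <algorithm>
#include <cstdlib>

// Hedged sketch: keep future checks as-is; splay only checks that are due
// now so they do not all fire immediately after a restart.
double RescheduleDueChecksOnly(double now, double nextCheck, double checkInterval)
{
    if (nextCheck > now)
        return nextCheck; // still in the future: keep the existing schedule

    double window = std::min(checkInterval, 60.0); // assumed splay window
    double splay = window * (double(std::rand()) / RAND_MAX);
    return now + splay; // overdue: run soon, but spread out over the window
}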
This issue has been migrated from Redmine: https://dev.icinga.com/issues/8905
Created by kruisdraad on 2015-03-30 12:29:24 +00:00
Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-11-09 14:59:55 +00:00 (in Redmine)
At the moment of an Icinga reload/restart it will attempt to schedule the initial check of all services. That's fine for active checks, but passive checks report in on their own.
As the active check part is a fallback check (usually a dummy check with a WARN/CRIT value) and results in a hard state directly (3/3), you will get a LOT of alerts.
Ideally you want to 'define' those services as passive and use the normal scheduling after starting, e.g. if the check_interval is 1m, then the first check would be 1m after startup.
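A minimal standalone sketch of that requested behaviour (my own illustration of what it could look like, not an actual Icinga 2 change): on startup, a service whose last result came in passively gets a full check_interval before the active fallback check may fire.

// Hedged sketch, not an actual Icinga 2 patch: pick the first check time
// on startup depending on whether the service is fed passively.
double FirstCheckAfterStartup(double now, double checkInterval, bool lastResultWasPassive)
{
    if (lastResultWasPassive)
        return now + checkInterval; // e.g. check_interval = 1m => first fallback check 1m after startup

    return now; // active services may be scheduled right away
}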
enable_passive_checks is not documented and might be related to the handling of this 'event'.