Closed icinga-migration closed 14 years ago
Updated by mfriedrich on 2010-09-14 07:59:44 +00:00
removed #581 - needs a more sophisticated fix/patch with proper testing.
Updated by mfriedrich on 2010-09-15 09:25:24 +00:00
[10:25:57] i was thinking about http://tracker.nagios.org/view.php?id=152 a bit
[10:27:03] setting run_event to true again when a servicecheck gets re-scheduled should be better
[10:27:15] so that a hostcheck afterwards will get executed
[10:27:50] state machine fidling is always funny
[10:29:58] the patch was the reason why you cannot disable active checks at the moment?
[10:30:20] yep
[10:30:28] i think so
[10:30:47] its reverted in cvs already, but i think the release was a bit too fast
[10:30:48] because run_event is always true in the middle of that
[10:31:54] the states are servicecheck => hostcheck
[10:32:09] if servicechecks are disabled, they get rescheduled, and run_event remains false
[10:32:40] it stays like that for upcoming hostcheck, which will then also being rescheduled isntead of being checked in case
[10:36:46] if servicecheck was set, and even if it was set to run_event=false, this causes a reschedule of the check and removing from event queue
[10:36:55] afterwards, run_event is set to true,
[10:37:04] the hostcheck is being skipped
[10:37:11] as it's a servicecheck
[10:37:21] run_event==TRUE matches
[10:37:27] so it gets executed
[10:37:43] base/events.c:1133
[10:38:12] so no matter what you do in the servicecheck condition it will always get executed
[10:38:19] hilarous fuckup
[10:38:27] i guess, noone will reject a patch ;)
[10:38:43] probably not
[10:39:15] need to check on that though
[10:41:58] setting run_event to true when rescheduling is also bad because it then hits directly the run_event condition
i would consider a logical rewrite of the state machine in order to run events straight from host or servicechecks, removing if(run_event==TRUE){
Updated by mfriedrich on 2010-09-18 11:38:18 +00:00
File added 0001-Hopefully-more-accurate-patch-for-maintaining-the-co.patch
-------- Original Message -------- Subject: Re: [Nagios-devel] Unable to stop executing checks with 3.2.2 Date: Fri, 17 Sep 2010 19:25:40 +0100 From: Stephen Gran Reply-To: Nagios Developers List To: nagios-devel@lists.sourceforge.net
On Fri, Sep 10, 2010 at 02:21:50PM +0100, Ton Voon said:
On 10 Sep 2010, at 13:51, Sven Nierlein wrote:
I changed the max_parallel_service_checks to 2 and svc2 is still not getting rescheduled.
So can anyone confirm if either the current code:
- causes breakage or
- means that the test is now invalid and svc2->next_check should == now
Ok, then i can confirm that current code breaks disabling active
host/service checks and the test is still valid.I think this is sufficient to remove the change. Done in CVS now. The
test passes on my system now, so will see what the other tinerbox
builds say tomorrow.
Hello,
I think the attached patch addresses the problem slightly more accurately than the original patch, logic in the attachment. If it's unclear or outright wrong, please let me know.
Updated by mfriedrich on 2010-09-20 08:22:49 +00:00
Updated by mfriedrich on 2010-09-20 08:24:57 +00:00
-------- Original Message --------
Subject: Re: [Nagios-devel] Unable to stop executing checks with 3.2.2
Date: Sun, 19 Sep 2010 11:53:08 +0200
From: Michael Friedrich
Reply-To: Nagios Developers List
To: nagios-devel@lists.sourceforge.net
Hi,
On 2010-09-18 19:57, Ton Voon wrote:
>
> The test_events.c passes with this change. However, it would be best
> if a testcase could be written for the problem this is trying to
> solve, which fails without the patch and passes with the patch. This
> will ensure the problem will continue to get visibility in future.
>
> If you create that testcase, I'd be more than happy to apply.
>
The patch addresses one of the proposed ways to fix the problem like
mentioned in the issue - http://tracker.nagios.org/view.php?id=152
"Possible solutions would be to reset run_events to TRUE after removing an
event from the queue, reset run_events to TRUE upon entering each "if"
block, or make the "if" into an "if else"."
The firstly committed patch had nothing to do with setting run_event the
proper way and resolving the issue. It was a bit screwed and misses some
logical explanation why it was done this way.
Either way, Stephen's solution will work as it breaks up the state
machine and makes sure changing the queue also affects the execution
during the while(1) loop.
Take this event queue
hc1 hc2 sc1 hc3 sc2 hc4 sc3 sc4 sc5
servicechecks for sc2 are disabled.
Looping the events happens like this, following the algorithm of the
original source:
1/ hc1 triggered, no servicecheck condition matched. run_event==TRUE,
hc1 gets executed in case not being disabled
2/ hc2 same as hc1, run_event is reset to TRUE at the beginning of the
loop with low prio event
3/ sc1 will be found during checking for servicecheck event. sc1
run_event will be TRUE as the checks are not disabled. also it won't get
removed from the queue. so
event_list_low->event_type==EVENT_HOST_CHECK
will not match afterwards.
4/ hc4 will be executed as it was like the rest above.
5/ sc2 is our special service check, which is being disabled. run_event starts with TRUE, then detection of service check event is true, but servicechecks are disabled.
So run_event changes to FALSE, and the event gets re-eschuled, removed from current event queue. Next is the check for hostcheck event, because it's a same priority "IF" and not next-time "ELSE IF".
In this case, hc4 is already visible to the queue checker, and the condition for a hostcheck event MATCHES, but having run_event on FALSE from before.
So in this case we won't have 6/ with hc4, but this happens all in 5/
hc4 is being checked for run_event==FALSE (which originates from sc2), and heureka, this is FALSE and hc4 as a host check event does not get executed too - but re-scheduled!
===
By taking Stephen's Patch, the mentioned ELSE IF will match whether service OR host check event. Currently it's an AND for matching.
Having one looping for those check events will make sure, they both won't disturb each other in detecting if the are
- disabled
- should be reschuled
- should be run
From my point of view, the logic can be rewritten too by removing that run_event flag and just putting the execution calls to where run_event was TRUE previously. But I haven't tried this time as there is more testing needed.
I would consider Stephen's patch as the patch for the buggy patch.
Kind regards,
Michael
-------- Original Message --------
Subject: Re: [Nagios-devel] Unable to stop executing checks with 3.2.2
Date: Sun, 19 Sep 2010 11:51:24 +0100
From: Stephen Gran
Reply-To: Nagios Developers List
To: nagios-devel@lists.sourceforge.net
On Sat, Sep 18, 2010 at 06:57:54PM +0100, Ton Voon said:
> The test_events.c passes with this change. However, it would be best
> if a testcase could be written for the problem this is trying to
> solve, which fails without the patch and passes with the patch. This
> will ensure the problem will continue to get visibility in future.
>
> If you create that testcase, I'd be more than happy to apply.
So I've been looking at the various test cases you already have, and I'm
not sure if it's going to be simpler to try to plug it into test_events.c
or create a new one, so I'm going to outline my thoughts and ask for a
recommendation.
The things that this patch addresses and should be tested are (in pseudo
code):
if execute_service_checks == 1 and execute_host_checks == 1:
failed service_check schedules a host_check and executes it
service_check is rescheduled
else if execute_service_checks == 0 and execute_host_checks == 1:
service_check is rescheduled instead of executed
host_check is scheduled at it's normal interval and executed
else if execute_service_checks == 1 and execute_host_checks == 0:
failed service_check schedules a host_check
host_check is rescheduled instead of run
service_check is rescheduled
else if execute_service_checks == 0 and execute_host_checks == 0:
a service_check is rescheduled
a host_check is rescheduled
That covers my understanding of how the scheduling should work.
Any advice on where to put that? I admit at first glance I am slightly
daunted by the prospect of learning your test harness for the logic
above, but I'm willing to try.
Cheers,
-------- Original Message --------
Subject: Re: [Nagios-devel] Unable to stop executing checks with 3.2.2
Date: Mon, 20 Sep 2010 08:51:32 +0100
From: Ton Voon
Reply-To: Nagios Developers List
To: Nagios Developers List
On 19 Sep 2010, at 11:51, Stephen Gran wrote:
> On Sat, Sep 18, 2010 at 06:57:54PM +0100, Ton Voon said:
>> The test_events.c passes with this change. However, it would be best
>> if a testcase could be written for the problem this is trying to
>> solve, which fails without the patch and passes with the patch. This
>> will ensure the problem will continue to get visibility in future.
>>
>> If you create that testcase, I'd be more than happy to apply.
>
> So I've been looking at the various test cases you already have, and
> I'm
> not sure if it's going to be simpler to try to plug it into
> test_events.c
> or create a new one
I don't mind as long as the test is written in a way that can be rerun
to confirm success in TAP output.
More docs at http://wiki.nagios.org/index.php/Nagios_Core_Developer_Guidelines#Testing
However, I'm guessing you'll find it easier to add more tests to
test_events.c than to create a whole new test_events2.c.
> , so I'm going to outline my thoughts and ask for a
> recommendation.
>
> The things that this patch addresses and should be tested are (in
> pseudo
> code):
>
> if execute_service_checks == 1 and execute_host_checks == 1:
> failed service_check schedules a host_check and executes it
> service_check is rescheduled
>
> else if execute_service_checks == 0 and execute_host_checks == 1:
> service_check is rescheduled instead of executed
> host_check is scheduled at it's normal interval and executed
>
> else if execute_service_checks == 1 and execute_host_checks == 0:
> failed service_check schedules a host_check
> host_check is rescheduled instead of run
> service_check is rescheduled
>
> else if execute_service_checks == 0 and execute_host_checks == 0:
> a service_check is rescheduled
> a host_check is rescheduled
>
> That covers my understanding of how the scheduling should work.
This looks about right though, to be honest, I've never delved deeply
into this logic so I can't verify.
>
> Any advice on where to put that? I admit at first glance I am
> slightly
> daunted by the prospect of learning your test harness for the logic
> above, but I'm willing to try.
The first thing is: can you get the test_events C code to execute
without failure?
The 2nd is to code your scenarios above. The tricky thing here is that
you have to setup all the appropriate variables, then execute
event_execution_loop(). As you success criteria is based on whether
certain events were executed or not, you could probably work this out
by checking the last_check and next_check for specific host/services
afterwards.
I look forward to see those tests!
Ton
Updated by mfriedrich on 2010-09-20 15:32:53 +00:00
i've added a bit more debugoutput especially for this case in order to see when host and servicecheck happen by name (same loop cycle)
and, there's the result with currently, unpatched release:
[1284996509.248159] [008.1] [pid=26787] ** Event Check Loop
[1284996509.248166] [008.1] [pid=26787] Next High Priority Event Time: Mon Sep 20 17:28:34 2010
[1284996509.248171] [008.1] [pid=26787] Next Low Priority Event Time: Mon Sep 20 17:28:29 2010
[1284996509.248174] [008.1] [pid=26787] Current/Max Service Checks: 0/0
[1284996509.248179] [024.1] [pid=26787] Run a few checks before executing a service check for 'test_ok_16'.
[1284996509.248182] [024.1] [pid=26787] We're not executing service checks right now, so we'll skip this event.
[1284996509.248185] [024.1] [pid=26787] Skip event, removing service 'test_ok_16' from list.
[1284996509.248231] [024.1] [pid=26787] Run a few checks before executing a host check for 'test_host_071'.
[1284996509.248234] [024.1] [pid=26787] Skip event, removing host 'test_host_071' from list.
[1284996509.248307] [008.1] [pid=26787] ** Event Check Loop
[1284996509.248314] [008.1] [pid=26787] Next High Priority Event Time: Mon Sep 20 17:28:34 2010
[1284996509.248319] [008.1] [pid=26787] Next Low Priority Event Time: Mon Sep 20 17:28:57 2010
[1284996509.248322] [008.1] [pid=26787] Current/Max Service Checks: 0/0
test_ok_16 service is disabled globally, being rescheduled and removed from event queue. afterwards, it should go straight into next looping since nothing more to be done. but it doesn't next in line is test_host_071 which gets skipped too.
Updated by mfriedrich on 2010-09-21 06:16:14 +00:00
with the patch being applied, the event looping is doing just fine.
[1285049507.073382] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.073390] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:47 2010
[1285049507.073396] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049507.073403] [024.1] [pid=421] Run a few checks before executing a service check for 'test_ok_19'.
[1285049507.073409] [024.1] [pid=421] We're not executing service checks right now, so we'll skip this event.
[1285049507.073415] [024.1] [pid=421] Skip event, removing service 'test_ok_19' from list.
[1285049507.073501] [008.1] [pid=421] ** Event Check Loop
[1285049507.073509] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.073519] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:47 2010
[1285049507.073524] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049507.073532] [024.1] [pid=421] Run a few checks before executing a service check for 'test_ok_19'.
[1285049507.073538] [024.1] [pid=421] We're not executing service checks right now, so we'll skip this event.
[1285049507.073544] [024.1] [pid=421] Skip event, removing service 'test_ok_19' from list.
[1285049507.073653] [008.1] [pid=421] ** Event Check Loop
[1285049507.073662] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.073671] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.073677] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049507.325743] [008.1] [pid=421] ** Event Check Loop
[1285049507.325816] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.325829] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.325864] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049507.577854] [008.1] [pid=421] ** Event Check Loop
[1285049507.577936] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.577951] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.577958] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049507.829890] [008.1] [pid=421] ** Event Check Loop
[1285049507.829957] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.829969] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049507.829976] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049508.080888] [008.1] [pid=421] ** Event Check Loop
[1285049508.080961] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.080973] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.080979] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049508.331897] [008.1] [pid=421] ** Event Check Loop
[1285049508.331965] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.331979] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.331986] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049508.582894] [008.1] [pid=421] ** Event Check Loop
[1285049508.582967] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.582980] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.582986] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049508.833931] [008.1] [pid=421] ** Event Check Loop
[1285049508.833993] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.834005] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049508.834012] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049509.085932] [008.1] [pid=421] ** Event Check Loop
[1285049509.086015] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049509.086030] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049509.086036] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049509.086072] [008.0] [pid=421] ** Timed Event ** Type: 5, Run Time: Tue Sep 21 08:11:49 2010
[1285049509.086079] [008.0] [pid=421] ** Check Result Reaper
[1285049509.095695] [008.1] [pid=421] ** Event Check Loop
[1285049509.095709] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:59 2010
[1285049509.095721] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:11:49 2010
[1285049509.095727] [008.1] [pid=421] Current/Max Service Checks: 0/0
[1285049509.095736] [024.1] [pid=421] Run a few checks before executing a host check for 'test_host_096'.
[1285049509.095743] [008.1] [pid=421] Running event...
[1285049509.095758] [008.0] [pid=421] ** Timed Event ** Type: 12, Run Time: Tue Sep 21 08:11:49 2010
[1285049509.095764] [008.0] [pid=421] ** Host Check Event ==> Host: 'test_host_096', Options: 0, Latency: 0.095000 sec
[1285049509.099893] [008.1] [pid=421] ** Event Check Loop
[1285049509.099949] [008.1] [pid=421] Next High Priority Event Time: Tue Sep 21 08:11:59 2010
[1285049509.099961] [008.1] [pid=421] Next Low Priority Event Time: Tue Sep 21 08:12:11 2010
[1285049509.099967] [008.1] [pid=421] Current/Max Service Checks: 0/0
Updated by mfriedrich on 2010-09-21 11:29:31 +00:00
test cases from Stephen:
-------- Original Message --------
Subject: Re: [Nagios-devel] Unable to stop executing checks with 3.2.2
Date: Tue, 21 Sep 2010 00:22:36 +0100
From: Stephen Gran
Reply-To: Nagios Developers List
To: nagios-devel@lists.sourceforge.net
On Mon, Sep 20, 2010 at 08:51:32AM +0100, Ton Voon said:
> The first thing is: can you get the test_events C code to execute
> without failure?
>
> The 2nd is to code your scenarios above. The tricky thing here is that
> you have to setup all the appropriate variables, then execute
> event_execution_loop(). As you success criteria is based on whether
> certain events were executed or not, you could probably work this out
> by checking the last_check and next_check for specific host/services
> afterwards.
>
> I look forward to see those tests!
Hi there,
This is my simple minded attempt at matching the originally reported
failure case (host checks being skipped when service checks disabled).
It fails without my patch and passes with my patch. Hopefully it's
clear enough - please ask for clarification if not.
Cheers,
Updated by mfriedrich on 2010-09-21 13:05:21 +00:00
Applied in changeset commit:"7bfd6520a55a7e7125dd642268ad9cd8b372429a".
This issue has been migrated from Redmine: https://dev.icinga.com/issues/778
Created by mfriedrich on 2010-09-10 18:14:55 +00:00
Assignee: mfriedrich Status: Resolved (closed on 2010-09-21 13:05:21 +00:00) Target Version: 1.2 (Stable) Last Update: 2010-09-21 13:05:21 +00:00 (in Redmine)
http://sourceforge.net/mailarchive/message.php?msg\_name=4C8A14D8.3030903%40consol.de
Attachments
Changesets
2010-09-21 11:39:21 +00:00 by mfriedrich 7bfd6520a55a7e7125dd642268ad9cd8b372429a
Relations: