Closed icinga-migration closed 6 years ago
Updated by mckslim on 2014-09-29 15:34:51 +00:00
Is anyone out there able to look at my problem?
Updated by mfriedrich on 2014-10-24 22:24:16 +00:00
Sorry, but the core dev team is currently bound to Icinga 2 and its upcoming 2.2 release. It won't help much to keep asking when someone will look into it - currently no one will, so you might give it a go on your own within the source code.
I suspect that #4494 is a similar issue, which was rejected a while ago as working as designed.
Updated by mfriedrich on 2014-10-24 22:25:54 +00:00
Updated by mfriedrich on 2014-10-24 22:26:23 +00:00
Updated by mfriedrich on 2014-10-24 22:27:26 +00:00
Updated by mckslim on 2014-10-24 23:09:05 +00:00
dnsmichi wrote:
Sorry, but the core dev team is currently bound to Icinga 2 and its upcoming 2.2 release. It won't help much to keep asking when someone will look into it - currently no one will, so you might give it a go on your own within the source code.
I suspect that #4494 is a similar issue, which was rejected a while ago as working as designed.
Thanks for the reply. We'll be patient and hope you can get to this bug in the future. On #4494, that is the same basic issue; in that case I was putting the host into passive mode to try to recreate the problem on the fly more easily, and you said it was working as designed there. But this 7277 report covers the issue under normal processing, not passive checking, so it is more valid as a real bug.
I see the other bugs you referenced as possibly related. I don't know if I can figure things out from there. It would certainly be great if any of you Icinga developers could still figure it out, when possible. Thanks again.
Updated by mfriedrich on 2014-10-24 23:34:46 +00:00
I currently have no idea how to fix that race condition, since the 1.x reload isn't really one compared to how apache2 or icinga2 do it. A SIGHUP signal stops the entire process (event queue), re-reads the configuration, and then starts to do checks again. If a currently executing check in its child process returns within that small time window, it writes the checkresult to the spooldir, where the reaper then picks the result up in a defined polling interval. Meanwhile the reload operation must finish, so the Icinga core has no information at that point that the previously scheduled check returned. My guess is that it's impossible to store/control what to do with child processes still running during a reload - killing them in favor of the reload action isn't an option, and neither is waiting for them to return (imagine long-lasting checks).
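To make the timing concrete, here's a heavily simplified, hypothetical C sketch of that reload sequence (not the actual core code; the real core uses the sigrestart/sigshutdown flags shown in the snippets further down):

/* simplified sketch of the 1.x reload race - illustration only */
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t sigrestart = 0;

/* SIGHUP only sets a flag; the event loop reacts to it later */
static void handle_sighup(int sig) {
    (void)sig;
    sigrestart = 1;
}

int main(void) {
    signal(SIGHUP, handle_sighup);

    for (;;) {
        /* 1. the event loop schedules checks and forks plugin child processes */
        /* 2. SIGHUP arrives -> sigrestart = 1 */
        if (sigrestart) {
            printf("stopping event queue, re-reading configuration...\n");
            /* 3. on large setups this takes minutes; children forked in
             *    step 1 keep running and write their results into the
             *    checkresult spool directory in the meantime */
            sigrestart = 0;
            printf("event loop started again\n");
            /* 4. the reaper now finds a backlog of results that were
             *    produced against the pre-reload state */
        }
        break; /* keep the sketch finite */
    }
    return 0;
}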
Having said that, the code inherited from Nagios as a fork has many flaws which are either unfixable because they are broken by design, or remain bugs for a long time because they require long-term debugging and the fixes are not trivial in that scope. That's something at least I am not willing to do in my spare time. Icinga 2 is more fun to play with, and that's it for now. The reload there is not broken and works in a reasonably parallel manner. Since you requested feedback twice, I thought I'd give you an update (it's nearly 2am here), but in the end my todo list says otherwise for the next weeks. I won't take this bug into account for 1.12 either, unless someone provides a fully tested patch in the next 2 weeks.
Updated by mfriedrich on 2014-10-26 20:01:39 +00:00
One thought on the reload and checkresults:
case EVENT_CHECK_REAPER:

    log_debug_info(DEBUGL_EVENTS, 0, "** Check Result Reaper\n");

    /* reap host and service check results */
    reap_check_results();
    break;
It will break out of its operation when sigrestart == TRUE:
int reap_check_results(void) {
    ...
    /* read all check results that have come in... */
    while ((queued_check_result = read_check_result())) {
        ...
        /* bail out if we encountered a signal */
        if (sigshutdown == TRUE || sigrestart == TRUE) {
            log_debug_info(DEBUGL_CHECKS, 0, "Breaking out of check result reaper: signal encountered\n");
            break;
        }
    }

    log_debug_info(DEBUGL_CHECKS, 0, "Finished reaping %d check results\n", reaped_checks);
    log_debug_info(DEBUGL_FUNCTIONS, 0, "reap_check_results() end\n");

    return OK;
}
Generally speaking, that function first calls process_check_result_queue() to read all files into core memory. process_check_result_file() reads each file, and add_check_result_to_list(new_cr); puts it into the list. The reaper then processes that list.
If a restart signal comes in, processing is interrupted. The checkresult reaper only continues its operation within the next reaper event, which means that the list might have grown with new checkresults by then.
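As a rough illustration of that flow (heavily simplified types and signatures - the real check_result struct, process_check_result_queue(), process_check_result_file() and add_check_result_to_list() carry far more state, and make_result() is just a hypothetical helper for this sketch):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct check_result {
    char host_name[64];
    int  return_code;
    struct check_result *next;
} check_result;

static check_result *check_result_list = NULL;   /* in-memory result list */
static volatile sig_atomic_t sigrestart = 0;     /* set on SIGHUP          */

/* roughly what happens per spool file entry: queue the result in the list */
static void add_check_result_to_list(check_result *new_cr) {
    new_cr->next = check_result_list;
    check_result_list = new_cr;
}

/* hypothetical helper: build a minimal result for the sketch */
static check_result *make_result(const char *host, int rc) {
    check_result *cr = calloc(1, sizeof(*cr));
    strncpy(cr->host_name, host, sizeof(cr->host_name) - 1);
    cr->return_code = rc;
    return cr;
}

/* reaper: pops results one by one; a restart signal breaks out early and
 * leaves the rest (plus anything added meanwhile) for the next reaper event */
static void reap_check_results(void) {
    while (check_result_list != NULL) {
        check_result *cr = check_result_list;
        check_result_list = cr->next;
        printf("processing result for %s (rc=%d)\n", cr->host_name, cr->return_code);
        free(cr);
        if (sigrestart) {
            printf("breaking out of reaper: restart signal\n");
            break;
        }
    }
}

int main(void) {
    add_check_result_to_list(make_result("hostA", 2));
    add_check_result_to_list(make_result("hostB", 2));
    sigrestart = 1;          /* simulate a SIGHUP arriving mid-reap */
    reap_check_results();    /* processes one result, then breaks   */
    printf("results still queued: %s\n", check_result_list ? "yes" : "no");
    return 0;
}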
There could be some limiting settings involved here.
I'm not sure if that's the real issue, but it could be a problem in terms of processing checkresults the proper way. Icinga 1.x cannot be changed to read checkresults into memory and directly update the host status; that's one of the reasons for Icinga 2, where this is properly implemented.
For further insights, please attach your icinga.cfg and a debug log with at least the following levels enabled:
# 4 = Process information
# 8 = Scheduled events
# 16 = Host/service checks
and add the following details
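For reference, a minimal icinga.cfg sketch for those debug levels, assuming the stock debug options inherited from Nagios (the file path and size limit are just examples):

# debug_level is a bitmask: 4 + 8 + 16 = 28 enables process information,
# scheduled events and host/service checks
debug_level=28
debug_verbosity=1
debug_file=/usr/local/icinga/var/icinga.debug
max_debug_file_size=1000000000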
Updated by mfriedrich on 2014-10-26 20:03:14 +00:00
Updated by mckslim on 2014-10-29 21:10:57 +00:00
OK, I enabled the debug log with this in 'icinga.log':
I have a debug log with the problem in it - what part do you want to see?
and ...
System health:
top - 21:09:43 up 3 days, 20:17, 1 user, load average: 0.46, 0.55, 0.70
Tasks: 429 total, 2 running, 427 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.5%us, 1.2%sy, 0.0%ni, 94.9%id, 0.1%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 41157752k total, 8310168k used, 32847584k free, 1610464k buffers
Swap: 4192956k total, 0k used, 4192956k free, 4542864k cached
Updated by mckslim on 2014-10-29 21:16:25 +00:00
Here are the relevant logs of the problem. See the attempt numbers before the reload (which occurs at 16:02), and then the numbers after the reload in the debug log - the attempts jump to the maximum for some reason, the state type therefore goes HARD, and notifications are sent.
icinga.log:
...
[2014-10-29 15:55:47] HOST ALERT: SA0000000738235;DOWN;SOFT;1;CRITICAL: Connection timed out to '77.221.97.61' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 1 of 4.
[2014-10-29 15:57:25] HOST ALERT: P00000008021520;DOWN;SOFT;9;CRITICAL: Connection timed out to '10.2.58.230' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 9 of 19. . Traceroute completed, increased max_attempts to 19.
...
(16:02:03 reload, 16:05:14 event loop started)
...
[2014-10-29 16:05:15] HOST ALERT: SA0000000738235;DOWN;HARD;4;CRITICAL: Connection timed out to '77.221.97.61' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 2 of 4.
[2014-10-29 16:05:15] HOST NOTIFICATION: soc;SA0000000738235;DOWN;cpe_outage_ticket;CRITICAL: Connection timed out to 77.221.97.61 after 160 seconds (user ibm_hlth_chk). Expected prompt not found. Last output was . Attempt 2 of 4.
...
[2014-10-29 16:05:17] HOST ALERT: P00000008021520;DOWN;HARD;19;CRITICAL: Connection timed out to '10.2.58.230' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 10 of 19. . Traceroute completed, increased max_attempts to 19.
[2014-10-29 16:05:17] HOST NOTIFICATION: soc;P00000008021520;DOWN;cpe_outage_ticket;CRITICAL: Connection timed out to 10.2.58.230 after 160 seconds (user ibm_hlth_chk). Expected prompt not found. Last output was . Attempt 10 of 19. . Traceroute completed, increased max_attempts to 19.
...
icinga.debug.log:
...
[2014-10-29 15:55:47.393591] [016.1] [pid=6205] HOST: SA0000000738235, ATTEMPT=1/4, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=1
[2014-10-29 15:55:47.393595] [016.1] [pid=6205] Host was UP.
[2014-10-29 15:55:47.393598] [016.1] [pid=6205] Host is now DOWN/UNREACHABLE.
[2014-10-29 15:55:47.393609] [016.1] [pid=6205] Pre-handle_host_state() Host: SA0000000738235, Attempt=1/4, Type=SOFT, Final State=1
[2014-10-29 15:55:47.393794] [016.1] [pid=6205] Post-handle_host_state() Host: SA0000000738235, Attempt=1/4, Type=SOFT, Final State=1
...
[2014-10-29 15:57:25.744547] [016.1] [pid=6205] HOST: P00000008021520, ATTEMPT=9/19, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2014-10-29 15:57:25.744551] [016.1] [pid=6205] Host was DOWN/UNREACHABLE.
[2014-10-29 15:57:25.744554] [016.1] [pid=6205] Host is still DOWN/UNREACHABLE.
[2014-10-29 15:57:25.744559] [016.1] [pid=6205] Pre-handle_host_state() Host: P00000008021520, Attempt=9/19, Type=SOFT, Final State=1
[2014-10-29 15:57:25.744736] [016.1] [pid=6205] Post-handle_host_state() Host: P00000008021520, Attempt=9/19, Type=SOFT, Final State=1
...
(reload at 16:02:03)
[2014-10-29 16:02:03.403018] [004.0] [pid=6205] Original UID/GID: 611/611
...
[2014-10-29 16:05:14.102254] [008.1] [pid=6205] Event Check Loop
...
[2014-10-29 16:05:15.411340] [016.1] [pid=6205] HOST: SA0000000738235, ATTEMPT=4/4, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2014-10-29 16:05:15.411343] [016.1] [pid=6205] Host was DOWN/UNREACHABLE.
[2014-10-29 16:05:15.411346] [016.1] [pid=6205] Host is still DOWN/UNREACHABLE.
[2014-10-29 16:05:15.411349] [016.1] [pid=6205] Pre-handle_host_state() Host: SA0000000738235, Attempt=4/4, Type=HARD, Final State=1
[2014-10-29 16:05:15.922831] [016.1] [pid=6205] Post-handle_host_state() Host: SA0000000738235, Attempt=4/4, Type=HARD, Final State=1
...
[2014-10-29 16:05:17.055666] [016.1] [pid=6205] HOST: P00000008021520, ATTEMPT=19/19, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[2014-10-29 16:05:17.055670] [016.1] [pid=6205] Host was DOWN/UNREACHABLE.
[2014-10-29 16:05:17.055673] [016.1] [pid=6205] Host is still DOWN/UNREACHABLE.
[2014-10-29 16:05:17.055676] [016.1] [pid=6205] Pre-handle_host_state() Host: P00000008021520, Attempt=19/19, Type=HARD, Final State=1
[2014-10-29 16:05:17.590007] [016.1] [pid=6205] Post-handle_host_state() Host: P00000008021520, Attempt=19/19, Type=HARD, Final State=1
...
Updated by mfriedrich on 2014-10-29 23:42:40 +00:00
100k service checks - wow. I wasn't aware that a single 1.x core could handle that, not even with mod_gearman (our benchmarks show that 80k is already a problem in terms of latency).
Regarding the installation - there are different ways to do it in 1.x (source, packages), so I am interested in a possible way of testing things (be it patches, or trying to reproduce the problem). I can't access your systems, so I'm trying to gather as much information as possible.
In terms of the debug log, I'd generally like to see it (you probably want to obfuscate things; if not, send it to michael.friedrich(at)netways.de instead of attaching it here). It's hard to describe what I am looking for, or what's inside.
But mod_gearman brings a new problem into play - how the checks are fed back into the core. mod_gearman does that in-memory by merging into the checkresult list (which is unsafe, as you cannot control how many checkresults are stashed into the list in memory).
A reload must not clear the core's memory - including the checkresult list which is still to be processed - so that list may grow quite a lot.
I could be mistaken and it's a general problem - but then it would show up in ordinary installations as well, which it does not. Maybe it only appears under heavy load.
Looking at your reload time -
(16:02:03 reload, 16:05:14 event loop started)
That's 191 seconds where the core does nothing. No check processing, no scheduling. As far as I can see, mod_gearman starts its operation directly again, but waits for the checkresult timed event and stays idle before merging back any checkresults, while the remaining old checks fill up the gearman checkresult queue in the meantime. It would be interesting to know how many are waiting during that reload window.
Other than that, I'd be interested in the mod_gearman version. If the problem is located there, its developer might want to know as well. Plus the version of the gearmand job server in use. Testing without mod_gearman sounds reasonable, but not realistic...
(Off-topic question - when are you considering migrating to Icinga 2? Your environment would fit a single instance running all checks by itself perfectly fine, and I could help you better with debugging it.)
Updated by mckslim on 2014-10-30 20:09:42 +00:00
On our installation, we built from source.
On the debug log, it is 300 MB for 30 minutes of data - what part would you want? I'll assume the 10 minutes around the reload time and just send that for now, probably in a separate email if I need to.
gearman versions:
$ /opt/gearman/sbin/gearmand -V
gearmand 1.0.2 - https://bugs.launchpad.net/gearmand
$ /opt/mod_gearman/bin/gearman_top -V
gearman_top: version 1.4.2
$ /opt/mod_gearman/bin/mod_gearman_worker -V
mod_gearman_worker: version 1.4.2 running on libgearman 1.0.2
On the reload time, that is a while, yes; we suppose it's because we have so many hosts/services in the config.
On migrating to icinga2, we haven't thought much about that yet; we will consider it next year.
Updated by mckslim on 2014-10-31 16:21:06 +00:00
FYI, I sent this file to 'michael.friedrich@netways.de' yesterday: icinga.debug.201410291620.part.gz
Updated by mfriedrich on 2014-10-31 16:30:06 +00:00
It seems your company does not like the content - I got a rejection mail.
InterScan Messaging Security Suite has detected a message which violated the security settings configured in Policy > Scanning Exceptions.
Updated by mckslim on 2014-10-31 17:09:04 +00:00
I tried sending it again.
Updated by mfriedrich on 2015-02-14 16:03:49 +00:00
Sorry, I lost track a bit here and did not receive any debug log. Since your checks are actively scheduled, it does not compare well to the "passive host checks are soft" config option I had in mind when looking at a similar-looking customer problem lately.
I'd say you really should consider looking into Icinga 2 - especially once 2.3 is out, imagine what you can do with it: https://www.icinga.org/2015/01/13/new-features-in-icinga-2-3/ https://www.icinga.org/2015/02/11/icinga-2-using-functions-in-custom-attributes/
Apart from that, try the Vagrant boxes. I've upgraded them to CentOS7 last weekend. https://github.com/icinga/icinga-vagrant
Updated by berk on 2015-05-18 12:18:09 +00:00
Updated by mfriedrich on 2015-08-04 18:59:49 +00:00
There's not much I can do about it, and further feedback probably won't help much (other than debugging it directly on the box, obviously). I'm leaving this open for the time being in case others run into the same problem.
You might want to try this patch: https://github.com/NagiosEnterprises/nagioscore/commit/d3187897cd4d4ab1702074892b27855cf65bd992
This was fixed in d2cf76522b already.
This issue has been migrated from Redmine: https://dev.icinga.com/issues/7277
Created by mckslim on 2014-09-18 20:03:13 +00:00
Assignee: (none) Status: New Target Version: Backlog Last Update: 2015-08-04 18:59:49 +00:00 (in Redmine)
Here is an example:
[2014-09-16 07:56:11] HOST ALERT: P00000008057091;DOWN;SOFT;14;CRITICAL: Connection timed out to '50.73.129.57' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 14 of 19. . Traceroute completed, increased max_attempts to 19.
(another check starts in here)
[2014-09-16 08:01:59] Caught SIGHUP, restarting...
(check completes in here)
[2014-09-16 08:05:17] Event loop started...
[2014-09-16 08:05:18] HOST ALERT: P00000008057091;DOWN;HARD;19;CRITICAL: Connection timed out to '50.73.129.57' after 160 seconds (user 'ibm_hlth_chk'). Expected prompt not found. Last output was ''. Attempt 15 of 19. . Traceroute completed, increased max_attempts to 19.
[2014-09-16 08:05:18] HOST NOTIFICATION: soc;P00000008057091;DOWN;cpe_outage_ticket;CRITICAL: Connection timed out to 50.73.129.57 after 160 seconds (user ibm_hlth_chk). Expected prompt not found. Last output was . Attempt 15 of 19. . Traceroute completed, increased max_attempts to 19.
Note that the check before the reload (SIGHUP) was attempt 14 of 19, while the check result after the reload was DOWN;HARD;19 - so it went HARD on attempt 15 without ever reaching attempt 19. Why is this happening? It causes premature notifications, which increases the number of problems we have to process.
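For context, a minimal C sketch of the generic soft/hard retry logic (an illustration only, not the actual Icinga 1.x code): once current_attempt reaches max_attempts for a problem state, the state type goes HARD and a notification is sent, which is why an attempt counter that jumps straight to the maximum after a reload notifies prematurely.

#include <stdio.h>

/* illustration only: generic soft/hard transition, not the core's code */
static void evaluate(int current_attempt, int max_attempts, int is_problem) {
    int hard = (is_problem && current_attempt >= max_attempts);
    printf("attempt %d/%d -> %s%s\n", current_attempt, max_attempts,
           hard ? "HARD" : "SOFT",
           hard ? " (notification)" : "");
}

int main(void) {
    evaluate(14, 19, 1);   /* before the reload: still SOFT              */
    evaluate(19, 19, 1);   /* after the reload the attempt counter jumps
                              to the maximum, so the state goes HARD      */
    return 0;
}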