Closed icinga-migration closed 9 years ago
Updated by mfriedrich on 2015-04-03 16:25:10 +00:00
Check_MK uses a different mechanism to execute checks and then pass back the results. The immediate return code / output when executing one of its checks is just the active "agent", but all the collected results are passed to the core either through the command pipe or by injecting the files into the check result reaper (which may take a while processing all results in a sequential manner).
I suspect a different problem than the core processing check result files here:
There were no changes involved between 1.12.x and 1.13.x, as you may check on the commit list at https://github.com/Icinga/icinga-core/commits/master ... one patch which could've influenced that processing was left alone, not breaking things (see #8141).
Updated by ralfk on 2015-04-03 16:59:54 +00:00
Thanks for your feedback.
Of course is Check_MK using a different method of executing checks (which in deed is somehow complicated to understand). But as you can see in the log above from running Icinga 1.12.2, the passive checks are successfully processed (see "SERVICE: CPU load") by the spooler. After all passive checks are processed, the check result from the active "agent" is handled.
In Icinga 1.13.0 the check result from the active agent is the only result handled at all. The passive checks are completely missing.
I also thought it could be a permission issue. But believe me, I think that I checked everything. No SELinux or so. The spooler dir permissions were fine. Well, the check on command line with the same user the Icinga daemon is running works fine, so I guess it's not a permission problem. Maybe a path or environment problem (similar than running Linux command on CLI or via cron causes trouble sometimes)?
As soon as I switch from 1.12.2. to 1.13.0 again, the checks are not handled. Falling back to 1.12.2: it works. I do this by installing from source (make install). Nothing else changing between the version switches.
As soon as I am back at work next week, I can provide some config output (if this helps at all). I guess it would help if it would be possible to check the files in the spooler dir. Is there any (debug) mode or trick in order to avoid deleting these spooled files after processing?
I have another Icinga instance on my home server without Check_MK. I will install Check_MK there as well in order to have a physically different server for more tests. Very curious, I know.
Updated by mfriedrich on 2015-04-03 17:06:33 +00:00
I don't have any checkmk installation in order to reproduce the problem. I used it several years ago and am well aware of its internal processing and methods (also: hacks), though it highly depends on your system and configuration. As well as versions and configurations involved - you should provide them too.
The check result spool dir was never designed being exposed to addons injecting their check results. Therefore there are not much things you can set or enable in order to help debugging it. One thing you could do - let the files generated by checkmk being written to a different spool dir, and manually copy these files using a cronjob/daemon. That would slow down check result processing, but keeps your files on disk for further verification.
If it does not resolve like that, use the external command pipe for passing check results. This method should work under any circumstances.
Updated by ralfk on 2015-04-04 15:52:36 +00:00
I found out which changes in Icinga are responsible for the problems and how to apply a workaround in Icinga 1.13.0.
As I assumed, the problem was caused by something like an environment change. I played a bit with the Check_MK active check command, which is by default defined as:
define command {
command_name check-mk
command_line python /var/lib/check_mk/precompiled/"$HOSTNAME$"
}
Changing this check command to the following fixes the problem in Icinga 1.13.0:
command_line /usr/bin/python /var/lib/check_mk/precompiled/$HOSTNAME$
So, you must set the full path of python AND the double quotes must be removed. Both changes are needed.
Then I tried to find out which change in Icinga 1.13.0 caused this behavior. There are in deed more changes incorporated in 1.13.0, not only the commits after 1.12.2. Some commits were not merged to 1.12.x but only to 1.13.0.
The following commit are responsible:
See also changesets from #8139 and #8140 from @jrhunt.
I manually reverted the four corresponding files in my 1.13.0 source to release 1.12.2 and voila: The default check command definition from Check_MK worked again.
I don't know if changes from #8139 and #8140 do affect any other checks or addon, maybe this is purely related to Check_MK. However, I thing that the changes in Icinga break the compatibility with a common addon, which should be fixed in Icinga. I could test it against my Check_MK instance.
Updated by mfriedrich on 2015-04-04 20:33:19 +00:00
ralfk wrote:
I found out which changes in Icinga are responsible for the problems and how to apply a workaround in Icinga 1.13.0.
As I assumed, the problem was caused by something like an environment change. I played a bit with the Check_MK active check command, which is by default defined as: [...]
Changing this check command to the following fixes the problem in Icinga 1.13.0: [...]
So, you must set the full path of python AND the double quotes must be removed. Both changes are needed.
Which sounds weird, and sounds more like an issue with your shell (PATH environment, and shell escaping). Maybe you've added some init script presets changing this.
It probably did influence your manual tests, where the environment is different. Wrap your command line, and add `env -i >> /tmp/i2.env.`date +%s` or similar to write the entire environment for the icinga daemon onto disk.
Then I tried to find out which change in Icinga 1.13.0 caused this behavior. There are in deed more changes incorporated in 1.13.0, not only the commits after 1.12.2. Some commits were not merged to 1.12.x but only to 1.13.0.
That's how bugfix and major releases are developed. They always drift away, and support/1.x consists only of cherry-picked bug fixes, while the master contains everything.
The following commit are responsible: * https://github.com/Icinga/icinga-core/commit/13c0e2b221606c71b52ebb11a92e0f1d9c4bddf3 * https://github.com/Icinga/icinga-core/commit/12ff9969a7c9eba7875e2b0ab1654910b1619b06
See also changesets from #8139 and #8140 from @jrhunt.
I manually reverted the four corresponding files in my 1.13.0 source to release 1.12.2 and voila: The default check command definition from Check_MK worked again.
#8139 does nothing by default, if a neb module doesn't register fds. #8140 protects the checkresult list with a mutex.
Maybe this change is the only root cause, you might try to only remove that:
diff --git a/base/utils.c b/base/utils.c
index b82a5ea..9e05211 100644
--- a/base/utils.c
+++ b/base/utils.c
@@ -3079,9 +3079,6 @@ check_result *read_check_result(void) {
first_cr = check_result_list;
check_result_list = check_result_list->next;
- /* forcibly detach this check result from the list tail */
- first_cr->next = NULL;
-
/* unlock the mutex; all the destructive operations are done */
pthread_mutex_unlock(&check_result_list_mutex);
Everything else is just an internal lock. Question aside - which addons are you using exactly? (livestatus, etc)
I don't know if changes from #8139 and #8140 do affect any other checks or addon, maybe this is purely related to Check_MK. However, I thing that the changes in Icinga break the compatibility with a common addon, which should be fixed in Icinga. I could test it against my Check_MK instance.
Reverting all git commits just for making something work isn't really an option here, then it would just be staying on 1.12.2 and so on. I'd be interested in which change exactly is causing the check results not being processed. It's pretty much unclear to me where this error happens, and you did not attach the requested files / tests which makes it hard to understand the problem.
Updated by ralfk on 2015-04-05 07:47:41 +00:00
dnsmichi wrote:
ralfk wrote: > So, you must set the full path of python AND the double quotes must be removed. Both changes are needed.
Which sounds weird, and sounds more like an issue with your shell (PATH environment, and shell escaping). Maybe you've added some init script presets changing this.
No, the init script is the original from Icinga. The test I did yesterday and today were made on another Icinga instance on my home server with a "default" Icinga installation from source. No tweaks.
> Then I tried to find out which change in Icinga 1.13.0 caused this behavior. There are in deed more changes incorporated in 1.13.0, not only the commits after 1.12.2. Some commits were not merged to 1.12.x but only to 1.13.0.
That's how bugfix and major releases are developed. They always drift away, and support/1.x consists only of cherry-picked bug fixes, while the master contains everything.
Fully agree. This was just a friendly version of telling you, that your first rough analysis was wrong (that there were no corresponding changes made between 1.12.2 and 1.13.0) ;-)
Maybe this change is the only root cause, you might try to only remove that:
Yeah! Removing just the line "first_cr>next = NULL;"- did the trick. I can now run Icinga with the default Check_MK command definition and all passive checks are processed.
-> See next comment.
Question aside - which addons are you using exactly? (livestatus, etc)
Good question. I run livestatus and idomod. I just tested running the original 1.13.0 with both modules deactivated, which did not help. So, it's not related to a module running.
Reverting all git commits just for making something work isn't really an option here, then it would just be staying on 1.12.2 and so on. I'd be interested in which change exactly is causing the check results not being processed. It's pretty much unclear to me where this error happens, and you did not attach the requested files / tests which makes it hard to understand the problem.
Okay, I will provide configs and also the test with the environment printout. But not today. I guess we both have more interesting things to do on Eastern Sunday and Monday. So, let's go on on Tuesday. Happy Eastern!
Updated by ralfk on 2015-04-05 08:04:20 +00:00
Ah, shit. I reverted now back to the version with the one line removed in "base/utils", but the checks are not processed. So, my first analysis was wrong, sorry. I might be confused between the version changes and the regular 1min-check interval of Check_MK. Let's go on next week ...
Updated by mfriedrich on 2015-04-05 14:16:42 +00:00
Well, after thinking about that problem while working on the icinga-vagrant project yesterday (totally different, but good for the brain), I would consider the file descriptor patch dangerous in closing fds for the child process.
There's a difference in handling
python /var/lib/check_mk/precompiled/"$HOSTNAME$"
and
/usr/bin/python /var/lib/check_mk/precompiled/"$HOSTNAME$"
The full path gets executed by invoking popen() directly (or it has shell meta characters, one of these conditions are must be met), while path-dependent calls without shell metacharacters are handled by execvp(). See base/checks.c run_check() and base/utils.c has_shell_metachars().
int has_shell_metachars(const char *s) {
if (*s != '/' && *s != '.')
return 1;
if (strpbrk(s, "!$^&*()~[]\\|{};<>?'\""))
return 1;
return 0;
}
Your initial configuration would always be run using popen, but I think this does not matter in terms of the problem.
FDs are closed at 3 stages:
This sanity check in close_registered_fds() looping through the entire array isn't the best idea either, but still valid.
for (i = 0; i < REGISTERED_FD_MAX; i++) {
if (registered_fds[i] < 0) continue;
close(registered_fds[i]);
registered_fds[i] = -1;
}
The problem is, that the initialization of the array is wrong. The expected behaviour by the author would've been to inialize the entire array with the value "-1" not closing any FD by default.
int registered_fds[REGISTERED_FD_MAX] = { -1 };
Though the real result is that only the first entry is initialized as "-1", the rest contains 0 and other random numbers in the worst case. See http://stackoverflow.com/questions/1065774/c-c-initialization-of-a-normal-array-with-one-default-value for some discussion and details.
I could think of closing the file descriptor of the parent's process which could only be the checkresult fd
check_result_info.output_file_fd = mkstemp(output_file);
Not sure though if that affects the reaping of checkresults sent by the checkmk script though. It might close other fds by accident.
At least the wrong array initialization can be fixed easily. Please test that patch, but I am not sure if it really fixes your problem. https://git.icinga.org/?p=icinga-core.git;a=commit;h=8d312ed41045ec39c3e6397e812310e9ac000b3c;js=1
Updated by ralfk on 2015-04-05 16:11:50 +00:00
Great analysis. Your patch fixed it, apart from a compile error, since you declared i within the loop. I created a patch for this, see attachment.
I double-checked the behavior by manually patching Icinga 1.13.0 tarball and also by checking out latest master branch. Both work now with the default check-mk command definition.
Off topic: I recently tested Icinga 2 and Icinga Web 2. Looks very promising, even Web 2 looks much better than Web 1 (which I did never like and used in practice). I like the new approach of Icinga 2. However, a big show stopper for me is that Check_MK is not working with Icinga 2 core (I mean just the auto-discovery of checks and the agent). I know that there were some discussions on mailinglists and the forum about it. Either Icinga 2 or Check_MK team should consider acting on that issue. I know that both projects are rather competitors. But maybe there is a way? Once Check_MK configuration (which is in Icinga 1 / Nagios format) would work, I would go for Icinga 2 as core. My installation at work is in a big company and for a distributed monitoring approach Icinga 2 sounds ideal. I really would show that Icinga is the better choice than Nagios or the micro core from Check_MK.
Updated by mfriedrich on 2015-04-05 16:58:29 +00:00
Thanks. I'm compiling on Fedora 21, not really a fan of C99 declarations (but that's something for beer and opinions).
I've applied your patch, and will tag 1.13.1 soon-ish. Thanks for being a brave tester :)
Regarding OT with Icinga 2 - please join the mailing lists for a discussion on that. I hear that from time to time, but never seen a vibrant discussion anywhere about that topic (by vibrant I mean someone patches the MK export to generate I2 config).
Updated by mfriedrich on 2015-04-05 17:07:43 +00:00
Updated by mfriedrich on 2015-04-05 17:08:04 +00:00
Updated by ralfk on 2015-04-05 17:26:08 +00:00
Ah, thanks. Great to see again how fast and transparent Open Source is fixed. What an honor to be now on the list of authors with such a small contribution. I am rather an architect and admin than developer. But even as non-hardcore developer it's often important to write code to glue things together, even though in this special case my C skills are from studies 20 years ago. Especially the availability of git makes it easier for people like me to contribute.
This issue has been migrated from Redmine: https://dev.icinga.com/issues/8986
Created by ralfk on 2015-04-02 12:21:52 +00:00
Assignee: mfriedrich Status: Resolved (closed on 2015-04-05 17:07:43 +00:00) Target Version: 1.13.1 Last Update: 2015-04-05 17:26:08 +00:00 (in Redmine)
If I use Icinga 1.12.2 with Check_MK 1.2.6, everything works fine. The Check_MK passive check states (triggered by the main "check-mk" active check command) are properly updated in the Icinga GUI and I see "PASSIVE SERVICE CHECK" infos in my syslog.
After upgrading to Icinga 1.13.0 the passive checks from Check_MK do not work anymore. The active "check-mk" check command is properly executed and the check output itself is fine, but the passive check states are not updated any more. No entry in syslog about incoming "PASSIVE SERVICE CHECK".
Strange is: If I execute the "check-mk" command on command line (CLI) as the Icinga daemon user, then the passive checks are properly send to the Icinga daemon.
But as soon as the Icinga scheduler is running the active check, I don't get passive check updates in Icinga. I checked the config of the active check, which is fine. I also have debug logging enabled. See log below.
After falling back to Icinga 1.12.2 with the same config, the passive checks are fine again. I could check this behavior in two different Icinga instances.
What you can see in the logs: In Icinga 1.13.0 only the active check results is handled:
In Icinga 12.2 the passive check results are handled first (example "CPU load") and at the end the active check result as well. Icinga 1.13.0 does skip the passive checks. However, to be sure that the check result files are proper, is there a way to avoid deleting the check result files after processing? Maybe the check result file in Icinga 1.13.0 is different when executing by the daemon or manually (which works). Is there something changed in Icinga 1.13.0 which could cause this?
Logging with Icinga 1.13.3
Logging with Icinga 1.12.2
Attachments
Changesets
2015-04-05 14:14:48 +00:00 by (unknown) 8d312ed41045ec39c3e6397e812310e9ac000b3c
2015-04-05 16:51:50 +00:00 by (unknown) 67cc8d2b610346320e827eb13729e7e7f0a6da68