example42 / puppet-nagios

Puppet module for Nagios

Problem disabling monitors or removing hosts #25

Open salderma opened 11 years ago

salderma commented 11 years ago

I am running Foreman+Puppet+PuppetDB on the same host.

There are cases where services or hosts are left dangling in Nagios after being disabled or removed from the target node's classification:

Scenario 1: A node has a monitored service such as OpenSSH, and Puppet creates the Nagios service monitor for it. Suppose that, due to an infrastructure firewall, Nagios can no longer reach the node on tcp/22, and I wish to set openssh_monitor = false. The Puppet run on the Nagios server produces the following output:

 notice: /File[/etc/nagios/auto.d/services/buildtest.example.edu-openssh_process.cfg]/owner: owner changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest.example.edu-openssh_process.cfg]/group: group changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest.example.edu-openssh_process.cfg]/mode: mode changed '644' to '755'
 notice: /File[/etc/nagios/auto.d/services/buildtest.example.edu-openssh_tcp_22.cfg]/owner: owner changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest.example.edu-openssh_tcp_22.cfg]/group: group changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest.example.edu-openssh_tcp_22.cfg]/mode: mode changed '644' to '755'

After this action the puppet run completes with no restart/reload of the nagios service. Even if there were one, the service files would still be readable by the nagios user and would still get parsed on service restart.

Scenario 2: The node is removed from Foreman+Puppet+PuppetDB via Foreman delete + puppet node deactivate. Nagios host and service monitors are left hanging on the Nagios server. The Puppet run on the Nagios server looks similar to the above, but also touches host configs:

 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-00-baseservices.cfg]/owner: owner changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-00-baseservices.cfg]/group: group changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-00-baseservices.cfg]/mode: mode changed '644' to '755'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-nrpe_process.cfg]/owner: owner changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-nrpe_process.cfg]/group: group changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-nrpe_process.cfg]/mode: mode changed '644' to '755'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-openssh_process.cfg]/owner: owner changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-openssh_process.cfg]/group: group changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-openssh_process.cfg]/mode: mode changed '644' to '755'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-openssh_tcp_22.cfg]/owner: owner changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-openssh_tcp_22.cfg]/group: group changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/services/buildtest2.example.edu-openssh_tcp_22.cfg]/mode: mode changed '644' to '755'
 notice: /File[/etc/nagios/auto.d/hosts/buildtest2.example.edu.cfg]/owner: owner changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/hosts/buildtest2.example.edu.cfg]/group: group changed 'root' to 'nagios'
 notice: /File[/etc/nagios/auto.d/hosts/buildtest2.example.edu.cfg]/mode: mode changed '644' to '755'
 notice: Finished catalog run in 3.60 seconds

Again, no nagios service restart/reload within the puppet run.

The only resolution is to manually delete the modified files and reload Nagios.

alvagante commented 11 years ago

For case 1 the problem is due to the fact that in most modules the monitor defines are declared only if $monitor is true; otherwise they are not evaluated at all. Since monitoring is an optional feature I don't want the relevant resources declared (with ensure => absent) when $monitor is false. A possible approach could be to declare them whenever $monitor_tool is set, and then enable or disable them explicitly according to the value of $monitor. Would that make sense for you? This is a change that has to be made on ALL the modules :-I so I'd test it first on a specific one (openssh?)
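
For illustration, a minimal sketch of the proposed pattern, assuming an OpenSSH-style module and the example42 monitor::port define (names and parameters are illustrative and may differ from the actual code):

# Declare the check whenever a monitoring tool is configured at all;
# $monitor then only decides whether the check is enabled or removed.
if $openssh::monitor_tool != '' {
  monitor::port { "openssh_${openssh::protocol}_${openssh::port}":
    protocol => $openssh::protocol,
    port     => $openssh::port,
    target   => $openssh::monitor_target,
    tool     => $openssh::monitor_tool,
    enable   => $openssh::bool_monitor,   # false => the check is ensured absent
  }
}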

salderma commented 11 years ago

If you can mock up an example of how you'd like it to work, I'd be happy to help push it into the modules.

I would think a less popular module might be a good one to start with, but I will defer to your expertise.

TFTP or Splunk come to mind for me, as I was about to dig into those anyway to adjust how monitoring and firewalling are applied depending on the deployment method - e.g. to avoid xinetd port monitoring from the tftp module, since it seems to create a malformed Nagios service definition (perhaps that's another bug).

alvagante commented 11 years ago

Please take a look at this: https://github.com/example42/puppet-tftp/commit/67eb4e530b5879a9166bdcc7d9a0d31b9137e7e6 and let me know if it works (for tftp). If it does, we can extend the solution to the other modules.

salderma commented 11 years ago

Ok, some notes on my testing.

Test host: puppet 2.6.18-3.el6
Puppet master: foreman-1.2.2-1.rl6, puppet-3.2.2-1.el6, puppet-server-3.2.2-1.el6, puppetdb-1.3.2-1.el6
Nagios host: puppet-2.6.18-3.el6, nagios-3.5.0-1.el6

  1. Setting tftp_monitor = true produces no tftp-specific monitors on the Nagios server, or in Puppi on the localhost.
  2. Monitor checks for Nagios and Puppi:
     a. puppi checks for xinetd_process and xinetdtcp are created.
     b. nagios checks for xinetd_process and xinetdtcp are created.
  3. Setting tftp_monitor = false produces no significant change in monitors or status (output in paste: http://pastebin.com/Lj2ydvyV).
  4. Setting tftp_startup_mode = standalone, tftp_monitor = true results in a non-functional TFTP daemon:
     a. It appears there is no init script, so the service start fails.
     b. Output of the puppet run is pasted here: http://pastebin.com/urCLR9xA
  5. Monitor checks for Nagios and Puppi:
     a. puppi checks for xinetd have been removed from the local host.
     b. puppi check for tftp_process has been created on the local host.
     c. nagios checks for xinetd remain on the nagios host.
     d. nagios check for tftp_process has been created, but remains in a critical state.
  6. Changing tftp_startup_mode from xinetd to standalone after xinetd has been configured does not disable the xinetd config for tftp.

I guess I picked a bad module to start with; it would seem we've got some other problems in the way of the module_monitor = false issue between the module and the nagios server.

I believe I can clean up the xinetd monitoring by 1) removing xinetd port monitoring from the xinetd module and 2) taking an ensure => absent approach when switching modes in the tftp module.

alvagante commented 11 years ago

Yes, definitely tftp is not the proper module to test this. Would you try applying the changes I made to the tftp module to another module you use, one which doesn't use xinetd, and let me know how it works?

salderma commented 11 years ago

Sure, not a problem. I do have a question on the tftp module - do you have any idea how that xinetdtcp check originates? I need to figure out how to get rid of it. I'm guessing it's an artifact in PuppetDB from including the puppet-xinetd module, but I'm not sure how to clean it up.

alvagante commented 11 years ago

Yes, that's due to the inclusion of the xinetd module in the tftp one when $startup_mode = 'xinetd' (the default). So in order to remove its checks, apply to the xinetd module a patch similar to the one done on tftp and pass monitor => false (or disable => true if you don't need the xinetd service at all, since it was installed by tftp...). Incidentally, the tftp module should support the standalone service for Debian derivatives; on RedHat no default init script is provided (you can pass one via $file_init_template, but it would be better to have a working one embedded in the module).
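
For illustration, a hedged usage sketch of how such a patched xinetd module might then be driven from the node's configuration (parameter names follow the example42 convention described above and may differ in the actual module):

class { 'xinetd':
  monitor => false,    # keep xinetd but drop its Nagios/Puppi checks
  # disable => true,   # or stop/remove the xinetd service entirely if nothing else needs it
}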

salderma commented 11 years ago

Ok, thanks. The mystery for me is that the xinetd module doesn't have a port monitor at all in init.pp; oddly, there is a firewall stanza, and given xinetd's purpose I'm not at all sure how you'd handle that.

I've moved your code into my fork of the splunk module. I'll run a quick test and update the issue.

alvagante commented 11 years ago

Actually it's a mystery for me as well; there should not be any trace of an xinetdtcp monitor. And the firewall stanza there is actually pointless (since no port is defined in params.pp it's not used in any case).

salderma commented 11 years ago

A couple of notes on pre-testing tasks needed to undo the previous configuration...

  1. had to set tftp_disable = true
  2. had to explicitly include xinetd module and set xinetd_disable = true

See commit: https://github.com/salderma/puppet-splunk/commit/3ea8cc9b96aba161ee0f4fb3c3ce649788672b21

Going from splunk_monitor = true to splunk_monitor = false: success on the nagios end: http://pastebin.com/wZztgRKt

Going back to splunk_monitor = true works as expected to re-establish monitoring.

Going to splunk_disable = true provides the same results as splunk_monitor = false.

Looks good!

salderma commented 11 years ago

related commits: https://github.com/salderma/puppet-apache/commit/da1b013a05340f34fd349c7f84bb165a7865c56b https://github.com/salderma/puppet-puppet/commit/ed9e05a74003f843e9512e51e1dcfa8413faecdf https://github.com/salderma/puppet-vsftpd/commit/0e2015c3dacf4d7b4ee366820984c6da6ae92295

alvagante commented 11 years ago

They look good. When you are confident about them (I am :-) would you please submit a pull request?

lieutdan13 commented 10 years ago

To help with scenario 2, I think this can be (partially) solved by adding tags to exported resources.

For example, in nagios::host:

tag     => "nagios_check_${nagios::target::magic_tag}",

becomes

tag     => ["nagios_check_${nagios::target::magic_tag}", "nagios_host_${name}"],

When the node no longer exists and you are unable to run puppet on it, add the following to the nagios server's node:

File <| (tag == "nagios_check_${nagios::target::magic_tag}" and tag == "nagios_host_[FQDN of deceased host]") |> {
    ensure => absent
}

Note: Exact syntax may be different.
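
A hedged sketch of how that override could be written at collection time (the double angle-bracket form collects exported resources, whereas the single-bracket form above overrides resources already in the catalog; the host name below is purely illustrative):

File <<| tag == "nagios_check_${nagios::target::magic_tag}" and
         tag == 'nagios_host_deadnode.example.edu' |>> {
  ensure => absent,
}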

alvagante commented 10 years ago

That's interesting, but I'm not sure it would solve the case. There's already a purge on the Nagios server's auto.d directory, so the files created there DO exist as exported resources (more or less stale) in PuppetDB. So the normal solution would be to deactivate the node. Also, I don't fancy the idea of changing the (core) code and adding the FQDN of each deceased host to the Puppet code as needed. Still, the approach is interesting and may be worth some investigation.

lieutdan13 commented 10 years ago

Even after I deactivated the node and ran Puppet on the Nagios Server (same as Puppet Master in my case), the stale files in auto.d remain.

After re-visiting my solution: would it make sense to set $::nagios_grouplogic = $fqdn, thus not having to change any code? Either way, I don't feel that adding an extra tag to all exported resources would hurt anything.

alvagante commented 10 years ago

Yes, the extra tag does no harm. Note that $::nagios_grouplogic = $fqdn would lead to a Nagios server that collects checks only about its own node. The usage pattern is to set the name of a variable that allows the presence of different Nagios servers, each checking the nodes that share the same variable value. Typical usage is something like $::nagios_grouplogic = 'env', where env is a custom variable or fact that expresses the environment of a system. In this way a Nagios server with $env=test will collect checks only from nodes with $env=test, and a Nagios server with $env=prod will monitor just the prod nodes.
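
A hedged sketch of that grouping pattern, assuming plain top-scope assignments in site.pp (an ENC parameter or a custom fact would work the same way):

# On every monitored node and on its Nagios server (e.g. in site.pp):
$env               = 'prod'   # custom variable or fact expressing the environment
$nagios_grouplogic = 'env'    # group exported checks by the value of $env
# The Nagios server whose $env is 'prod' then collects only the checks
# exported by nodes whose $env is also 'prod'.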

lieutdan13 commented 10 years ago

I've been in the mindset of a single Nagios Server. Your explanation clears everything up. Thank you.

I will add the extra tag to exported resources and issue a PR sometime in the near future.

alvagante commented 10 years ago

Reviewing the code now, I see that the purging of the services and hosts directories is disabled by default. In manifests/skel.pp we have:

file { 'nagios_configdir_hosts':
  ensure  => directory,
  path    => "${nagios::customconfigdir}/hosts",
  mode    => '0755',
  owner   => $nagios::config_file_owner,
  group   => $nagios::config_file_group,
  require => File['nagios_configdir'],
  recurse => true,
  purge   => $nagios::bool_source_dir_purge,
  force   => true,
}

file { 'nagios_configdir_services':
  ensure  => directory,
  path    => "${nagios::customconfigdir}/services",
  mode    => '0755',
  owner   => $nagios::config_file_owner,
  group   => $nagios::config_file_group,
  require => File['nagios_configdir'],
  recurse => true,
  purge   => $nagios::bool_source_dir_purge,
  force   => true,
}

and source_dir_purge is disabled by default. It's also the wrong parameter for these dirs, and I would not set it to true. But I would try to set purge => true (possibly adding a dedicated parameter for that, like "purge_stale_checks") on the ${nagios::customconfigdir}/services and ${nagios::customconfigdir}/hosts dirs, as that should ensure that only exported resources present in PuppetDB are placed there. The risk is that if, for whatever reason, you don't collect exported resources from PuppetDB, you wipe out all the existing checks...
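
A hedged sketch of what that change in manifests/skel.pp could look like, showing only the relevant attributes of one resource (purge_stale_checks is the hypothetical parameter suggested above):

file { 'nagios_configdir_hosts':
  ensure  => directory,
  path    => "${nagios::customconfigdir}/hosts",
  recurse => true,
  purge   => $nagios::bool_purge_stale_checks,  # hypothetical new parameter
  force   => true,
}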

salderma commented 10 years ago

Forgive me if I misunderstand the comment. Are you intending to completely purge the auto.d/hosts and auto.d/services directories? I've manually added hosts and services into these directories for things like websites whose DNS/IP entries terminate on a load balancer that isn't managed by puppet, e.g. https://www.example.com, where SSL terminates on the load balancer but content is served by a group of apache hosts managed by puppet. Custom checks for things like SSL cert expiration, content service performance, etc. apply to the nagios config for this pseudo-host. Am I correct in understanding that I would need to find a different place to store those... perhaps the auto.d/extras directory would be more appropriate?


alvagante commented 10 years ago

Exactly, auto.d/extras is supposed to be the place for this extra stuff, either added via Puppet (as plain files or via defines) or manually, directly on the Nagios server (better if these files are also kept on the puppetmaster). auto.d/hosts and auto.d/services should be used only for automatically exported resources (so there you could try to force the purging).
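
For illustration, a hedged sketch of shipping such a hand-maintained pseudo-host check into auto.d/extras via Puppet (the module name, source path, and service reference are illustrative assumptions):

file { "${nagios::customconfigdir}/extras/www.example.com.cfg":
  ensure => file,
  owner  => $nagios::config_file_owner,
  group  => $nagios::config_file_group,
  mode   => '0644',
  source => 'puppet:///modules/site_nagios/www.example.com.cfg',
  notify => Service['nagios'],  # assuming the service resource is named 'nagios'
}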

salderma commented 10 years ago

Do you have a suggestion about how I should go about patching this for files in auto.d/hosts? It seems that deactivated hosts' configs get chowned and chmodded, but their module monitors only get removed if the modules have been patched.

alvagante commented 10 years ago

auto.d/hosts and auto.d/services are purged if $source_dir_purge = true (default false)
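
For illustration, a hedged usage sketch for enabling that purge, shown as a class parameter following the example42 params_lookup convention (a top-scope $nagios_source_dir_purge variable should behave the same):

class { 'nagios':
  source_dir_purge => true,   # purge files in auto.d/hosts and auto.d/services
                              # that no longer come from exported resources
}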

salderma commented 10 years ago

Oh, great! I guess I missed that above but I see it in the code now.

salderma commented 10 years ago

I'm not sure if this effect is related to source_dir_purge = true, but I had an interesting result when removing a monitor::plugin recently...

Here's what happened: I had a monitor::plugin defined in a node manifest. The nagios::service was defined and operating for this monitor. Later, I removed the monitor::plugin from the node's manifest. Puppet silently ran on the node, and when Puppet ran on the Nagios server, the nagios::service definitions for this monitor::plugin were removed as expected. The problem is that the Nagios service was not notified to restart after the resources were removed. Later on, I realized this was the case and manually restarted the nagios service, and things cleared up as I originally expected.

I'm guessing that when source_dir_purge removes files, it does not notify the service to restart?

alvagante commented 10 years ago

Interesting. Probably adding notify => $nagios::manage_service_autorestart to the file { 'nagios_configdir_hosts': } and file { 'nagios_configdir_services': } resources where the dirs are defined (in https://github.com/example42/puppet-nagios/blob/master/manifests/skel.pp) might solve the problem. If you can test this I'd happily accept a PR
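
A hedged sketch of that suggestion against manifests/skel.pp, showing only the relevant attributes of one of the two directory resources (the others stay as in the existing code quoted earlier):

file { 'nagios_configdir_services':
  ensure  => directory,
  path    => "${nagios::customconfigdir}/services",
  recurse => true,
  purge   => $nagios::bool_source_dir_purge,
  force   => true,
  notify  => $nagios::manage_service_autorestart,  # reload Nagios when purged files are removed
}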