dpeters / puppet-opsview

Puppet types/providers to support Opsview resources

constant reloads can cause issues #15

Open n2aws opened 11 years ago

n2aws commented 11 years ago

I'm documenting this here for 2 reasons.

1) I'm going on vacation tomorrow, and won't have a chance to fix this until next week.

2) I'm not running the latest version of this module, so it may have already been addressed. (Will test next week)

Background: Opsview doesn't handle multiple reloads well. The NDO log file becomes rather large. When this is compounded by doing a lot of reloads quickly, this file can grow to multiple gigs, causing the NDO import process to choke, or run out of RAM.

Cause: The way opsview_reload is currently implemented in this module, it's called directly instead of being triggered (via a subscribe or similar). This is fine for a small number of hosts that have staggered puppet runs. As the number of hosts increases, so does the chance of multiple hosts causing a reload at roughly the same time.

Suggestion: This is a multi-part fix. First, opsview_monitored should probably be implemented as a virtual resource (using puppetDB, or storeconfigs).
Second, the actual "reload opsview" part should be separated into its own class. (I'll explain why in a moment.) Third, only a small number of hosts should actually communicate with the opsview API. Those hosts should submit everything that has changed for every host, and trigger a SINGLE "reload" once that is done. This also reduces the number of clients that need to have a (potentially sensitive) opsview.conf file installed.
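
A minimal sketch of the first and third parts, assuming "virtual resource" here means Puppet exported resources (which is what puppetDB/storeconfigs provide). The class names are hypothetical, and the opsview_monitored attributes shown (ip, hostgroup, reload_opsview) are assumptions that should be checked against the installed version of the module:

```puppet
# On every monitored node: export the resource instead of applying it locally.
# Nothing here talks to the Opsview API, so no opsview.conf is needed.
class profile::opsview_client {               # hypothetical class name
  @@opsview_monitored { $::fqdn:              # legacy fact names, as used in older Puppet
    ensure         => present,
    ip             => $::ipaddress,           # assumed attribute
    hostgroup      => 'Example Host Group',   # assumed attribute, placeholder value
    reload_opsview => false,                  # assumed attribute: no per-resource reload
  }
}

# On the small number of nodes that hold opsview.conf and talk to the API:
# collect everything the other nodes exported.
class profile::opsview_api_host {             # hypothetical class name
  Opsview_monitored <<| |>>
}
```

With this layout, only nodes that include the collecting class ever contact Opsview; every other node just publishes its desired state to puppetDB.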

Right now, I've set up my opsview_monitored as a virtual resource, and only have 2 hosts that actually use the API. However, my environment is pretty dynamic, so occasionally a single puppet run on one of those hosts will trigger more than 100 reloads. If the reload were instead split off into its own class (or better, type), then we could "subscribe" to it and trigger one reload per run, rather than 100+.
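
The "one reload per run" idea could look roughly like the sketch below. It relies on standard Puppet behaviour: a `refreshonly` exec that receives any number of refresh events during a run is executed once. The class name, node name, and reload command are hypothetical, and `reload_opsview` is assumed to be the module's per-resource reload switch:

```puppet
# A separate class that owns the single reload.
class opsview::reload {                       # hypothetical class name
  exec { 'opsview_reload':
    command     => '/usr/local/bin/opsview-reload',  # hypothetical wrapper around the reload API call
    refreshonly => true,                      # only runs when notified
  }
}

node 'opsview-api.example.com' {              # hypothetical node name
  include opsview::reload

  # Collect all exported hosts, switch off their individual reloads,
  # and have each changed resource notify the shared exec instead.
  Opsview_monitored <<| |>> {
    reload_opsview => false,                  # assumed attribute
    notify         => Exec['opsview_reload'],
  }
}
```

If 100 collected resources change in one run, the exec still fires once at most, after the changes have been submitted. If none change, no reload happens at all.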

When I return from my vacation, I'll update my module and see if this is still an issue. If it is, I'll submit a pull request.