bernisys opened this issue 3 years ago
This really seems like a one-off and should be a plugin, if anything. Honestly, I would just add a sleep to the cron myself.
* * * * * sleep 10; php poller.php
True, it's probably not the everyday use case in every environment. I have reasons to think that just a sleep in the cron is not that flexible, and you need to be able to log in to the system's shell. If you have, for example, GUI admins who have no shell access, they could not change that setting. You also have to log in to each individual poller and fiddle with the cron contents, which can be error-prone. It would be much nicer to have it centrally configurable inside the tool itself, so one can adapt a value per poller over the GUI.
Just an idea though - as noted, I have already created a wrapper script for our environment (which does even more than just the hold-off). Maybe someone else has a similar use case and can add their thoughts about this proposal?
Is there any hint how I would be able to achieve this kind of solution with a plugin? That approach sounds interesting, but I've never looked into the plugin architecture...
BTW, a hold-off per poller could partially relieve (distribute in time) some stress on the central DB. If I understand the whole polling mechanism correctly, you could for example put all pollers with really small device counts at the start of the cycle and shift the larger ones out to a later point in time. The number of parallel DB updates would then be a fair bit lower, and the boost-table updates on the central system would also hit in a less parallelized manner.
just offering ideas ...
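To make the staggering idea a bit more concrete (the offsets are purely illustrative), with plain cron it would look something like this on the individual pollers:

    # poller with only a few devices - start right at the cron minute
    */5 * * * * php poller.php
    # heavily loaded poller - hold off for 80 seconds first
    */5 * * * * sleep 80; php poller.php

The drawback remains what I described above: this has to be maintained per poller on the shell, instead of once in the GUI.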
I would span the device port, find out who the worst culprit is, and make hay. Unless you are using the "verify all fields" reindex method, my guess is it's not Cacti.
Adding a bit more of my thoughts regarding this topic.
After upgrading to 1.2.16 and using spine from the branch, we were still seeing some of these regular spikes which I mentioned in #4219 (you can see this in the poller time aggregate, the first image I attached). Among further optimizations that were not significantly beneficial, I added a parameter to our poller wrapper script which allows us to flexibly shift the actual spine start by an individual number of seconds per poller. We have one really loaded poller with 1200 devices, close to 190k SNMP sources and 6k script queries; that one starts first now, delayed by 80 seconds from the 5-minute cron job. One of the next "biggest" pollers, with ~700 devices and 215k SNMP / 1.3k scripts, follows with a 100-second delay after the cron start. All the others then start at +120 seconds from the cron.
This has greatly improved our performance, and I can recommend this relaxation pattern for larger environments. The peaking is now completely gone; the effect was comparable to that of the spine upgrade.
My assumption is that the insert queries to the main DB from the rather fast-running SNMP sampling arrive so quickly that the database becomes highly loaded, and this has a negative side effect on all pollers, even those which would otherwise run fine. Shifting the point in time when this first in-rush happens apparently relaxes the momentary DB load and spreads it across a wider time window. Now all pollers can deliver their data much faster, one by one.
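For reference, here is a stripped-down sketch of the hold-off part of our wrapper (the real script does quite a bit more, and the path and argument handling here are only placeholders):

    #!/bin/sh
    # poller_wrapper.sh - called from cron instead of poller.php
    # first argument: number of seconds to hold off before polling starts
    DELAY="${1:-0}"

    # shift the start of this poller's cycle by the configured delay
    sleep "$DELAY"

    # then hand over to the normal Cacti poller (path is just an example)
    exec php /usr/share/cacti/poller.php

The cron entry stays at */5 on every poller, only the argument differs, e.g. "*/5 * * * * /usr/local/bin/poller_wrapper.sh 80" on the big one.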
As a comparison, here are the improvements we achieved in the past - sorry for the poor image quality, I just screenshotted them out of my internal mails. Disabling the poller replication only added a bit of stability; most of the beneficial effects came from the spine built from the git master branch.
You can easily spot the situation relaxing gradually over time after applying several countermeasures. Here is some detail about the situation before and after the new Cacti + spine installation:
Then last Friday in the evening I shifted our first poller; you can see the light green line decreasing significantly, even though the one I shifted was actually the one with the blue graph (ignore the peak around Friday midnight, that's a different story with one overloaded network). You can see the blue graph already stabilizing from Friday afternoon, it shows through the other spikes. And on Monday morning I shifted the second one (purple) ... this relaxed the situation completely.
The results are speaking for themselves:
This shows that a big congestion can be resolved just by time-delaying the poller starts, and I would really prefer having this possibility inside Cacti, as it's much easier to check and adapt in the central GUI.
Feature Request
Is your feature request related to a problem? Please describe
There are several other systems which query the devices we monitor, and at some points in time it seems that some of the devices (Cisco ACE, for example) are overloaded with the incoming SNMP requests and start to either react much more slowly or not respond at all. So I wrote a wrapper script which takes care of this problem by inserting a sleep time before starting the actual polling cycle. This way we don't always hit the same sampling times, the devices are much more relaxed, and our poll times are much better as well.
Describe the solution you'd like
For each polling profile there should be a configurable hold-off time, to be able to shift the polling process while still keeping the simple "*/5" minutes cron job. This would allow for a more flexible polling configuration, as devices could be assigned to specific profiles with a shifted polling period within the standard 5-minute window.
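As a simple example of what I mean: with the cron still firing at :00, :05, :10 and so on, a profile with a 0-second hold-off would sample at :00:00, :05:00, ... while a profile with a 120-second hold-off would sample at :02:00, :07:00, ... - the interval stays 5 minutes, only the phase is shifted per profile.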
Describe alternatives you've considered
Additional context
Another idea, maybe in the same league: allow configuring an external script that can be used, for example, for active cluster node detection in a failover setup, where Cacti has a cron entry on both nodes but is not running on shared storage. The script output (TBD) would control the behavior of the Cacti polling.
This way no full wrapper needs to be built, just an external script that is configurable inside Cacti on a per-poller basis.
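Just to sketch what I have in mind (the VIP check is only one possible criterion, and all names and addresses here are placeholders), such a check script could be as small as:

    #!/bin/sh
    # is_active_node.sh - tell Cacti whether this node should poll
    # exit 0 = active node, run the polling cycle
    # exit 1 = passive node, skip this cycle
    VIP="192.0.2.10"    # cluster service IP, placeholder

    if ip -4 addr show | grep -q "$VIP"; then
        exit 0
    fi
    exit 1

Cacti would then only kick off the cycle on the node where the script signals "active"; whether that signal is an exit code or some defined output is exactly the part I marked as TBD above.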