Closed infraweavers closed 4 years ago
I guess this would pile up goroutines pretty fast and, with that, potentially exceed open file limits and such, since each of those clients will try to create a new network connection, which may run into a timeout.
@sni Absolutely, this basically just moves the problem from naemon into Go. We were thinking that probably the easiest solution is to limit the maximum number of async dupserver requests in flight, on the assumption that if 1000 (or whatever value, tunable from the config) are still "sending", another 1 or 2 would also end up hanging for a while.
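For illustration, the in-flight limit described above could be sketched roughly as follows. This is only a sketch under assumptions: asyncLimiter, trySend and ErrTooManyInFlight are hypothetical names, not code from this PR, and the send callback stands in for whatever actually submits a result to the dupserver.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrTooManyInFlight is a hypothetical error returned when the limit is hit.
var ErrTooManyInFlight = errors.New("too many dupserver sends in flight")

// asyncLimiter caps the number of concurrently running send goroutines.
// All names here are illustrative assumptions, not the worker's real API.
type asyncLimiter struct {
	slots chan struct{} // buffered channel used as a counting semaphore
	wg    sync.WaitGroup
}

func newAsyncLimiter(maxInFlight int) *asyncLimiter {
	return &asyncLimiter{slots: make(chan struct{}, maxInFlight)}
}

// trySend runs send in a new goroutine if a slot is free.
// If the limit is reached it returns an error without blocking the caller.
func (l *asyncLimiter) trySend(send func()) error {
	select {
	case l.slots <- struct{}{}: // slot acquired
	default:
		return ErrTooManyInFlight
	}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		defer func() { <-l.slots }() // release the slot before Done
		send()
	}()
	return nil
}

// wait blocks until all started sends have returned.
func (l *asyncLimiter) wait() { l.wg.Wait() }

func main() {
	limiter := newAsyncLimiter(2)
	block := make(chan struct{}) // simulates a dupserver that never answers
	dropped := 0
	for i := 0; i < 5; i++ {
		if err := limiter.trySend(func() { <-block }); err != nil {
			dropped++
		}
	}
	close(block)
	limiter.wait()
	fmt.Println("dropped:", dropped)
}
```

The caller never blocks: once the configured number of sends are hanging, further results are rejected immediately rather than spawning more goroutines and connections.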
1000 hanging network connections and goroutines is also quite a number. What about one goroutine which periodically checks the network connection while stashing the results on a queue (with a limit)? As soon as the dupserver is available again, the queue is flushed. This way we only need memory to store the queue temporarily.
@sni sure.
Just to confirm what you mean: the way I interpret that is we'll change SendResultDup to push onto an in-memory queue, then create a single goroutine (possibly one per configured dupserver, actually) whose job is to pull items from that queue and do the actual work of sending them to the dupserver, with a limit on how many items that queue can hold.
Right, that's what I meant. One per dupserver sounds like a good way to do it.
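A minimal sketch of the agreed design, assuming a buffered channel as the bounded queue and one worker goroutine per dupserver. The names dupQueue, newDupQueue and enqueue are hypothetical, and send is a stand-in for the real submission call. The sketch does not cover the periodic connection check or retrying failed sends.

```go
package main

import "fmt"

// dupQueue buffers check results for a single dupserver.
// All names are illustrative assumptions, not the worker's actual code.
type dupQueue struct {
	items chan string   // bounded in-memory queue
	done  chan struct{} // closed when the worker goroutine exits
}

// newDupQueue starts the one goroutine that drains the queue.
// send is a placeholder for whatever submits a result to the dupserver.
func newDupQueue(limit int, send func(result string)) *dupQueue {
	q := &dupQueue{items: make(chan string, limit), done: make(chan struct{})}
	go func() {
		defer close(q.done)
		for r := range q.items {
			send(r) // may block while the dupserver is unreachable
		}
	}()
	return q
}

// enqueue never blocks: if the queue is full the result is dropped
// and false is returned, so the caller can log the loss.
func (q *dupQueue) enqueue(result string) bool {
	select {
	case q.items <- result:
		return true
	default:
		return false
	}
}

// close stops accepting results and waits for the worker to drain the queue.
func (q *dupQueue) close() {
	close(q.items)
	<-q.done
}

func main() {
	started := make(chan struct{})
	release := make(chan struct{})
	sent := 0
	q := newDupQueue(2, func(result string) {
		if result == "r0" {
			close(started)
			<-release // simulate an unreachable dupserver
		}
		sent++
	})
	q.enqueue("r0")
	<-started // the worker is now stuck sending r0
	dropped := 0
	for i := 1; i <= 5; i++ {
		if !q.enqueue(fmt.Sprintf("r%d", i)) {
			dropped++
		}
	}
	close(release) // dupserver "comes back"; the queue is flushed
	q.close()
	fmt.Println("sent:", sent, "dropped:", dropped)
}
```

With this shape, an unavailable dupserver costs one blocked goroutine and at most limit queued results per dupserver, instead of one hanging connection per check result.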
closing this one, PR continues in #14
We use mod_gearman in a high-availability setup (i.e. one active and one passive) and use dupserver to submit the check results to the secondary server. We find that when the secondary server is unavailable, all the checks on the primary grind to a halt, apparently because it is waiting for the secondary to respond. We've not been able to find a tunable connection timeout in the Gearman Go client, so making the dupserver send asynchronous seems to be the best path to having this work as we intend.

We've made it a configuration option, so that the default is unchanged and users who are unable to accept invisible job losses can keep the default option. For us, with send_dup_results_async = yes, when we kill our secondaries off, the primary is essentially unaffected except for occasional flurries of logs for:

We were unable to find the example mod_gearman_worker configuration file to add a stub for this option in, so it's not included in this PR.