Adding ability to send answer to dupservers asynchronously

infraweavers commented 4 years ago

We use mod_gearman in a high-availability setup (one active and one passive server) and use dupserver to submit the check results to the secondary server. When the secondary server is unavailable, all checks on the primary grind to a halt, apparently because the worker waits for the secondary to respond. We have not been able to find a tunable connection timeout in the Gearman Go client, so making the dupserver send asynchronous seems like the best path to having this work as we intend.

We've made it a configuration option so that the default is unchanged, and users who cannot accept silent job losses can keep the default. For us, with send_dup_results_async = yes, when we kill our secondaries the primary is essentially unaffected, except for occasional flurries of log lines like:

 failed to send back result (to dupserver): dial tcp 192.168.12.99:4730: connect: connection timed out

We were unable to find the example mod_gearman_worker configuration file to add a stub for this option in, so it's not included in this PR.

sni commented 4 years ago

I guess this would pile up goroutines pretty fast and, with that, potentially exceed open file limits and such, since each of those clients will try to create a new network connection, which potentially runs into a timeout.

infraweavers commented 4 years ago

@sni Absolutely, this basically just moves the problem from naemon into Go. We were thinking that probably the easiest solution is to limit the maximum number of async dupserver requests in flight, on the assumption that if 1000 (or whatever value, tunable from the config) are still "sending", another 1 or 2 would also end up hanging for a while.
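
As a rough sketch of that in-flight cap (not code from this PR; `dupLimiter`, `trySend` and `errDupBusy` are hypothetical names, and the real send function is stubbed out), a buffered channel can act as a counting semaphore so that extra results are dropped instead of spawning more hanging goroutines:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDupBusy is returned when the in-flight limit has been reached.
// Hypothetical name, not part of mod-gearman-worker-go.
var errDupBusy = errors.New("dupserver: too many sends in flight, result dropped")

// dupLimiter caps the number of concurrent asynchronous dupserver sends.
type dupLimiter struct {
	slots chan struct{} // buffered channel used as a counting semaphore
	wg    sync.WaitGroup
}

func newDupLimiter(maxInFlight int) *dupLimiter {
	return &dupLimiter{slots: make(chan struct{}, maxInFlight)}
}

// trySend runs send in its own goroutine if a slot is free.
// If all slots are taken it returns errDupBusy immediately and never blocks.
func (l *dupLimiter) trySend(send func() error) error {
	select {
	case l.slots <- struct{}{}:
	default:
		return errDupBusy
	}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		defer func() { <-l.slots }() // free the slot before signalling Done
		if err := send(); err != nil {
			fmt.Println("failed to send back result (to dupserver):", err)
		}
	}()
	return nil
}

// wait blocks until all accepted sends have finished.
func (l *dupLimiter) wait() { l.wg.Wait() }

func main() {
	limiter := newDupLimiter(2)
	release := make(chan struct{})
	// Stand-in for a send that hangs on an unreachable dupserver.
	hang := func() error { <-release; return nil }
	for i := 0; i < 3; i++ {
		if err := limiter.trySend(hang); err != nil {
			fmt.Println("result", i, "dropped:", err)
		}
	}
	close(release)
	limiter.wait()
}
```

With a limit of 2, the third result in `main` is dropped while the first two are still "sending". The limit would come from the config in a real implementation.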

sni commented 4 years ago

1000 hanging network connections and goroutines is also quite a number. What about one goroutine which periodically checks the network connection while stashing the results on a queue (with a limit)? As soon as the dupserver is available again, the queue is flushed. This way we only need memory to store the queue temporarily.

infraweavers commented 4 years ago

@sni sure.

Just to confirm what you mean: the way I interpret that is that we'll change SendResultDup to push onto an in-memory queue, then create a single goroutine (possibly one per configured dupserver, actually) whose job is to pull items from that queue and do the actual work of sending them to the dupserver, with a limit on how many items the queue can hold.

sni commented 4 years ago

Right, that's what I meant. One per dupserver sounds like a good way to do it.

sni commented 4 years ago

Closing this one, PR continues in #14.

ConSol-Monitoring / mod-gearman-worker-go

Adding ability to send answer to dupservers asynchronously #13