ConSol-Monitoring / mod-gearman-worker-go

Mod-Gearman Worker rewrite in Golang
GNU General Public License v3.0

gearmand failure on a dupserver breaks check execution #3

Closed infraweavers closed 5 years ago

infraweavers commented 5 years ago

Hello,

Our use case for mod-gearman-worker is an HA setup: we have 2 omd instances, and in each instance's worker.cfg the dupserver is set to the other instance. A server called omd1 is configured to perform active checks and a server called omd2 is not. Since upgrading both to omd 2.90 we have moved to the Go worker, and this model has fallen apart :( !

We find that if we run omd1 standalone with dupserver configured against omd2, but with gearmand stopped on omd2, the checks never complete: the number of jobs running on the services queue in gearman_top keeps climbing and the results are never submitted back to omd1. The cause appears to be the loop at https://github.com/ConSol/mod-gearman-worker-go/blob/b02a9dc3660ee23cb411afeec19eeda13558ecd7/worker.go#L207, which retries each dupserver 120 times and waits 1 second between attempts.

We were planning to downgrade to omd 2.80 and return to the C worker, but we can't find the Debian package in the repo anymore :(

Have we missed a way of configuring this to make it work?

Thanks, Rob

sni commented 5 years ago

The C worker is still included in the omd package; it's just not the default anymore. Simply remove the -go suffix in the etc/init.d/gearman_worker start script.

infraweavers commented 5 years ago

Ah excellent, we'll give that a shot in a minute. Do you have any suggestions on how to resolve the behaviour in the Go worker? We were considering making the 120 retries and the 1 second wait configurable, so that we could set them to 0 or 1 retries and 100ms.

sni commented 5 years ago

I am fine with simply removing the retries on dupserver completely.