ConSol-Monitoring / mod-gearman-worker-go

Mod-Gearman Worker rewrite in Golang
GNU General Public License v3.0

Host checks get orphaned and "lost" under 1.2.3 #19

Status: Closed (infraweavers closed 1 year ago)

infraweavers commented 1 year ago

Hiya,

We've just upgraded to OMD 5 (which includes 1.2.3) and we are finding that our host checks get "stuck" in the running state according to naemon (i.e. the spinner is shown). After a period of time, chunks of hosts then go "down" with (host check orphaned, is the mod-gearman worker on queue 'host' running?), even though the worker is running.

We downgraded to 4.40 (where we came from) and everything is good. We then upgraded to 4.60 and it behaves the same way there, so presumably the change is somewhere between 1.1.5 and 1.2.1. We have also found that if we copy the mod_gearman_worker-go binary from 4.40 and symlink it in, the behaviour goes away, so we're pretty confident it's the worker:

OMD[default@OMDA02]:~/bin$ ls -la mod*
-rwxr-xr-x 1 root    root      18800 Aug 12 18:49 mod_gearman_mini_epn*
-rwxr-xr-x 1 root    root     131576 Aug 12 18:49 mod_gearman_worker*
lrwxrwxrwx 1 root    root         27 Feb  2 16:50 mod_gearman_worker-go -> mod_gearman_worker-go-1.1.5*
-rwxr-xr-x 1 default default 8604472 Feb  2 15:56 mod_gearman_worker-go-1.1.5*
-rwxr-xr-x 1 root    root    9027672 Aug 12 18:49 mod_gearman_worker-go.1.2.1*

The "stuck" checks always seem to have a passive result submitted for them: [image]

That indicates to me that the problem has something to do with dupserver. I think the main check results and the dupserver results are getting "mixed up" somehow, and that is causing this behaviour.

Further evidence that the worker is the cause is that the worker.log contains things like

incoming host job: handle: H:<blah>:4126407 - host: <blah> - service:

But when the behaviour is happening, we never see the corresponding job: H:<blah>:4126407 finished line, which suggests the jobs are going missing.
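A rough way to find such jobs in worker.log is to pair each incoming handle with its finished line. The sketch below assumes the two message formats quoted above and nothing else about the log; the regular expressions and the unfinishedHandles helper are hypothetical and not part of the worker:

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Both patterns are assumptions derived from the two log lines quoted in
// this issue; adjust them if the real worker.log format differs.
var (
	incomingRe = regexp.MustCompile(`incoming \w+ job: handle: (\S+)`)
	finishedRe = regexp.MustCompile(`job: (\S+) finished`)
)

// unfinishedHandles returns, sorted, every handle that was logged as
// incoming but never logged as finished.
func unfinishedHandles(log string) []string {
	pending := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if m := incomingRe.FindStringSubmatch(line); m != nil {
			pending[m[1]] = true
		} else if m := finishedRe.FindStringSubmatch(line); m != nil {
			delete(pending, m[1])
		}
	}
	out := make([]string, 0, len(pending))
	for handle := range pending {
		out = append(out, handle)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up sample lines in the assumed format.
	sample := "incoming host job: handle: H:example:1 - host: a - service:\n" +
		"job: H:example:1 finished\n" +
		"incoming host job: handle: H:example:2 - host: b - service:\n"
	fmt.Println(unfinishedHandles(sample))
}
```

With the sample input, only H:example:2 is reported, because it has no matching finished line. For a real log the same function could be fed the file contents instead of the sample string.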

We're going to build 1.3.0 this morning and see if that still shows the behaviour, if it does then we'll start working through the builds to find the breaking release/commit.

sni commented 1 year ago

strange, have you tried increasing the loglevel?

infraweavers commented 1 year ago

> strange, have you tried increasing the loglevel?

Not yet, we'll do that whilst we play with versions and stuff

infraweavers commented 1 year ago

We have got some logs for when it is happening, they are 45MB compressed; so they're possibly a little difficult to run through.

We've just chopped through the releases and the issue appears between 1.1.6 and 1.2.0; we'll start running through the commits now

infraweavers commented 1 year ago

So running through the commits:

Working: https://github.com/ConSol/mod-gearman-worker-go/commit/4f07216106d9cbf5f0431bffc879fe904188d131
Seems to break gearman itself: https://github.com/ConSol/mod-gearman-worker-go/commit/ba38fa5607a5270e22e7b56a4eabcebc3dfe4815
Causes orphans: https://github.com/ConSol/mod-gearman-worker-go/commit/c3b10fa98a57dd6e9f48b714d7e08d505220d26e

So I'm thinking the problem was introduced somewhere around the last 2 of those commits.

We used https://github.com/ConSol/mod-gearman-worker-go/compare/v1.1.6...v1.2.0 to get the list of commits and started from the oldest. We haven't tested any commits after c3b10fa98a57dd6e9f48b714d7e08d505220d26e yet, so a later commit could also fix it.

infraweavers commented 1 year ago

I've had a very quick read through the code this morning and I think the problem is that the active result and the duplicate result share the same instance of item. I believe this wasn't a problem initially because everything was synchronous. Now that results for both are queued, there is a chance that they will be processed out of order and the dupserver will run before the ServerResult.

https://github.com/ConSol/mod-gearman-worker-go/blob/master/worker.go#L218

I think this would explain why the hanging items that come onto the main server are being shown as passive as well (I think anyway).

https://github.com/ConSol/mod-gearman-worker-go/blob/master/dupserver.go#L85
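To illustrate the suspected effect, here is a minimal sketch. It is not the worker's actual code: the answer type, sendDuplicate and mainSeesPassive are hypothetical names, and it assumes the duplicate path marks the item as passive before sending. The duplicate queue is drained first on purpose, to model the out-of-order case deterministically:

```go
package main

import "fmt"

// answer is a hypothetical stand-in for the worker's result item.
type answer struct {
	host    string
	passive bool
}

// sendDuplicate mimics a duplicate-server sender that flags the result as
// passive before sending it (an assumption, see the lead-in).
func sendDuplicate(a *answer) {
	a.passive = true
}

// mainSeesPassive queues one result for both senders, lets the duplicate
// sender run first, and reports whether the main sender then sees the
// result as passive.
func mainSeesPassive(copyForDup bool) bool {
	mainQ := make(chan *answer, 1)
	dupQ := make(chan *answer, 1)

	item := &answer{host: "example"}
	mainQ <- item
	if copyForDup {
		dup := *item // value copy: the duplicate sender gets its own instance
		dupQ <- &dup
	} else {
		dupQ <- item // same pointer ends up in both queues
	}

	sendDuplicate(<-dupQ)      // duplicate sender happens to run first
	return (<-mainQ).passive   // what the main sender would submit
}

func main() {
	fmt.Println("shared pointer:", mainSeesPassive(false))
	fmt.Println("copied item:", mainSeesPassive(true))
}
```

With the shared pointer the main sender sees the item as passive; with a copy it does not. If the real item held slices, maps or pointers, a plain value copy would not be enough and those fields would need copying too.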

sni commented 1 year ago

good point, that would explain the passive result. But why would they go missing completely?

infraweavers commented 1 year ago

I'm guessing that they don't match because they've been changed to passive? I'm not sure where that matching lives; I guess in mod_gearman.