couchbase / sync_gateway

Manages access and synchronization between Couchbase Lite and Couchbase Server
https://www.couchbase.com/products/sync-gateway

CBGT: Dead node detection autofailover #1075

Closed zgramana closed 9 years ago

zgramana commented 9 years ago

As documented in couchbaselabs/cbgt#22, CBGT does provide a way to do this via an API. However, that would only apply to SG nodes taken down intentionally.

tleyden commented 9 years ago

After speaking to @steveyen, it appears there are a few options:

tleyden commented 9 years ago

Requirements

tleyden commented 9 years ago

@steveyen -- need some help debugging this.

I have two sync gateways running, and the CFG is currently:

https://gist.githubusercontent.com/tleyden/bffa2eae29f1dfa9ece8/raw/60c8528ad34da2c09da9d4465cf48153e0344967/gistfile1.txt

Immediately after killing one of the Sync Gateways, the CFG is:

https://gist.github.com/tleyden/c5e1c452fa3e95a6900c

and in the remaining Sync Gateway, the following is shown in the logs:

https://gist.github.com/tleyden/0bc928851cf96343f536

and the final CFG is:

https://gist.github.com/tleyden/e9b481d63f3fb4144b81

the full logs for SG1 are here:

https://gist.github.com/tleyden/f956a682dfc7f659c8da

steveyen commented 9 years ago

Hi Traun, Looks like the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_KNOWN worked, but the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work.

Are the calls to cbgt.CfgRemoveNodeDef() returning any errors?

Or, if not already done, could you add some error logging for more diagnosis?

More info on the two calls with NODE_DEFS_KNOWN and NODE_DEFS_WANTED is here...

https://github.com/couchbaselabs/cbgt/issues/26

tleyden commented 9 years ago

Here's the code:

https://github.com/couchbase/sync_gateway/blob/feature/distributed_index_autofailover/src/github.com/couchbase/sync_gateway/base/sgw_pindex.go#L306-L326

it has error checking, and I don't see the error being emitted.

"invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work."

btw, how did you determine that? Let me know if you have any tips on how to debug aside from checking the errors.

tleyden commented 9 years ago

WHOOPS, I see a bug.

Should be

for _, kind := range kinds { // assumes kinds is a slice of node-def kind strings
    log.Printf("call cbgt.CfgRemoveNodeDef with nodeuuid: %v cfg: %+v", nodeUuid, h.Cfg)
    if err := cbgt.CfgRemoveNodeDef(
        h.Cfg,
        kind, // <-- the fix: use the loop variable here
        nodeUuid,
        h.CbgtVersion,
    ); err != nil {
        log.Printf("Warning: attempted to remove %v (%v) from CBGT but failed: %v", nodeUuid, kind, err)
    }
}

steveyen commented 9 years ago

On the debugging, from the Cfg snapshots that you had gist'ed, I saw that one node had disappeared from nodesKnown but not from the nodesWanted section of the Cfg.
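The same comparison can be done mechanically rather than by eye. The following is only a minimal sketch: it assumes the node UUIDs have already been extracted from the nodesKnown and nodesWanted sections of a Cfg snapshot into plain string slices, and the helper name `onlyIn` and the UUIDs `sg1` and `sg2` are hypothetical, not part of cbgt.

```go
package main

import (
	"fmt"
	"sort"
)

// onlyIn returns the UUIDs that appear in a but not in b, sorted.
func onlyIn(a, b []string) []string {
	inB := make(map[string]bool, len(b))
	for _, u := range b {
		inB[u] = true
	}
	var out []string
	for _, u := range a {
		if !inB[u] {
			out = append(out, u)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical UUIDs standing in for the two Cfg sections after the kill.
	known := []string{"sg1"}
	wanted := []string{"sg1", "sg2"}

	fmt.Println("wanted but not known:", onlyIn(wanted, known))
	fmt.Println("known but not wanted:", onlyIn(known, wanted))
}
```

A non-empty result in either direction suggests that one of the two removal calls did not take effect.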

tleyden commented 9 years ago

@steveyen - The autofailover is still not working -- here are the steps to reproduce:

Actual: sg1 is only getting DCP updates for vbuckets where vbucket id >= 512, and nothing for vbucket id < 512, which are the vbuckets that sg2 was handling.

Expected: sg1 should be getting DCP updates for all vbuckets after sg2 was shut down.

What debug info can I collect?
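One possible piece of debug info is a summary of which vbucket ranges never show up in the DCP updates. This is a rough sketch, assuming the vbucket ids seen in the logs have been gathered into a set and that the bucket has 1024 vbuckets in total (an assumption; the thread only mentions the split at 512). The function `missingRanges` is a hypothetical helper, not part of Sync Gateway or cbgt.

```go
package main

import "fmt"

// missingRanges returns inclusive [lo, hi] ranges of vbucket ids
// in [0, total) that are absent from seen.
func missingRanges(seen map[int]bool, total int) [][2]int {
	var out [][2]int
	start := -1 // first id of the current gap, or -1 when not inside a gap
	for id := 0; id < total; id++ {
		switch {
		case !seen[id] && start < 0:
			start = id
		case seen[id] && start >= 0:
			out = append(out, [2]int{start, id - 1})
			start = -1
		}
	}
	if start >= 0 {
		out = append(out, [2]int{start, total - 1})
	}
	return out
}

func main() {
	const total = 1024 // assumed vbucket count
	seen := make(map[int]bool)
	// Reproduce the reported symptom: updates only for ids >= 512.
	for id := 512; id < total; id++ {
		seen[id] = true
	}
	fmt.Println("missing vbucket ranges:", missingRanges(seen, total))
}
```

With the symptom described above, the only gap reported is 0 through 511, which matches the vbuckets sg2 was handling.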

steveyen commented 9 years ago

First suspicion... looks like a legit cbgt bug (!), as I just saw something similar with a cbft distributed cluster.

steveyen commented 9 years ago

Yeah, I think there may be a bug in the cbgt code that recalculates the DCP streams (CalcFeedsDelta). Looking through it.

steveyen commented 9 years ago

Please see the CBGT fix, commit a119e02, which corrects CalcFeedsDelta for the scenario of a shrinking cluster.
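For readers unfamiliar with the idea, a feeds delta can be pictured roughly as follows. This is an illustrative sketch only, not cbgt's CalcFeedsDelta and not the content of the fix: the function `feedsDelta` and the feed names are hypothetical. It shows that when the cluster shrinks, the surviving node's wanted feeds grow, so the delta should contain feeds to start.

```go
package main

import (
	"fmt"
	"sort"
)

// feedsDelta compares the feeds a node currently runs with the feeds it
// should run, and returns the feeds to start (add) and to stop (remove).
func feedsDelta(current, wanted []string) (add, remove []string) {
	cur := make(map[string]bool, len(current))
	for _, f := range current {
		cur[f] = true
	}
	want := make(map[string]bool, len(wanted))
	for _, f := range wanted {
		if !cur[f] && !want[f] {
			add = append(add, f)
		}
		want[f] = true
	}
	for _, f := range current {
		if !want[f] {
			remove = append(remove, f)
		}
	}
	sort.Strings(add)
	sort.Strings(remove)
	return add, remove
}

func main() {
	// Hypothetical feed names: sg1 ran the upper half and, after sg2 is
	// removed, should also pick up the lower half.
	current := []string{"vb512-1023"}
	wanted := []string{"vb0-511", "vb512-1023"}

	add, remove := feedsDelta(current, wanted)
	fmt.Println("start:", add)
	fmt.Println("stop:", remove)
}
```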

tleyden commented 9 years ago

Cool! Thanks, will try it and let you know.

tleyden commented 9 years ago

It works now!

Here are the Sync Gw logs: