Closed: zgramana closed this issue 9 years ago
After speaking to @steveyen, it appears there are a few options:
@steveyen -- need some help debugging this.
I have two sync gateways running, and the CFG is currently:
Immediately after killing one of the Sync Gateways, the CFG is:
https://gist.github.com/tleyden/c5e1c452fa3e95a6900c
and in the remaining Sync Gateway, the following is shown in the logs:
https://gist.github.com/tleyden/0bc928851cf96343f536
and the final CFG is:
https://gist.github.com/tleyden/e9b481d63f3fb4144b81
the full logs for SG1 are here:
Hi Traun, Looks like the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_KNOWN worked, but the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work.
Are the calls to cbgt.CfgRemoveNodeDef() returning any errors?
Or, if it isn't there already, could you add some error logging for more diagnosis?
More info on the two calls with NODE_DEFS_KNOWN and NODE_DEFS_WANTED is here...
Here's the code:
it has error checking, and I don't see the error being emitted.
invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work.
btw, how did you determine that? Let me know if you have any tips on how to debug aside from checking the errors.
WHOOPS, I see a bug.
Should be
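// Assumes kinds is a []string such as {cbgt.NODE_DEFS_KNOWN, cbgt.NODE_DEFS_WANTED};
// ranging over a slice yields (index, value), so the value is taken as kind.
for _, kind := range kinds {
	log.Printf("call cbgt.CfgRemoveNodeDef with nodeuuid: %v cfg: %+v", nodeUuid, h.Cfg)
	if err := cbgt.CfgRemoveNodeDef(
		h.Cfg,
		kind, // <-- the fix: pass the loop variable here
		nodeUuid,
		h.CbgtVersion,
	); err != nil {
		log.Printf("Warning: attempted to remove %v (%v) from CBGT but failed: %v", nodeUuid, kind, err)
	}
}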
On the debugging, from the Cfg snapshots that you had gist'ed, I saw that one node had disappeared from nodesKnown but not from the nodesWanted section of the Cfg.
@steveyen - The autofailover is still not working -- here are the steps to reproduce:
Actual: sg1 is only getting DCP updates for vbuckets where vbucket id >= 512, but nothing for vbucket id < 512, which are the vbuckets that sg2 was handling.
Expected: sg1 should be getting DCP updates for all vbuckets after sg2 was shut down.
What debug info can I collect?
First suspicion... looks like a legit cbgt bug (!), as I just saw something similar with a cbft distributed cluster.
Yeah, I think there's maybe a bug in the cbgt code that recalculates the DCP streams (CalcFeedsDelta). Looking through it.
Please see CBGT fix, commit a119e02 which corrects the CalcFeedsDelta for the scenario of a shrinking cluster.
Cool! Thanks, will try it and let you know.
As documented here, couchbaselabs/cbgt#22,
CBGT does provide a way to do this via an API. However, that would only apply for SG nodes taken down intentionally.