cbft manager issues? not creating plan pindexes?

mschoch commented 9 years ago

I wanted to watch the stats, which is most interesting when first indexing a bucket. I already had an index, so I first deleted it. First sign of trouble: the page never refreshed after the delete like it normally does. I manually refreshed, and it showed 0 indexes, so I continued. I created a new index and gave it a different name. Second sign of trouble: the page again never refreshed, so I manually went back to the list of indexes, saw the new index, and clicked on it. The count shows "error", and pressing refresh doesn't help. The UI/logs show:

pindex: no planPIndexes for indexName: bx

Tried to kick the manager; again the page never refreshed, and the logs never mentioned the kick message I had provided. At some point Chrome started to complain "waiting for available socket". This happens when the browser has already used up the 6 connections it will make to a single server. To me this suggested that we now had at least 6 asynchronous requests from JavaScript waiting for something. Closing that tab and starting over in another tab allowed the UI to work again.

A copy of the config was captured here: https://gist.github.com/mschoch/731b6ef28d01ba02809a

Eventually I killed ns_server, brought everything back up, and cbft is now behaving normally again. It even proceeded to build the index I had defined before, which had failed to build the last time.

steveyen commented 9 years ago

From the gist snapshot of the config, it looks like the planner never got around to running.

So, that log entry of "no planPIndexes for indexName: bx" wasn't lying! :-)

Related: in chat, @mschoch mentioned he was not seeing signs of the infinite "forever refreshCluster" issue at the time of the weirdness.

steveyen commented 9 years ago

Another thought, on "tried to kick the manager"... sounds like the kicking was attempted via the browser, which might have already been in some strange state.

If this happens again, might be worth trying a kick via the cmd-line, such as via...

curl -X POST http://localhost:9200/api/managerKick

...and see if that gets the planner to, you know, do some actual PIndex planning.
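To make a command-line kick easy to spot in the logs, it can help to attach a unique message and check the HTTP status that comes back. A minimal sketch, assuming the kick endpoint accepts a `msg` query parameter (suggested by the "kick message" mentioned above, but not confirmed here); `kick_manager` and `CBFT_URL` are hypothetical names used only for illustration:

```shell
# Base URL taken from the curl example above; override if your node differs.
CBFT_URL="${CBFT_URL:-http://localhost:9200}"

# kick_manager MSG: POST a manager kick and print only the HTTP status code.
# ASSUMPTION: the endpoint reads a "msg" query parameter. MSG should not
# contain spaces or other characters that need URL encoding.
kick_manager() {
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 \
    -X POST "${CBFT_URL}/api/managerKick?msg=$1"
}
```

Calling `kick_manager "kick-$(date +%s)"` and then searching the cbft log for that exact string shows whether the manager ever saw the kick. The `--max-time` limit keeps the command from hanging the way the browser requests did.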

And, another question: were you also using metakv during this scenario?

Finally, I'm thinking for the future, instead of all the back-&-forth question ping pong, maybe the more efficient way for any of these mysterious situations is just to capture all the diagnostics from the cbft node, via...

curl http://localhost:9200/api/diag

...and gist it up somewhere. Warning, the "diag" output is a remarkably big JSON output, as it was meant to capture everything I could think of. Will be fun to keep on improving "diag" with more stuff as we gain more battle scars from mysterious situations like this.
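Since the output is large, saving it straight to a timestamped file is usually easier than scrolling it in a terminal. A minimal sketch, where `capture_diag`, `CBFT_URL`, and the file naming scheme are hypothetical choices for illustration rather than anything cbft provides:

```shell
# Base URL taken from the curl example above; override if your node differs.
CBFT_URL="${CBFT_URL:-http://localhost:9200}"

# capture_diag [DIR]: save the /api/diag output into DIR (default: current
# directory) under a timestamped name, then print the path of the saved file.
capture_diag() {
  diag_out="${1:-.}/cbft-diag-$(date +%Y%m%dT%H%M%S).json"
  # --fail makes curl exit non-zero on an HTTP error instead of saving the error body.
  curl -sS --fail --max-time 120 -o "$diag_out" "${CBFT_URL}/api/diag" || return 1
  echo "$diag_out"
}
```

Running `capture_diag /tmp` while the node is still misbehaving, before any restart, preserves the state that a later restart would wipe out.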

steveyen commented 8 years ago

I did fix some bugs related to pindex deletion in the manager hashtables, so I suspect that might be related. Closing this on the hope that was it.

steveyen commented 8 years ago

Oh, and also the fix was in the couchbase cbft/cbgt repos rather than on couchbaselabs.

couchbaselabs / cbft

cbft manager issues? not creating plan pindexes? #162