mschoch opened 9 years ago
Seems like the exponential backoff function in cbdatasource is the most likely place. Still digging, any thoughts, @steveyen?
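For reference, here's a minimal sketch of the kind of exponential backoff loop in question (the names `backoffLoop`, `startSleep`, `maxSleep`, and `factor` are illustrative, not cbdatasource's actual API). A misconfigured factor or cap in a loop like this degenerates into a tight retry spin, which could explain a flood of requests:

```go
package main

import (
	"fmt"
	"time"
)

// backoffLoop is a hypothetical sketch: retry attempt() with an
// exponentially growing sleep, capped at maxSleep. If factor were ever
// <= 1 (or the sleep durations were zero), this would retry in a tight
// loop and hammer the server.
func backoffLoop(attempt func() error, startSleep, maxSleep time.Duration, factor float64) {
	sleep := startSleep
	for {
		if err := attempt(); err == nil {
			return
		}
		time.Sleep(sleep)
		sleep = time.Duration(float64(sleep) * factor)
		if sleep > maxSleep {
			sleep = maxSleep
		}
	}
}

func main() {
	tries := 0
	backoffLoop(func() error {
		tries++
		if tries < 5 {
			return fmt.Errorf("not yet")
		}
		return nil
	}, 100*time.Millisecond, 2*time.Second, 2.0)
	fmt.Println("succeeded after", tries, "tries")
}
```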
It was repeatable for a while... but as I added code to try to track it down, it has now stopped happening...
Also, I'm not entirely sure cbft is to blame. It seems like it could be anything using go-couchbase (or a fork of go-couchbase)
Yeah, that logging message text of "Go 1.1 package http" is throwing me off a bit.
When I try it on my standalone dev build of cbft (where I point it at couchbase, create a bleve index, kill ns-server), it works ok without spamming cbft's log output. But that's just using the simple Cfg provider.
Perhaps metakv? (Still, why would it be go 1.1?)
btw, one way to find out your go version from cbft is via...
curl http://localhost:8095/api/runtime
Some limited docs on those diagnostic REST endpoints in cbft are here: http://labs.couchbase.com/cbft/api-ref/#node-diagnostics
One final clue is to figure out who logs/printf's in that kind of format. cbft's logs currently normally look like...
2015/07/29 07:57:13 janitor: adding pindex: bs_2dd16fba4b47b833_ecb637fa
I have looked into the odd User-Agent. That is what Go sends by default. It does NOT correspond to the Go version.
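You can see this for yourself with a small self-contained sketch against the standard library (not cbft code): in older Go releases the built-in default User-Agent was the literal string "Go 1.1 package http", no matter which Go version was actually running.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Test server that echoes back the User-Agent header it received.
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, r.UserAgent())
		}))
	defer srv.Close()

	// No User-Agent set explicitly, so net/http fills in its built-in
	// default string when the request is sent.
	req, err := http.NewRequest("GET", srv.URL, nil)
	if err != nil {
		panic(err)
	}

	// Overriding it is just a header set, e.g.:
	//   req.Header.Set("User-Agent", "cbft")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Print(string(body)) // the default User-Agent string
}
```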
Regarding these messages, this is ns_server logging the requests coming in.
I have separately opened up #161 because cbft messages don't end up anywhere.
OK, I've added a line "refreshing cluster" and confirmed this is where it's happening.
It's easiest to repeat when I create an index, let it finish, and kill all the servers. Wait 1 minute, then bring everything back up.
It seems like it might get unstuck eventually.
NOTE: it seems like about 50 or so of these messages in a burst is NORMAL. What I'm reporting is a continuous stream of them for > 1 minute.
I just checked, in the most recent version of this, the cluster was refreshed 1927 times before it stopped.
Taking a look at the cbdatasource codepaths around refreshCluster, there's no obvious smoking gun.
One thought, though: there's no "de-bounce"... that is, if there's a storm of refreshCluster requests, they all get run. Instead, concurrent refreshCluster requests that have piled up on the channel should all be coalesced and cleared by the next actual refreshCluster.
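Roughly what I mean, as a sketch (refreshCh, doRefresh, and the buffer size here are hypothetical, not the actual cbdatasource code):

```go
package main

import (
	"log"
	"time"
)

// doRefresh stands in for the actual cluster-refresh work; the name is
// hypothetical, not cbdatasource's real API.
func doRefresh() {
	log.Println("refreshing cluster")
	time.Sleep(100 * time.Millisecond) // simulate work
}

func main() {
	// Buffered channel lets callers signal a refresh without blocking.
	refreshCh := make(chan struct{}, 100)

	go func() {
		for range refreshCh {
			// Drain whatever piled up while we were busy, so a storm of
			// signals coalesces into a single refresh.
		drain:
			for {
				select {
				case <-refreshCh:
					// coalesced away
				default:
					break drain
				}
			}
			doRefresh()
		}
	}()

	// Simulate a storm of refresh requests; with the drain loop above,
	// doRefresh runs once or twice instead of 50 times.
	for i := 0; i < 50; i++ {
		refreshCh <- struct{}{}
	}
	time.Sleep(time.Second)
}
```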
Another thought, was metakv being used? If so, perhaps the brand new CfgMetaKV codepaths need some shake out. In particular, perhaps the cfg change subscription codepaths might be fruitful.
Finally, the next time this happens, grabbing some of the stats counters would be helpful, which can be done via...
curl http://localhost:9200/api/stats
...where 9200 is the cbft PORT number.
I notice this sometimes when cbft is started from ns_server. It seems to NOT happen when I first create an index, but rather after I kill ns_server (and thus everything else) and later restart it. It seems like cbft isn't happy with something it sees, so it keeps retrying. It burns a lot of CPU, to the point that I eventually hear the fan.
On the console I see: