Open rzezeski opened 12 years ago
We had the same problem in the verify_listkeys riak test. In our test, we have just switched from waiting for the service on that node to waiting until the node watcher knows about the service on all nodes in the cluster.
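To illustrate the kind of wait described above, here is a minimal sketch in Python. It is not the riak_test code: `wait_until`, `service_known_everywhere`, and the `watcher_views` dictionary are hypothetical names standing in for polling each node's watcher. The idea is that the check passes only once every node's watcher sees the service on every node in the cluster.

```python
import time

def wait_until(predicate, retries=20, delay=0.05):
    """Poll predicate until it returns True or the retries run out."""
    for _ in range(retries):
        if predicate():
            return True
        time.sleep(delay)
    return False

def service_known_everywhere(watcher_views, cluster, service):
    """watcher_views (hypothetical) maps node -> {service: set of nodes that
    this node's watcher currently sees running the service}."""
    expected = set(cluster)
    return all(
        watcher_views[node].get(service, set()) >= expected
        for node in cluster
    )

cluster = ["dev1", "dev2", "dev3", "dev4"]

# dev4 has just joined: its own watcher only knows about itself so far.
views = {n: {"riak_kv": set(cluster)} for n in ["dev1", "dev2", "dev3"]}
views["dev4"] = {"riak_kv": {"dev4"}}
ready_before = service_known_everywhere(views, cluster, "riak_kv")

# Simulate dev4's watcher catching up, then wait on the cluster-wide check.
views["dev4"] = {"riak_kv": set(cluster)}
ready_after = wait_until(
    lambda: service_known_everywhere(views, cluster, "riak_kv"), delay=0
)
```

Waiting only for the local service would have passed in the first state even though dev4's view of the cluster was still incomplete.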
+1
What should we do here?
Not sure yet. Writing a riak test to prove or disprove that it could be an issue for a user would be the first step.
Time for some belated repo curating ... is there any plan to merge this patch & have a test for Riak 2.0?
I don't believe there is a patch for this @slfritchie
There is no patch for this and this issue no longer affects Yokozuna because it now works differently. That said, I'm fairly certain this is still something that could bite people in the future but I'm not sure it's enough of a concern to do anything about for the foreseeable future.
ok, moving to milestone 2.1 for now
Today I have been able to reproduce the error below on every run of the yokozuna test. Currently, during yokozuna application startup, a KV get is performed. This get is performed after it waits for the `riak_kv` service but the preflist still comes up empty, i.e. `[]`.
The issue is that at the moment the get is performed dev4 (which has just joined a 3-node cluster) owns none of the ring but is the only node in the `riak_core_node_watcher`. Thus `UpNodes` will consist of only dev4 and nothing from the ring will match, ultimately resulting in an empty preflist. The second paste below shows a print from riak_core_apl showing the ring with no dev4 owner and a node watcher with only dev4.
So the node watcher lags behind the ring update, causing the node to temporarily have a `[]` preflist for everything. I'm not sure this is easily solvable. We could add another stage to a node transition that lets it get ready before taking on requests but I'm not really sure what that all entails. I just wanted to dump my findings here before I lost the motivation to do so.
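The mechanism described above can be shown with a toy model in Python. This is a simplified sketch, not the riak_core_apl implementation: the `preflist` function, the 8-partition ring, and `n_val=3` are assumptions made for the example. It only captures the one property that matters here, that ring owners are filtered against the set of nodes the local watcher considers up.

```python
# Toy model of the race; all names are hypothetical, not riak_core APIs.

def preflist(ring_owners, up_nodes, start, n_val=3):
    """Take n_val consecutive partitions starting at index `start` and keep
    only those whose owner is in up_nodes."""
    size = len(ring_owners)
    candidates = [(start + i) % size for i in range(n_val)]
    return [(p, ring_owners[p]) for p in candidates if ring_owners[p] in up_nodes]

# Ring as seen on dev4 right after joining: dev4 owns no partitions yet.
ring_owners = ["dev1", "dev2", "dev3", "dev1", "dev2", "dev3", "dev1", "dev2"]

# dev4's node watcher has only registered dev4 itself so far.
lagging = preflist(ring_owners, up_nodes={"dev4"}, start=0)

# Once the watcher has heard from the other nodes, the same lookup succeeds.
caught_up = preflist(
    ring_owners, up_nodes={"dev1", "dev2", "dev3", "dev4"}, start=0
)

print(lagging)    # []
print(caught_up)  # [(0, 'dev1'), (1, 'dev2'), (2, 'dev3')]
```

In this model no owner in the ring is ever in the up set while the watcher lags, so every key gets an empty preflist regardless of where it hashes.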
Error
riak_core_apl print out