Closed reugn closed 2 years ago
The PR is functionally accurate, but includes one extra unnecessary index++ increment. It will still be accurate if this extra index++ increment is removed.
I'd say the older algorithm would distribute the load better considering we look for a random node. We should probably use the XORShift and just select the first index randomly to do it even better.
I think the method name getRandomNode() is a little misleading because the real intention is to distribute single node requests evenly across the cluster. getRoundRobinNode() would have been more accurate.
I don't have a strong opinion on the algorithms because they are identical in the most common case where all nodes are active in the cluster. Nodes are only set to inactive just prior to be being removed, so the lag is in milliseconds. Falling into this inactive node logic is extremely rare.
The active node flag's main use is in partition maps where the lag between inactive nodes and other nodes claiming partition ownership of the inactive node's partitions is much greater.
This is accurately the proposal of this PR.