Open ShiKaiWi opened 4 years ago
It seems this condition is wrong(https://github.com/buraksezer/consistent/blob/master/consistent.go#L163):
if count >= len(c.sortedSet) {
Is it reanable to remove equal?:
if count > len(c.sortedSet) {
Hello,
Your configuration is invalid. So this is the expected behavior. You should use something like that:
// Create a new consistent instance
cfg := consistent.Config{
PartitionCount: 7,
ReplicationFactor: 20,
Load: 1.25,
Hasher: hasher{},
}
Configuration section in README file explains these variables https://github.com/buraksezer/consistent#configuration
If you explain your use case, I may help further for the configuration.
Thanks!
I have checked the definition of variables in consistent.Config
but I still can't understand why the case is wrong:
cfg := consistent.Config{
PartitionCount: 1,
ReplicationFactor: 1,
Load: 1,
Hasher: hasher{},
}
c := consistent.New(nil, cfg)
node1 := myMember("node1")
c.Add(node1)
One node with one replication for one partition seems reasonable.
@buraksezer is it possible for you to fix what @ShiKaiWi is reporting? i'm running into a similar issue as well. this patch seems reasonable and that's what i did to keep going.
we may want to run (unit/functional) tests where we try to squeeze a small number of partitions on a single node and the logic here does seem incorrect. the count
that checks to see if we have exhausted all nodes
should be incremented only after we check to see if the partition cannot be accommodated - not before. either the count
should be incremented at the end of the loop or the patch that @ShiKaiWi provides should be done.
if you wish, i can create a pr as well. please let us know.
@sriramch firstly, I'm so sorry for the late response. The reported configuration by @ShiKaiWi was invalid. Why do you want to use a consistent hash function if you have only one partition? ReplicationFactor
and Load
parameters should be bigger than 1. Furthermore, these parameters should be chosen wisely.
If you provide a sample configuration, I may understand what the actual problem is. I'm OK with the changes, if it fixes a real problem.
@buraksezer thanks for responding!
as i mentioned earlier, we may want to run unit/functional tests with a smaller number of partitions on one or few nodes with just 1 replica to simulate some failure scenarios. for instance, the following would panic.
here the PartitionCount
and the Load
are indeed > 1.
numNodes := 2 // 1 or 2
numReplicas := 1
members := []consistent.Member{}
for i := 0; i < numNodes; i++ {
member := Member(fmt.Sprintf("node%d.olricmq", i))
members = append(members, member)
}
cfg := consistent.Config{
PartitionCount: 10, // partition range for modulo
ReplicationFactor: numReplicas, // how many times each member is replicated
Load: 1.1,
Hasher: murmur3_hasher{},
}
c := consistent.New(members, cfg)
my earlier comment explains the cause and a potential fix. why can't it be accommodated, and why is it invalid to have a non trivial number of partitions on one or a couple of nodes with one replica? why is this incorrect conceptually?
shouldn't the count
(that tracks if we have exhausted all nodes) be incremented only after checking to see if a partition cannot be accommodated on the node as opposed to before as it is currently done?
@ShiKaiWi @sriramch I think @buraksezer is right.
math.Ceil(c * m / n)
. On the other hand though I recommend we can pass an error instead of panicking. We can also check in the constructor if the parameters is valid.count
is incremented before. This means only n-1
nodes is examined before exhaustion. Meanwhile for @buraksezer I have a mathematical question on the parameter settings. Is it possible to derive mathematically that a given the values for parameters ReplicaFactor
, PartitionCount
, and Load
would be valid or invalid?
In addition to that, what would be a good parameter set for production environment? (good as in, just valid. performant wise I can definitely benchmark test). Let's say on average, I will have 5.6 members. Variance 1.2 memberSquare.
@iqDF - thanks for your comments.
If you want a bounded load you cannot put whatsoever edge cases that breaks the rule
math.Ceil(c * m / n)
that is an implementation detail. as i asked earlier, conceptually why is distributing a non trivial number of partitions over one or a couple of nodes with one replica incorrect (with a load factor > 1)?
But does fixing (2) also fixes the panic in (1)?
it should and i have mentioned it here as well last line in the 2nd paragraph
There are plenty of other parameter configuration that won't fix the panic regardless we fix the counter or not because the underlying parameters just doesn't make sense for load balancing.
can you please provide some examples? i was under the impression that we should be able to accommodate p
partitions on n
nodes with r
replicas (for values of p/n/r
that conceptually makes sense) by adjusting the load factor
I am hitting this same issue, config is:
cfg := consistent.Config{
PartitionCount: 100,
ReplicationFactor: 1
Load: 1.25,
Hasher: hasher{}
}
I add a single member, and get the same panic as above.
The panic occurs because replicationFactor is 1, so, after a single member is added there is only one entry in c.sortedSet.
This means count is always equal to len(sortedSet):
count++
if count >= len(c.sortedSet) {
// User needs to decrease partition count, increase member count or increase load factor.
panic("not enough room to distribute partitions")
}
And it always panics. Note that here the load has not been exceeded - load = 100, maximum load = 125.
What am I doing wrong here?
Hey, @purplefox ReplicationFactor
should be bigger than 1. Why do you want to set 1 to ReplicationFactor
? Setting a ReplicationFactor
bigger than 1 is an essential factor to distribute load evenly between members.
I think we should add a check that validates the configuration.
I am creating a non replicated, but distributed cache. There are multiple members in the cluster and a particular key should live on exactly one member. I am using consistent hashing to determine which member the key lives on. A key only lives on one member so it is not replicated, so replication factor is 1. This seems like a very common use case for consistent hashing (e.g. redis and many other things)
ReplicationFactor
is just an implementation detail of this package. Replicating data is your project's implementation detail. You can use the default ReplicationFactor
value and LocateKey
method to find a host for the key.
You should know that this package implements a modified version of this algorithm for Olric.
Thanks. If replication factor does not mean what I thought it meant, what does it represent?
It's related to the consistent hashing technique: https://ably.com/blog/implementing-efficient-consistent-hashing
I think you're referring to a technique in consistent hashing where a server node is added multiple times on the ring (not just once) in order to make distribution more uniform. I am familiar with this technique but never heard it called "replication factor" before. IMO using the term "replication factor" is somewhat confusing as that term already. has a well known meaning in distributed systems. Thanks for the quick reply.
This demo will panic:
Here is the error message:
So this is expected behavior or a bug?