andrewjstone / rafter

An Erlang library application which implements the Raft consensus protocol

multiple state machines #6

Closed jschoch closed 11 years ago

jschoch commented 11 years ago

I'm thinking that having multiple leaders in a cluster, one per role, would be a good way to balance load across the cluster. Any thoughts on how to associate a role with a cluster so you could do something like the following?

Forgive my syntax; I'm learning Elixir and not quite up to speed with Erlang.

roles = [
  logger: [server1, server2, server3],
  queue: [server3, server2, server1],
  cron: [server2, server1, server3]
]

Is this possible, or are the peer1..5 atoms hard-coded and assumed to be in a single cluster?

andrewjstone commented 11 years ago

All nodes in a raft consensus group are equal. No roles are hardcoded. If a leader fails another leader will be elected by the remaining members of the group. If all nodes are up to date, election is essentially random and any of the alive nodes can become leader.

There is no way to designate a specific leader, and you really wouldn't need/want to in most situations. The whole point of leader election is to allow for HA and consistency. If you tie the leaders to given nodes, when those nodes go down your system is down.

jschoch commented 11 years ago

Well, I guess instead of

get_leader(peer1)

I'd like

get_leader(role,peer)

This way you can have different servers be leaders for different roles. Failover and HA would work exactly the same, but a single failure would only affect a single role and require an election for that role. All other roles would be unaffected, provided the server that failed was not the leader for any of them at that time.

andrewjstone commented 11 years ago

@jschoch Unfortunately that's not how the algorithm works. However, you can certainly start multiple consensus groups on the same nodes. Each peer is only a few Erlang processes, so you could theoretically run thousands of them. You'd likely hit memory, bandwidth, or disk limits well before exceeding the number of processes Erlang allows you to run.

Once the peers are started you can then do the following:

get_leader(group1_peer1).
get_leader(group2_peer1).

This has the additional benefit of allowing each of those groups to use a different state machine and store its data in different log files.
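A minimal sketch of one way to generate peer names like the ones above, assuming a `groupN_peerM` naming convention. The module `group_names` and both functions are hypothetical helpers, not part of rafter; they only build atoms and make no rafter calls.

```erlang
-module(group_names).
-export([peer_name/2, group_peers/2]).

%% Hypothetical helper, not part of rafter: builds an atom such as
%% group1_peer1 from a group index and a peer index.
-spec peer_name(pos_integer(), pos_integer()) -> atom().
peer_name(Group, Peer) ->
    list_to_atom("group" ++ integer_to_list(Group) ++
                 "_peer" ++ integer_to_list(Peer)).

%% Names of every peer in one group of the given size.
-spec group_peers(pos_integer(), pos_integer()) -> [atom()].
group_peers(Group, Size) ->
    [peer_name(Group, P) || P <- lists:seq(1, Size)].
```

Each generated name could then be passed to get_leader as shown above. Atoms are never garbage collected, so this approach only suits a bounded number of groups and peers.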

sumerman commented 10 years ago

@andrewjstone Please elaborate on this in the README — although one can figure out to give disjoint groups a spin, correctness of such actions isn't obvious and may be implementation dependent.

Furthermore, it would be very handy if an arbitrary term were allowed as a peer name (e.g. {group, I, peer, J}, where I and J are integers). This would free client code from tons of integer_to_list and list_to_atom calls. This behaviour could be implemented with minimal effort using gproc in non-distributed mode and the {via, Module, ViaName} feature of gen_fsm.
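A minimal sketch of the {via, Module, ViaName} idea, under stated assumptions: it swaps in `global` and `gen_server` from stock OTP in place of gproc and gen_fsm so that it runs without external dependencies. The module `via_peer` is a hypothetical stand-in for a peer process and says nothing about how rafter registers its peers. With gproc, the via tuple would take the form {via, gproc, {n, l, Name}} instead.

```erlang
-module(via_peer).
-behaviour(gen_server).
-export([start_link/1, whoami/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Name may be any term, e.g. {group, 1, peer, 2}; no atom is created.
start_link(Name) ->
    gen_server:start_link({via, global, Name}, ?MODULE, Name, []).

%% Look the process up by the same term and ask for its registered name.
whoami(Name) ->
    gen_server:call({via, global, Name}, whoami).

%% The process state is simply the name it was registered under.
init(Name) ->
    {ok, Name}.

handle_call(whoami, _From, Name) ->
    {reply, Name, Name}.

handle_cast(_Msg, Name) ->
    {noreply, Name}.
```

Client code can then address a peer as `via_peer:whoami({group, 1, peer, 2})` without any integer_to_list or list_to_atom conversions.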

andrewjstone commented 10 years ago

Hi @sumerman,

I do plan at some point to write about, and work on, a consensus group manager that will allow handling multiple disjoint consensus groups. I don't ever expect them to coordinate among themselves, however. They would be used instead for naturally partitioned data. For instance, each consensus group would manage data for one specific business customer. You could do aggregate calculations across the groups, but the aggregates wouldn't be consistent.

I'm actually not even sure this manager should be part of rafter rather than part of the application using rafter. It is somewhat beyond the scope of implementing the Raft algorithm, so I've punted on it until the rest of the code is production ready, which will probably be a while considering it still needs compaction, anti-entropy, better log performance, etc.

As far as using arbitrary terms goes, I agree that this would indeed limit type conversions when implementing a non-Erlang frontend. However, the code is simpler without gproc. That said, I'm open to changing it. I'd probably ignore the I and J, since there is only one group as described above, and just use a term as the name directly. If you want to send a patch, I'll take a look.