Open d0rc opened 10 years ago
@d0rc I'm having trouble reproducing this issue. I've killed nodes, given bad commands, restarted all nodes, etc. I saw something similar to this with bugs in the past, but not recently. If you have the logs and the nodes are still crashing like this, can you stop the nodes, tar up the data directory, and email it to me? astone AT basho Dot Com
Additionally, did you have existing data, stop the nodes, pull a new version from github, and start again? I'm just trying to nail down potential causes. You can likely clear this up, and hopefully not see it again, by using the latest code, wiping your data (which appears to be test data anyway), and starting again.
Thanks for reporting this. Hopefully I'll get it sorted out.
Without knowing about this issue, I ran into this problem. I wrote an expect script to reproduce it.
Basically, I tried to append an entry to the log when the leader had lost contact with its two followers.
Unfortunately, it's not possible to attach files to comments, so I have to paste it inline:
spawn ./bin/start-node peer1
set p1 $spawn_id
spawn ./bin/start-node peer2
set p2 $spawn_id
spawn ./bin/start-node peer3
set p3 $spawn_id
expect -i $p1 "1>"
expect -i $p2 "1>"
expect -i $p3 "1>"
sleep 3
send -i $p1 "Peers = [{peer1, 'peer1@127.0.0.1'}, {peer2, 'peer2@127.0.0.1'}, {peer3, 'peer3@127.0.0.1'}].\n"
expect -i $p1 "2>"
send -i $p1 "rafter:set_config(peer1, Peers).\n"
expect -i $p1 "{ok,{config,stable" {} timeout { exit }
expect -i $p1 "3>"
sleep 2
send -i $p1 "rafter:get_leader(peer1).\n"
expect -i $p1 -re ".\n(.*)\r\n.*4>"
set leader $expect_out(1,string)
send_user "\nleader: $leader\n"
send -i $p1 "rafter:op(rafter:get_leader(peer1), {new, ourtable}).\n"
expect -i $p1 "5>"
send -i $p1 "rafter:op(rafter:get_leader(peer1), {put, ourtable, foo, 1}).\n"
expect -i $p1 "6>"
sleep 2
if { $leader eq "{peer1,'peer1@127.0.0.1'}" } {
    set leader_p $p1
    set leader_name "peer1"
} elseif { $leader eq "{peer2,'peer2@127.0.0.1'}" } {
    set leader_p $p2
    set leader_name "peer2"
} elseif { $leader eq "{peer3,'peer3@127.0.0.1'}" } {
    set leader_p $p3
    set leader_name "peer3"
}
foreach { id } [list $p1 $p2 $p3] {
    if { $leader_p != $id } {
        expect -i $id ">"
        send -i $id "halt().\n"
    }
}
send_user "\nkilled everyone except leader\n"
sleep 3
send -i $leader_p "rafter:op(rafter:get_leader(peer1), {put, ourtable, foo, 2}).\n"
expect -i $leader_p "7>"
sleep 2
send_user "\nall commands executed\n"
send -i $leader_p "halt().\n"
expect -i $leader_p ">"
sleep 1
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        spawn ./bin/start-node $name
        set id$name $spawn_id
    }
}
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        expect -i [set id$name] "1>"
    }
}
sleep 3
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        send -i [set id$name] "{}.\n"
    }
}
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        expect -i [set id$name] "2>"
    }
}
spawn ./bin/start-node $leader_name
set leader $spawn_id
expect -i $leader "1>"
sleep 3
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        send -i [set id$name] "{}.\n"
    }
}
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        expect -i [set id$name] "3>"
    }
}
send -i $leader "{}.\n"
expect -i $leader "2>"
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        send -i [set id$name] "halt().\n"
    }
}
foreach { name } [list peer1 peer2 peer3] {
    if { $leader_name != $name } {
        expect -i [set id$name] ">"
    }
}
I can confirm this, using Erlang/OTP 17 and rafter 4dbbb7572a4dc4f16fb164b4ddbe6bd56495e765. I see things like these:
08:44:35.501 [error] gen_fsm peer1 in state follower terminated with reason: no case clause matching {ok,not_found} in rafter_consensus_fsm:'-commit_entries/2-fun-0-'/5 line 599
Magic Number found at 514
08:44:35.502 [error] CRASH REPORT Process peer1 with 0 neighbours exited with reason: no case clause matching {ok,not_found} in rafter_consensus_fsm:'-commit_entries/2-fun-0-'/5 line 599 in gen_fsm:terminate/7 line 620
08:44:35.503 [error] Supervisor peer1_sup had child rafter_consensus_fsm started with rafter_consensus_fsm:start_link(peer1, {peer1,'peer1@127.0.0.1'}, {rafter_opts,rafter_backend_ets,"./data"}) at <0.85.0> exit with reason no case clause matching {ok,not_found} in rafter_consensus_fsm:'-commit_entries/2-fun-0-'/5 line 599 in context child_terminated
08:44:35.527 [error] gen_fsm peer1 in state follower terminated with reason: no case clause matching {ok,not_found} in rafter_consensus_fsm:'-commit_entries/2-fun-0-'/5 line 599
Magic Number found at 591
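The error itself is a missing case clause: while committing entries, the code reads an entry from the log, the read comes back as {ok,not_found} (plausibly because the isolated leader's unreplicated append is no longer in the log after the restart), and no clause of the surrounding case matches, so the whole gen_fsm process dies. A minimal Erlang sketch of that failure mode, using hypothetical function names (this is not rafter's actual code):

```erlang
%% Hypothetical sketch, not rafter's actual code. apply_committed/2
%% mirrors the shape of the fun inside commit_entries/2: look up the
%% entry at Index and apply its command to the state machine.
apply_committed(Index, Log) ->
    case log_get_entry(Index, Log) of    % log_get_entry/2 is assumed
        {ok, {entry, _Term, Command}} ->
            apply_to_backend(Command)
        %% There is no clause for {ok, not_found} here. If the entry
        %% was lost (e.g. an append the isolated leader never
        %% replicated), this case crashes with
        %% {case_clause,{ok,not_found}}, which is exactly what the
        %% logs above show taking down the gen_fsm.
    end.
```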
Honestly, at this point, I've pretty much ceased development on rafter. I'm not sure when I'll really have time to dig into this issue. Rafter is definitely not a production ready project. There are a lot of rough edges, and I don't really have a use case at the moment enticing me to work on it more.
Since I stopped working on rafter I've poured a lot of my energy into Riak Ensemble. It is a production-ready consensus protocol that is used in Riak 2.0 to provide atomic single-key operations. While it differs from rafter in that it doesn't provide a globally ordered log, there is no reason a log cannot be built on top of Riak Ensemble. Additionally, Riak Ensemble provides leader leases allowing zero-round-trip reads, and built-in integrity trees that protect against some byzantine failure scenarios. It also manages multiple ensemble groups instead of the single one managed by rafter. On the downside, it requires rewriting active keys on epoch changes and maybe isn't quite as user-friendly to get started with. The big advantage, however, is that it is production ready now and in use in the soon-to-be-released Riak 2.0.
Here are the full failure log and the log of peer3 after a restart attempt: