Closed loucash closed 9 years ago
Hi @loucash I think you may indeed have found a bug. However, it require a couple more changes. The key problem as I see it is that calling rafter_log:append/2
always expects the index in the state to be the last written index and ends up calling rafter_log:write_entry/2
expecting that to be the case and incrementing the index. You noticed this, and hence insured that in your fix when write_entry/2
gets called it gets called with the last index and not the current index being written. This is correct, however I believe it may also need to be changed for the other calls to write_entry/2
in maybe_append/4
. I'll need to write some tests and spend a bit more time with this to ensure I'm correct.
Excellent find. Thank you very much for your contribution.
Currently I'm thinking of breaking out write_entry so that it doesn't increment the index as a side effect, and just doing that where needed. I think this slight refactor would make things cleaner than the random Index-1
and would negate the need for other changes to rafter_log:maybe_append/4
.
Also, I'm getting an error when trying to run your tests do to the way the headers are included:
Andrews-MacBook-Pro:lc-rafter ajs$ rebar ct skip_deps=true suites=rafter
==> lc-rafter (ct)
Showing log
--- Test run on 2015/04/04 16:19:28 ---
Converting "/Users/ajs/lc-rafter/." to "/Users/ajs/lc-rafter" and re-inserting with add_patha/1
Common Test v1.7.3 starting (cwd is /Users/ajs/lc-rafter)
Common Test: Running make in test directories...
Including the following directories:
"/Users/ajs/lc-rafter/include"
Recompile: rafter_SUITE
rafter_SUITE.erl:4: can't find include lib "rafter/include/rafter_opts.hrl"
rafter_SUITE.erl:5: can't find include lib "rafter/include/rafter.hrl"
rafter_SUITE.erl:28: record rafter_opts undefined
rafter_SUITE.erl:32: record config undefined
rafter_SUITE.erl:36: record config undefined
rafter_SUITE.erl:44: record rafter_entry undefined
rafter_SUITE.erl:47: record rafter_entry undefined
rafter_SUITE.erl:51: record rafter_entry undefined
{error,make_failed}
Recompile: rafter_system_test_eqc
Recompile: rafter_leader_election_eqc
Recompile: rafter_gen
Recompile: rafter_config_eqc
Recompile: rafter_backend_ets_eqc
CWD set to: "/Users/ajs/lc-rafter/logs/ERROR: Building tests failed
ct_run.test@Andrews-MacBook-Pro.local.2015-04-04_16.19.28"
TEST INFO: 1 test(s), 1 case(s) in 0 suite(s)
Testing ajs.lc-rafter.rafter_SUITE: Starting test, 1 test cases
Testing ajs.lc-rafter.rafter_SUITE: TEST COMPLETE, 0 ok, 0 failed, 1 skipped of 1 test cases
Updating /Users/ajs/lc-rafter/logs/index.html... done
Updating /Users/ajs/lc-rafter/logs/all_runs.html... done
ERROR: ct failed while processing /Users/ajs/lc-rafter: rebar_abortAndrews-MacBook-Pro:lc-rafter ajs$ rebar ct skip_deps=true suites=rafter
==> lc-rafter (ct)
Showing log
--- Test run on 2015/04/04 16:19:28 ---
Converting "/Users/ajs/lc-rafter/." to "/Users/ajs/lc-rafter" and re-inserting with add_patha/1
Common Test v1.7.3 starting (cwd is /Users/ajs/lc-rafter)
Common Test: Running make in test directories...
Including the following directories:
"/Users/ajs/lc-rafter/include"
Recompile: rafter_SUITE
rafter_SUITE.erl:4: can't find include lib "rafter/include/rafter_opts.hrl"
rafter_SUITE.erl:5: can't find include lib "rafter/include/rafter.hrl"
rafter_SUITE.erl:28: record rafter_opts undefined
rafter_SUITE.erl:32: record config undefined
rafter_SUITE.erl:36: record config undefined
rafter_SUITE.erl:44: record rafter_entry undefined
rafter_SUITE.erl:47: record rafter_entry undefined
rafter_SUITE.erl:51: record rafter_entry undefined
{error,make_failed}
Recompile: rafter_system_test_eqc
Recompile: rafter_leader_election_eqc
Recompile: rafter_gen
Recompile: rafter_config_eqc
Recompile: rafter_backend_ets_eqc
CWD set to: "/Users/ajs/lc-rafter/logs/ERROR: Building tests failed
ct_run.test@Andrews-MacBook-Pro.local.2015-04-04_16.19.28"
TEST INFO: 1 test(s), 1 case(s) in 0 suite(s)
Testing ajs.lc-rafter.rafter_SUITE: Starting test, 1 test cases
Testing ajs.lc-rafter.rafter_SUITE: TEST COMPLETE, 0 ok, 0 failed, 1 skipped of 1 test cases
Updating /Users/ajs/lc-rafter/logs/index.html... done
Updating /Users/ajs/lc-rafter/logs/all_runs.html... done
ERROR: ct failed while processing /Users/ajs/lc-rafter: rebar_abort
Hi @andrewjstone , thank you for your review!
I replaced include_lib
with include
in a suite (commit 60a3639eb1214e78cf61b61f30ffea5e6c6eb7e6), I couldn't reproduce this error, so please try again and let me know.
I concur with refactoring rafter_log:write_entry/2
and that Index-1
is not an elegant way (sic!). What do you think to just include an Index
in arguments in write_entry
?
My proposal: f08a57e9b0af47d935197aeb9347f74df26b3ba7
Cheers!
@loucash Yep, that's essentially what I"m thinking. It's been so long since I've touched this code that I decided to write a bunch of eunit tests and may even clean it up a bit more and rename some internal functions. I'm going to be playing around with it for a while tonight. I'll let you know what I come up with.
Cool, thanks :+1:
Hi @loucash. I spent quite a bit of time fixing this bug this weekend and in addition fixed another one. Both are in this PR. I think it's good to go. Please look it over and let me know what you think.
One more unrelated note. Rafter really is not ready to be used in production. As an example, you just found a dataloss bug that is part of the core functionality of the protocol. This at least shows that the system as a whole, and in parts requires a great deal more testing. Furthermore, there is absolutely no compaction being done currently. A long running system will have a potentially very long log. Even if you had infinite storage space, you would still have the problem that rebuilding state requires a complete replay of the log. For a few billion entries, this could take quite a while.
I personally would not even considering putting this into production without some sort of snapshotting capability and a lot more testing. Consensus by definition typically takes care of your most trusted data and operations, or else the expense wouldn't be worth it. I'd estimate about 1-2 months worth of fulltime, consecutive work to make rafter ready for production. It's hard getting back up to speed on a consensus algorithm after a year of no real development, so infrequent changes are likely to take longer.
A number of people have shown interest in rafter, and I will consider putting more time into it to get it to a place where I can confidently say: 'Yes go forth and use in production.' However, I strongly doubt that this will happen in less than a year given current interests and other development. If you really need something like this, I'd highly recommend using zookeeper, as it's fast and rock solid. If you really want something in Erlang, take a look at riak_ensemble. I've also worked on that and it's already part of a production database. It's different in that it doesn't provide a replicated log abstraction, and only paxos, and slightly harder to use, but it's available now.
Hi @andrewjstone, PR looks good to me. Thank you! Yeah, while I was learning rafter I saw those deficiencies and as a backup I plan to use riak_ensemble. It is just rafter has this nice, simple, clean api plus you can understand implementation fairly quickly. Anyway, If I notice some other problems I will let you know. Cheers.
Thanks @loucash . Those words mean a lot to me. I don't mean to be discouraging, just realistic. Enjoy playing around with rafter and I'll do my best to fix any problems that arise in a timely manner.
I merged #34 and tagged rafter at 0.2.3
Cool! Cheers @andrewjstone :+1:
Hey @andrewjstone, so I discovered that when a leader want to overwrite some entries in followers log it sets a wrong index.
An example:
term=1
andindex=1
and leader sends an entry withterm=2
andindex=1
;term=2
andindex=2
;term=2
andindex=1
.I attached a test with this case.
Sorry, I do not have a eqc license, so I couldn't run eqc tests.
And fyi, I am also working on proper statem for rafter_log, as I want to use rafter on production (really badly). Will ping you when it is ready.