basho / riak_repl

Riak DC Replication
Apache License 2.0

riak_repl2_leader needs to gracefully restart #427

Open bookshelfdave opened 10 years ago

bookshelfdave commented 10 years ago

(dev2@127.0.0.1)4> rp(sys:get_status(riak_repl2_leader_gs)).
{status,<0.2100.0>,
        {module,gen_server},
        [[{'$ancestors',[riak_repl_sup,<0.528.0>]},
          {'$initial_call',{riak_repl2_leader,init,1}}],
        running,<0.2092.0>,[],
        [{header,"Status for generic server riak_repl2_leader_gs"},
          {data,[{"Status",running},
                {"Parent",<0.2092.0>},
                {"Logged events",[]}]},
          {data,[{"State",
                  {state,<0.3018.0>,false,'dev3@127.0.0.1',
                        <21975.2112.0>,#Ref<0.0.0.12759>,
                        [#Fun<riak_repl2_fscoordinator_sup.set_leader.2>,
                          #Fun<riak_core_cluster_mgr.set_leader.2>],
                        ['dev1@127.0.0.1','dev2@127.0.0.1',
                          'dev3@127.0.0.1'],
                        [],
                        {interval,#Ref<0.0.0.7108>},
                        0}}]}]]}
ok
(dev2@127.0.0.1)5> exit(whereis(riak_repl2_leader_gs), kill).
true
(dev2@127.0.0.1)6> 06:42:19.231 [error] Supervisor riak_repl_sup had child riak_repl2_leader started with riak_repl2_leader:start_link() at <0.2100.0> exit with reason killed in context child_terminated

(dev2@127.0.0.1)6> 
(dev2@127.0.0.1)6> rp(sys:get_status(riak_repl2_leader_gs)). 
{status,<0.4119.0>,
        {module,gen_server},
        [[{'$ancestors',[riak_repl_sup,<0.528.0>]},
          {'$initial_call',{riak_repl2_leader,init,1}}],
        running,<0.2092.0>,[],
        [{header,"Status for generic server riak_repl2_leader_gs"},
          {data,[{"Status",running},
                {"Parent",<0.2092.0>},
                {"Logged events",[]}]},
          {data,[{"State",
                  {state,undefined,false,undefined,undefined,
                        undefined,[],[],
                        ['dev2@127.0.0.1'],
                        {interval,#Ref<0.0.0.20336>},
                        undefined}}]}]]}
ok
(dev2@127.0.0.1)7> riak_repl2_leader:leader_node().
'dev1@127.0.0.1'
(dev2@127.0.0.1)8> riak_core_cluster_mgr:get_leader().
'dev3@127.0.0.1'

jonmeredith commented 10 years ago

One of the following needs to happen:
1) Everything that uses repl2_leader events (like the cluster manager) dies when repl2_leader dies.
2) The notification hooks are stored somewhere that can tolerate a repl2_leader death (an ETS table that gets passed back to some supervisor).
3) Things that use repl2_leader events monitor it and re-register on restart.
4) repl2_leader events are converted to use gen_event and reuse the riak_core_guarded_event_handler.
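
For illustration, option 3 could look roughly like the sketch below. It is not code from riak_repl: the module name leader_watcher and the 100 ms retry delay are made up, and because the real registration call is not shown in this thread, it is passed in as a zero-arity fun. The watcher monitors whatever process holds the registered name (riak_repl2_leader_gs in the transcript above) and calls the fun again each time a new process appears under that name.

```erlang
-module(leader_watcher).
-behaviour(gen_server).

-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Hypothetical retry delay while waiting for the restarted process.
-define(RETRY_MS, 100).

-record(state, {name, register_fun, mref}).

%% Name:        registered name of the process to watch.
%% RegisterFun: zero-arity fun that (re-)registers the caller's hook.
start_link(Name, RegisterFun) ->
    gen_server:start_link(?MODULE, {Name, RegisterFun}, []).

init({Name, RegisterFun}) ->
    {ok, attach(#state{name = Name, register_fun = RegisterFun})}.

handle_call(_Req, _From, State) -> {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.

%% The watched process died. Its replacement may not be registered
%% yet, so try again after a short delay.
handle_info({'DOWN', MRef, process, _Pid, _Reason},
            #state{mref = MRef} = State) ->
    erlang:send_after(?RETRY_MS, self(), attach),
    {noreply, State#state{mref = undefined}};
handle_info(attach, State) ->
    {noreply, attach(State)};
handle_info(_Other, State) ->
    {noreply, State}.

terminate(_Reason, _State) -> ok.

code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Monitor the current holder of the name and register the hook with
%% it. If nobody holds the name yet, retry later.
attach(#state{name = Name, register_fun = RegisterFun} = State) ->
    case whereis(Name) of
        undefined ->
            erlang:send_after(?RETRY_MS, self(), attach),
            State;
        Pid ->
            %% Monitor first: if Pid is already dead, a 'DOWN' message
            %% arrives immediately and triggers another attach.
            MRef = erlang:monitor(process, Pid),
            RegisterFun(),
            State#state{mref = MRef}
    end.
```

Each consumer would start one watcher with the fun that performs its own registration. If the fun itself crashes (for example, because the leader dies mid-call), the watcher crashes with it and its own supervisor is expected to restart it.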

lordnull commented 10 years ago

I'm leaning toward 4, since it uses existing OTP behaviours and most closely models what we want. I was unable to find riak_core_guarded_event_handler, though.

jonmeredith commented 10 years ago

Apologies, I meant https://github.com/basho/riak_core/blob/develop/src/riak_core_eventhandler_guard.erl

cmeiklejohn commented 10 years ago

Moving to 2.1.