basho / riak_repl

Riak DC Replication
Apache License 2.0

Node added to cluster with realtime replication enabled and started does not connect to sink #449

Closed: jcapricebasho closed this issue 10 years ago

jcapricebasho commented 10 years ago

When adding a node to a cluster that is already replicating via realtime replication, the added node does not successfully connect to the sink until it is stopped and started (after waiting for the transfers from the join to complete). The replication status then shows:

sinks: []
sources: [{source_stats,
              [{pid,"<0.5576.0>"},
              {message_queue_len,0},
              {rt_source_connected_to,
                  [{source,"CA"},{pid,"<0.5576.0>"},{connected,false}]}]}]
fullsync_coordinator: []
fullsync_coordinator_srv: []
cluster_name: <<"XV">>
cluster_leader: 'dev5@127.0.0.1'
connected_clusters: [<<"CA">>]
realtime_queue_stats: [{bytes,768},
                      {max_bytes,104857600},
                      {consumers,[{"CA",
                                    [{pending,0},
                                    {unacked,0},
                                    {drops,0},
                                    {errs,0}]}]},
                      {overload_drops,0}]
proxy_get: [{requester,[]},{provider,[]}]
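
For reference, a minimal sketch of the check and the workaround described above, assuming the devrel layout from the script below (the dev5 paths are specific to this setup):

# On the newly joined node:
dev/dev5/bin/riak-repl status        # source reports {connected,false}

# Workaround: stop and start the node once transfers have completed
dev/dev5/bin/riak stop
dev/dev5/bin/riak start
dev/dev5/bin/riak-repl status        # source now connects to the sink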

I have tested this with 1.4.3, and with 1.4.2 patched with the 1.4.3 backported beams from here: https://github.com/basho/internal_wiki/wiki/Riak-1.4.2-release---beams-for-openx

I used the script below to set up my two clusters:

#!/bin/bash

echo "Attempting to activate Erlang using default kerl settings..."
. /opt/erlang/r15b01/activate

command -v erl >/dev/null 2>&1 || { echo >&2 "Erlang is required. Please install Erlang or activate via kerl."; exit 1; }

RINGSIZE=64
BACKEND=leveldb

while getopts r:b:h option
do
  case "${option}" in
    r) 
        RINGSIZE=${OPTARG};;
    b) 
        BACKEND=${OPTARG};;
    h) 
        echo "-r     Set ring_creation_size.  Default: 16"
        echo "-b     Set backend, bitcask/leveldb.  Default: bitcask"
        exit 1
        ;;  
  esac
done

RIAKBACKEND='riak_kv_bitcask_backend'

if [ "$BACKEND" = "leveldb" ]
then
    RIAKBACKEND='riak_kv_eleveldb_backend'
fi

echo "Killing all active Riak processes..."
ps -ef | grep beam | grep -v grep | awk '{print $2}' | xargs kill -9
ulimit -n 4096

git clone http://github.com/basho/riak_ee
cd riak_ee
git checkout riak_ee-1.4.3
make rel
make devrel DEVNODES=10

NODE=1
while [ $NODE -lt 11 ]; do
    # Only copy patches for 1.4.2
    #echo Copying Patches to basho-patches for Node $NODE
    #cp ../riak_repl_09292013/riak_repl2_rtq.beam dev/dev$NODE/lib/basho-patches/riak_repl2_rtq.beam
    #cp ../riak_repl_09292013/riak_repl2_rtsink_conn.beam dev/dev$NODE/lib/basho-patches/riak_repl2_rtsink_conn.beam
    #cp ../riak_repl_09292013/riak_repl2_rtsource_conn.beam dev/dev$NODE/lib/basho-patches/riak_repl2_rtsource_conn.beam
    echo Setting Ring Size to $RINGSIZE for Node $NODE
    sed -i .bk "s/%{ring_creation_size, 64}/{ring_creation_size, $RINGSIZE}/g" dev/dev$NODE/etc/app.config
    echo Setting backend to $RIAKBACKEND for Node $NODE
    sed -i .bk "s/riak_kv_bitcask_backend/$RIAKBACKEND/g" dev/dev$NODE/etc/app.config
    echo Disabling sfwi in vm.args
    sed -i .bk "s/+sfwi/#+sfwi/g" dev/dev$NODE/etc/vm.args
    echo Starting Node $NODE
    dev/dev$NODE/bin/riak start
    let NODE=NODE+1
done

# Join nodes to clusters
dev/dev2/bin/riak-admin cluster join dev1@127.0.0.1
dev/dev3/bin/riak-admin cluster join dev1@127.0.0.1
dev/dev4/bin/riak-admin cluster join dev1@127.0.0.1

dev/dev7/bin/riak-admin cluster join dev6@127.0.0.1
dev/dev8/bin/riak-admin cluster join dev6@127.0.0.1
dev/dev9/bin/riak-admin cluster join dev6@127.0.0.1

# Plan and commit cluster changes
dev/dev1/bin/riak-admin cluster plan
dev/dev1/bin/riak-admin cluster commit

dev/dev6/bin/riak-admin cluster plan
dev/dev6/bin/riak-admin cluster commit

# Name clusters
dev/dev1/bin/riak-repl clustername XV
dev/dev6/bin/riak-repl clustername CA

# Connect clusters
dev/dev1/bin/riak-repl connect 127.0.0.1:10066

# Virginia <-> California
dev/dev1/bin/riak-repl realtime enable CA
dev/dev1/bin/riak-repl realtime start CA

dev/dev6/bin/riak-repl realtime enable XV
dev/dev6/bin/riak-repl realtime start XV

Once set up, I added dev5 to dev1's cluster, then planned and committed the changes. Even after transfers had completed and both ringready and ring-status reported the ring as ready, the source remained disconnected from the sink.
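
The join was done with the same commands and devrel paths as in the setup script; roughly (a sketch, not a verbatim transcript):

dev/dev5/bin/riak-admin cluster join dev1@127.0.0.1
dev/dev1/bin/riak-admin cluster plan
dev/dev1/bin/riak-admin cluster commit

# Wait for handoff to finish and the ring to settle
dev/dev1/bin/riak-admin transfers
dev/dev1/bin/riak-admin ringready
dev/dev1/bin/riak-admin ring-status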

Also of note, running riak-repl connections from each node in the cluster returns no connections once the new node is added.

Before adding the node:

Connection          Cluster Name        <Ctrl-Pid>      [Members]
----------          ------------        ----------      ---------
CA                  CA                  <21948.2667.0>  ["127.0.0.1:10066",
                                                          "127.0.0.1:10076",
                                                          "127.0.0.1:10086",
                                                          "127.0.0.1:10096"] (via 127.0.0.1:10096)

After adding the node (join, plan, commit):

Connection          Cluster Name        <Ctrl-Pid>      [Members]
----------          ------------        ----------      ---------

jcapricebasho commented 10 years ago

https://github.com/basho/riak_ee-issues/issues/11

cmeiklejohn commented 10 years ago

https://github.com/basho/riak_test/pull/459, does this look like a good start? We should comment here, and not on the riak_test PR since that's a public repository.

bsparrow435 commented 10 years ago

I have replicated this locally and have some further data to share. Debug logs on the joining node show:

2013-12-04 23:35:04.548 [debug] <0.2069.0>@riak_core_cluster_mgr:handle_call:292 Rebalancer: [] -> []
2013-12-04 23:35:04.548 [debug] <0.2067.0>@riak_core_connection_mgr:start_request:453 Connection Manager located no endpoints for: {rt_repl,"B"}. Will retry.
2013-12-04 23:35:09.549 [debug] <0.2069.0>@riak_core_cluster_mgr:handle_call:292 Rebalancer: [] -> []
2013-12-04 23:35:09.549 [debug] <0.2067.0>@riak_core_connection_mgr:start_request:453 Connection Manager located no endpoints for: {rt_repl,"B"}. Will retry.
2013-12-04 23:35:14.550 [debug] <0.2069.0>@riak_core_cluster_mgr:handle_call:292 Rebalancer: [] -> []

Tracing this call, it looks like the riak_core_cluster_mgr State#state.cluster is not being updated correctly. Confirmed locally with sys:get_status(ClusterMgrPID):
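
(The dumps below came from each node's console via riak attach; a minimal sketch of how to take them, assuming the manager is the gen_server registered as riak_core_cluster_manager, per the status header below:)

%% In the Erlang shell on each node:
ClusterMgrPID = whereis(riak_core_cluster_manager).
sys:get_status(ClusterMgrPID).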

GOOD NODE:

{status,<0.2069.0>,
        {module,gen_server},
        [[{'$ancestors',[riak_core_cluster_mgr_sup,riak_repl_sup,
                         <0.522.0>]},
          {'$initial_call',{riak_core_cluster_mgr,init,1}}],
         running,<0.1860.0>,[],
         [{header,"Status for generic server riak_core_cluster_manager"},
          {data,[{"Status",running},
                 {"Parent",<0.1860.0>},
                 {"Logged events",[]}]},
          {data,[{"State",
                  {state,false,'143ee-node2@127.0.0.1',infinity,
                         #Fun<riak_repl_app.2.75412403>,
                         #Fun<riak_repl_app.12.75412403>,
                         #Fun<riak_repl_app.11.75412403>,
                         #Fun<riak_core_cluster_mgr.1.56747935>,
                         [{"B",
                           {cluster,"B",
                                    [{"127.0.0.1",9082}],
                                    {1386,217106,694663}}}]}}]}]]}

BAD NODE:

{status,<0.2069.0>,
        {module,gen_server},
        [[{'$ancestors',[riak_core_cluster_mgr_sup,riak_repl_sup,
                         <0.522.0>]},
          {'$initial_call',{riak_core_cluster_mgr,init,1}}],
         running,<0.1860.0>,[],
         [{header,"Status for generic server riak_core_cluster_manager"},
          {data,[{"Status",running},
                 {"Parent",<0.1860.0>},
                 {"Logged events",[]}]},
          {data,[{"State",
                  {state,true,'143ee-node2@127.0.0.1',infinity,
                         #Fun<riak_repl_app.2.75412403>,
                         #Fun<riak_repl_app.12.75412403>,
                         #Fun<riak_repl_app.11.75412403>,
                         #Fun<riak_core_cluster_mgr.1.56747935>,
                         [{"B",{cluster,"B",[],{1386,217859,492664}}}]}}]}]]}

This issue is resolved on restart, so init/1 is setting the state correctly; it is the running update path that leaves the cluster record's member list empty (note the {cluster,"B",[],...} entry in the bad node's state above versus the populated member list on the good node).

bsparrow435 commented 10 years ago

Adding backtrace info:

BAD NODE:

(143ee-node2@127.0.0.1)14> io:format("~s\n", [element(2, process_info(whereis(riak_core_connection_manager), backtrace))])
(143ee-node2@127.0.0.1)14> .
Program counter: 0x0000000104040c10 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00000001061e0d48 Return addr 0x0000000101f8d1b0 (proc_lib:init_p_do_apply/3 + 56)
y(0)     []
y(1)     infinity
y(2)     riak_core_connection_mgr
y(3)     {state,false,[{req,#Ref<0.0.0.18484>,undefined,{rt_repl,"B"},{{realtime,[{2,0},{1,5}]},{[{keepalive,true},{nodelay,true},{packet,0},{active,false}],riak_repl2_rtsource_conn,<0.4098.0>}},default,undefined,init,undefined}],[],[{cluster_by_addr,#Fun<riak_core_cluster_mgr.23.56747935>},{cluster_by_name,#Fun<riak_repl_app.10.75412403>},{identity,#Fun<riak_core_connection_mgr.0.36248387>},{proxy_get,#Fun<riak_repl2_pg.3.107618181>},{rt_repl,#Fun<riak_repl2_rt.2.44567336>}],0,0}
y(4)     riak_core_connection_manager
y(5)     <0.1860.0>

0x00000001061e0d80 Return addr 0x0000000101852568 (<terminate process normally>)
y(0)     Catch 0x0000000101f8d1d0 (proc_lib:init_p_do_apply/3 + 88)

GOOD NODE:

(143ee-node1@127.0.0.1)1> io:format("~s\n", [element(2, process_info(whereis(riak_core_connection_manager), backtrace))])
(143ee-node1@127.0.0.1)1> .
Program counter: 0x0000000109436c10 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0

0x000000010a134998 Return addr 0x00000001073831b0 (proc_lib:init_p_do_apply/3 + 56)
y(0)     []
y(1)     infinity
y(2)     riak_core_connection_mgr
y(3)     {state,false,[],[{{"0.0.0.0",9082},{ep,{"0.0.0.0",9082},0,1,0,false,0,[],undefined,0}},{{"127.0.0.1",9082},{ep,{"127.0.0.1",9082},0,2,0,false,0,[],undefined,0}}],[{cluster_by_addr,#Fun<riak_core_cluster_mgr.23.56747935>},{cluster_by_name,#Fun<riak_repl_app.10.75412403>},{identity,#Fun<riak_core_connection_mgr.0.36248387>},{proxy_get,#Fun<riak_repl2_pg.3.107618181>},{rt_repl,#Fun<riak_repl2_rt.2.44567336>}],0,0}
y(4)     riak_core_connection_manager
y(5)     <0.1860.0>

0x000000010a1349d0 Return addr 0x0000000106c48568 (<terminate process normally>)
y(0)     Catch 0x00000001073831d0 (proc_lib:init_p_do_apply/3 + 88)

ok

cmeiklejohn commented 10 years ago

It took about 4 hours and 30 riak_test runs, but I now have a good log of the problem. As far as I can tell, the issue is a race in which the node only becomes fully joined after all of the leader elections have occurred, so the new node never triggers a connection to the newly learned sink. If this is the case, the proposed fix of adding a connect_to_clusters call in the "still the leader" phase of the cluster manager is not the correct fix.

➜  riak_test git:(feature/csm/verify-sink-connections) ✗ tail -f ~/rt/riak_ee/current/dev/dev5/log/console.log
2013-12-07 02:58:31.486 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_kv,handoff_data_encoding} = encode_raw
2013-12-07 02:58:31.499 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_kv,object_format} = v1
2013-12-07 02:58:31.516 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_kv,secondary_index_version} = v2
2013-12-07 02:58:31.534 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_kv,vclock_data_encoding} = encode_zlib
2013-12-07 02:58:31.553 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_kv,crdt} = [pncounter]
2013-12-07 02:58:31.748 [info] <0.7.0> Application riak_kv started on node 'dev5@127.0.0.1'
2013-12-07 02:58:32.275 [info] <0.7.0> Application merge_index started on node 'dev5@127.0.0.1'
2013-12-07 02:58:32.408 [info] <0.7.0> Application riak_search started on node 'dev5@127.0.0.1'
2013-12-07 02:58:32.549 [info] <0.7.0> Application ranch started on node 'dev5@127.0.0.1'
2013-12-07 02:58:32.568 [info] <0.622.0>@riak_core:wait_for_service:470 Waiting for service riak_kv to start (0 seconds)
2013-12-07 02:59:25.294 [info] <0.458.0>@riak_core:wait_for_service:464 Wait complete for service riak_kv (47 seconds)
2013-12-07 02:59:25.294 [info] <0.622.0>@riak_core:wait_for_service:464 Wait complete for service riak_kv (46 seconds)
2013-12-07 02:59:25.913 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_repl,bloom_fold} = true
2013-12-07 02:59:26.277 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_repl,rtq_meta} = true
2013-12-07 02:59:26.881 [info] <0.622.0>@riak_repl_app:start:55 REPL CS block skip disabled
2013-12-07 02:59:29.628 [info] <0.2237.0>@riak_core_service_mgr:start_link:108 Starting Core Service Manager at {"127.0.0.1",10056}
2013-12-07 02:59:30.292 [info] <0.2233.0>@riak_repl2_leader:become_leader:214 Leader2: 'dev5@127.0.0.1' elected as replication leader
2013-12-07 02:59:30.292 [info] <0.2233.0>@riak_repl2_leader:handle_call:113 calling notifyfuns: []
2013-12-07 02:59:32.774 [info] <0.2245.0>@riak_core_service_mgr:start_dispatcher:497 Service manager: listening on 127.0.0.1:10056
2013-12-07 02:59:34.135 [info] <0.2489.0>@riak_core_cluster_mgr:become_leader:700 ClusterManager: 'dev5@127.0.0.1' becoming the leader
2013-12-07 02:59:34.182 [info] <0.2492.0>@riak_core_tcp_mon:init:168 Starting TCP Monitor
2013-12-07 02:59:42.058 [info] <0.2233.0>@riak_repl2_fscoordinator_sup:set_leader:14 fscoordinator_sup set leader triggered
2013-12-07 02:59:42.058 [info] <0.2489.0>@riak_core_cluster_mgr:become_leader:708 ClusterManager: 'dev5@127.0.0.1' still the leader
2013-12-07 02:59:46.970 [info] <0.7.0> Application riak_repl started on node 'dev5@127.0.0.1'
2013-12-07 02:59:48.268 [info] <0.7.0> Application riak_snmp started on node 'dev5@127.0.0.1'
2013-12-07 02:59:48.715 [info] <0.7.0> Application riak_jmx started on node 'dev5@127.0.0.1'
2013-12-07 02:59:49.151 [info] <0.7.0> Application cluster_info started on node 'dev5@127.0.0.1'
2013-12-07 02:59:49.548 [info] <0.207.0>@riak_core_capability:process_capability_changes:530 New capability: {riak_control,member_info_version} = v1
2013-12-07 02:59:49.841 [info] <0.7.0> Application riak_control started on node 'dev5@127.0.0.1'
2013-12-07 02:59:49.842 [info] <0.7.0> Application erlydtl started on node 'dev5@127.0.0.1'
2013-12-07 02:59:53.318 [info] <0.199.0>@riak_repl2_rtsource_conn_sup:enable:15 Starting replication realtime source "B"
2013-12-07 02:59:53.334 [info] <0.2233.0>@riak_repl2_leader:become_leader:214 Leader2: 'dev5@127.0.0.1' elected as replication leader
2013-12-07 02:59:53.334 [info] <0.2233.0>@riak_repl2_leader:handle_call:113 calling notifyfuns: [#Fun<riak_repl2_fscoordinator_sup.set_leader.2>,#Fun<riak_core_cluster_mgr.set_leader.2>]
2013-12-07 02:59:53.334 [info] <0.2233.0>@riak_repl2_fscoordinator_sup:set_leader:14 fscoordinator_sup set leader triggered
2013-12-07 02:59:53.334 [info] <0.2489.0>@riak_core_cluster_mgr:become_leader:708 ClusterManager: 'dev5@127.0.0.1' still the leader
2013-12-07 02:59:53.420 [info] <0.199.0>@riak_repl2_rtsource_conn_sup:disable:20 Stopping replication realtime source "B"
2013-12-07 02:59:53.485 [info] <0.2233.0>@riak_repl2_leader:become_leader:209 LeaderII: 'dev5@127.0.0.1' re-elected as replication leader
2013-12-07 02:59:53.485 [info] <0.2233.0>@riak_repl2_leader:handle_call:113 calling notifyfuns: [#Fun<riak_repl2_fscoordinator_sup.set_leader.2>,#Fun<riak_core_cluster_mgr.set_leader.2>]
2013-12-07 02:59:53.485 [info] <0.2233.0>@riak_repl2_fscoordinator_sup:set_leader:14 fscoordinator_sup set leader triggered
2013-12-07 02:59:53.485 [info] <0.2489.0>@riak_core_cluster_mgr:become_leader:708 ClusterManager: 'dev5@127.0.0.1' still the leader
2013-12-07 02:59:58.335 [info] <0.2233.0>@riak_repl2_leader:become_leader:209 LeaderII: 'dev5@127.0.0.1' re-elected as replication leader
2013-12-07 02:59:58.335 [info] <0.2233.0>@riak_repl2_leader:handle_call:113 calling notifyfuns: [#Fun<riak_repl2_fscoordinator_sup.set_leader.2>,#Fun<riak_core_cluster_mgr.set_leader.2>]
2013-12-07 02:59:58.335 [info] <0.2233.0>@riak_repl2_fscoordinator_sup:set_leader:14 fscoordinator_sup set leader triggered
2013-12-07 02:59:58.335 [info] <0.2489.0>@riak_core_cluster_mgr:become_leader:708 ClusterManager: 'dev5@127.0.0.1' still the leader
2013-12-07 03:00:01.687 [info] <0.200.0>@riak_core_gossip:log_membership_changes:378 'dev5@127.0.0.1' changed from 'joining' to 'valid'

cmeiklejohn commented 10 years ago

1.4.4 will include a patch, #451, which resolves this.