basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Nodes don't initiate handoffs if `riak-admin vnode-status` is called often [JIRA: RIAK-1907] #754


hcs42 commented 9 years ago

The problem is easy to reproduce (I tested Riak 1.4.1 and Riak 2.1.1):

  1. Have a cluster with at least 2 nodes.
  2. Stop one node.
  3. Generate some puts so that there is data to be handed off. (I used basho_bench.)
  4. Execute riak-admin transfers. Select a node that has at least one partition to hand off to the stopped node. We will use this as the "source node".
  5. Make sure that riak-admin vnode-status is called on the source node at least once a minute, e.g. by starting the following loop:

    while true; do riak-admin vnode-status; sleep 50; done
  6. Start the stopped node.
  7. If you call bin/riak-admin transfers at any time in the future (while the loop is still running), it will show that partitions are still waiting to be handed off.
  8. If you stop the loop, the partitions will start being handed off.

The root cause is the following: inside each Riak node, there is a `riak_core_vnode` process for each vnode (implemented in the `riak_core_vnode.erl` module). This is a gen_fsm. If it doesn't receive a request for a minute, it sends an `inactive` event to the `riak_core_vnode_manager`, which triggers any outstanding handoffs. The problem is that running `riak-admin vnode-status` calls `riak_kv_status:vnode_status`, which dispatches a request to the `riak_core_vnode` process of each vnode. So if `riak-admin vnode-status` is called at least once a minute, the `riak_core_vnode` processes never reach the one-minute timeout, and therefore never check whether there are handoffs to perform.
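For anyone unfamiliar with the mechanism, here is a minimal, self-contained sketch of the gen_fsm inactivity pattern described above. The module and all names in it are illustrative, not Riak's actual code; where the real vnode notifies `riak_core_vnode_manager`, this sketch just prints a line:

    -module(vnode_inactivity_sketch).
    -behaviour(gen_fsm).
    -export([start_link/0, status/1]).
    -export([init/1, active/2, active/3, handle_event/3,
             handle_sync_event/4, handle_info/3, terminate/3,
             code_change/4]).

    -define(INACTIVITY_TIMEOUT, 60000).  %% one minute, in ms

    start_link() -> gen_fsm:start_link(?MODULE, [], []).

    %% Roughly what vnode-status amounts to: a synchronous request
    %% dispatched to the vnode FSM.
    status(Pid) -> gen_fsm:sync_send_event(Pid, status).

    init([]) -> {ok, active, #{}, ?INACTIVITY_TIMEOUT}.

    %% The timeout atom is delivered only if NO message arrived for a
    %% whole minute; the real vnode reports itself inactive here, which
    %% is what triggers outstanding handoffs.
    active(timeout, State) ->
        io:format("inactive: handoffs would be triggered now~n"),
        {next_state, active, State, ?INACTIVITY_TIMEOUT};
    active(_Request, State) ->
        {next_state, active, State, ?INACTIVITY_TIMEOUT}.

    %% Replying to a status request ALSO re-arms the timeout, so calling
    %% status/1 at least once a minute starves the inactivity check.
    active(status, _From, State) ->
        {reply, ok, active, State, ?INACTIVITY_TIMEOUT}.

    handle_event(_Event, StateName, State) ->
        {next_state, StateName, State, ?INACTIVITY_TIMEOUT}.
    handle_sync_event(_Event, _From, StateName, State) ->
        {reply, ok, StateName, State, ?INACTIVITY_TIMEOUT}.
    handle_info(_Info, StateName, State) ->
        {next_state, StateName, State, ?INACTIVITY_TIMEOUT}.
    terminate(_Reason, _StateName, _State) -> ok.
    code_change(_OldVsn, StateName, State, _Extra) -> {ok, StateName, State}.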

A workaround is to call the `riak_core_vnode_manager:force_handoffs/0` function on the source node, which starts a batch of handoffs (at most as many as the `transfer-limit` setting allows). If there are more outstanding handoffs than the transfer limit, the function needs to be called multiple times to perform all of them.
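For example, from an Erlang console attached to the source node via riak attach (the node name below is illustrative, and the `ok` shown is what a cast-based call would return; the exact value may differ by version):

    (riak@source-node)1> riak_core_vnode_manager:force_handoffs().
    ok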


binarytemple commented 9 years ago

Confirmed from our side; @joecaswell also had some comments:

The `inactive` message only gets sent to the vnode manager when the vnode FSM receives the atom `timeout` while in the `active` state.

The active state accepts and ignores any message it does not expect.

This means that if the vnode receives any message (of any type from any sender) at least once per minute, the FSM timeout will never occur so the inactive message will never get sent.
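Sketched from the description above and the `continue/1` body quoted below (not the verbatim source), the problematic shape is a catch-all clause that re-arms the timeout:

    %% Any unexpected message falls through to this clause...
    active(_Unexpected, State) ->
        continue(State).

    %% ...which re-arms the one-minute inactivity timeout from scratch.
    continue(State) ->
        {next_state, active, State, State#state.inactivity_timeout}.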

I think resolving this in a resilient manner will involve adding a now()-style last-action timestamp to the state, updated whenever action requests (get, put, fold, index, etc.) are handled.

And having the continue/1 function use something like:

{next_state, active, State,
 %% now_diff/2 is in microseconds; the gen_fsm timeout must be an
 %% integer number of milliseconds, hence div rather than /
 max(1, State#state.inactivity_timeout -
        abs(timer:now_diff(os:timestamp(),
                           State#state.last_action_timestamp)) div 1000)}.

Instead of

{next_state, active, State, State#state.inactivity_timeout}.
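As a self-contained sketch of that proposal (the module name, the trimmed record, and the `new/0`/`mark_action/1` helpers are hypothetical; the real `#state{}` in riak_core_vnode has many more fields and no last_action_timestamp yet, which is the point of the change):

    -module(last_action_sketch).
    -export([new/0, mark_action/1, continue/1]).

    -record(state, {inactivity_timeout = 60000,   %% ms
                    last_action_timestamp}).

    new() ->
        #state{last_action_timestamp = os:timestamp()}.

    %% Call wherever a real action request (get/put/fold/index) is handled.
    mark_action(State) ->
        State#state{last_action_timestamp = os:timestamp()}.

    %% Time out relative to the last real action, not the last message of
    %% any kind. timer:now_diff/2 returns microseconds, hence div 1000.
    continue(State = #state{inactivity_timeout = Max,
                            last_action_timestamp = Last}) ->
        Elapsed = abs(timer:now_diff(os:timestamp(), Last)) div 1000,
        {next_state, active, State, max(1, Max - Elapsed)}.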
jadeallenx commented 9 years ago

This sounds like a case where it would be simplest to schedule an explicit timer message with erlang:send_after/3 and then handle that info message to update the FSM state properly.
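A minimal sketch of that approach (module and message names are hypothetical): one explicit timer message that ordinary vnode traffic cannot cancel simply by arriving, consumed in handle_info/3:

    -module(send_after_sketch).
    -behaviour(gen_fsm).
    -export([start_link/0]).
    -export([init/1, active/2, handle_event/3, handle_sync_event/4,
             handle_info/3, terminate/3, code_change/4]).

    -define(INACTIVITY_TIMEOUT, 60000).  %% ms

    start_link() -> gen_fsm:start_link(?MODULE, [], []).

    init([]) ->
        erlang:send_after(?INACTIVITY_TIMEOUT, self(), inactivity_check),
        {ok, active, #{last_action => os:timestamp()}}.

    %% Real requests just stamp the state; they never touch the timer.
    active(_Request, State) ->
        {next_state, active, State#{last_action := os:timestamp()}}.

    handle_info(inactivity_check, active, State = #{last_action := Last}) ->
        IdleMs = timer:now_diff(os:timestamp(), Last) div 1000,
        case IdleMs >= ?INACTIVITY_TIMEOUT of
            true  -> io:format("idle: would notify the vnode manager~n");
            false -> ok
        end,
        erlang:send_after(?INACTIVITY_TIMEOUT, self(), inactivity_check),
        {next_state, active, State};
    handle_info(_Info, StateName, State) ->
        {next_state, StateName, State}.

    handle_event(_Event, StateName, State) -> {next_state, StateName, State}.
    handle_sync_event(_Event, _From, StateName, State) ->
        {reply, ok, StateName, State}.
    terminate(_Reason, _StateName, _State) -> ok.
    code_change(_OldVsn, StateName, State, _Extra) -> {ok, StateName, State}.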

jadeallenx commented 9 years ago

Or, I suppose another option would be to piggy-back on the management tick message.