Handoff startup is two steps: a call to Module:handoff_starting, followed by receipt of a ?FOLD_REQ message. The vnode can receive messages from its workers and other processes between these two steps (thanks, Joe), but it should not start handoff behavior until the ?FOLD_REQ message arrives.
This bug was detected by the dynamic cluster + MapReduce test. A vnode could have its handoff_starting function called, then receive a next_input request from one of its workers. The vnode, thinking it was in handoff, would tell the worker to archive. If the worker then finished and sent its archive back to the vnode before the ?FOLD_REQ message arrived, a bad_record error was raised in riak_pipe_vnode:archive_internal/2, because #state.handoff was still the atom starting (as set by the handoff_starting function) instead of the #handoff{} record it becomes once ?FOLD_REQ is received. This fix prevents the whole mess by maintaining normal, non-handoff operation until ?FOLD_REQ is received, so the response to next_input is the worker's next input instead of the archive command.
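The decision described above can be sketched roughly as follows. This is an illustrative model only, not the riak_pipe_vnode source: the module name handoff_sketch, the function names, the queue field, and the shapes of both records are assumptions made for the example. The only details taken from the description are that #state.handoff holds the atom starting after handoff_starting and a #handoff{} record after ?FOLD_REQ.

```erlang
%% Hypothetical sketch; names and record layouts are assumptions,
%% not the actual riak_pipe_vnode implementation.
-module(handoff_sketch).
-export([handoff_starting/1, fold_req/1, next_input_reply/1, demo/0]).

%% Simplified stand-ins for the real records.
-record(handoff, {archives = []}).
-record(state, {handoff = undefined, queue = []}).

%% Step 1: handoff has been announced, but no fold has been requested yet.
handoff_starting(State) ->
    State#state{handoff = starting}.

%% Step 2: the fold request arrives; only now is a #handoff{} record in place.
fold_req(State) ->
    State#state{handoff = #handoff{}}.

%% Decide how to answer a worker's next_input request.
%% Only a full #handoff{} record triggers the archive command; the atom
%% 'starting' falls through to the normal, non-handoff clauses.
next_input_reply(#state{handoff = #handoff{}} = State) ->
    {archive, State};
next_input_reply(#state{queue = [Input | Rest]} = State) ->
    {{input, Input}, State#state{queue = Rest}};
next_input_reply(#state{queue = []} = State) ->
    {wait, State}.

%% Walk through the window between the two startup steps.
demo() ->
    S0 = #state{queue = [a, b]},
    S1 = handoff_starting(S0),
    %% Between the steps: the worker still gets its next input.
    {{input, a}, S2} = next_input_reply(S1),
    S3 = fold_req(S2),
    %% After the fold request: the worker is told to archive.
    {archive, _} = next_input_reply(S3),
    ok.
```

Matching on the #handoff{} record rather than on "anything other than undefined" is the point of the sketch: a worker can only be told to archive once there is a record ready to receive its archive.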