Now that we have a reproducing test (`pipe_verify_handoff_blocking` in basho/riak_test#154), I'll explain a little more about what the trouble is here.

When any process sends an input to a fitting (via `riak_pipe:queue_work` or `riak_pipe_vnode_worker:send_output`), that process waits for a response from the vnode it sent the message to, and also sets up a monitor on that vnode. If the monitor triggers before the reply is received, the functions return a `vnode_down` error.

The problem comes during handoff, in at least two places: the blocking queue, and in-flight forwarding. In both of these cases, the inputs are given to the new vnode for processing, but the monitors continue watching the old vnode. At the end of handoff, the old vnode shuts down, and if the shutdown happens before the new vnode responds to the input sender, the monitor fires and we see `vnode_down`.
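To make the failure mode concrete, here is a minimal sketch of that send-monitor-wait pattern. It is illustrative Erlang, not the actual `riak_pipe_vnode` code; the `send_and_wait/2` name and the message shapes are assumptions.

```erlang
%% Illustrative only: not the actual riak_pipe_vnode code. The
%% {queue_input, ...} and {reply, ...} message shapes are assumptions.
send_and_wait(VnodePid, Input) ->
    Ref = erlang:monitor(process, VnodePid),
    VnodePid ! {queue_input, self(), Ref, Input},
    receive
        {reply, Ref, Result} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, VnodePid, _Reason} ->
            %% During handoff the input may already be queued on the
            %% new vnode, but this monitor still watches the old one,
            %% so its shutdown surfaces here as vnode_down.
            {error, vnode_down}
    end.
```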
Fixing this might be tricky.

It's not possible to fix at the pipe-using-application (Riak KV MR) level. The code there could watch for the `vnode_down` error return, but ignoring it might cause missing results, and resending inputs might cause duplicates.
At the pipe level, we could check the `'DOWN'` reason. If the vnode exited normally, assume that the input must have been forwarded correctly and does not need to be resent. The main problem is that we don't know where the input was handed off to, so we can't set up a new monitor on that new vnode. And if the exit is abnormal, we don't know whether the input made it or not.
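As a rough sketch of that idea (illustrative only; the `handoff_forwarded` atom and the message shapes are assumptions, not part of riak_pipe's API), the wait loop could branch on the `'DOWN'` reason:

```erlang
%% Illustrative only: branch on the 'DOWN' reason instead of always
%% returning vnode_down. handoff_forwarded is an assumed atom.
wait_for_reply(Ref, VnodePid) ->
    receive
        {reply, Ref, Result} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Result};
        {'DOWN', Ref, process, VnodePid, normal} ->
            %% The old vnode shut down cleanly after handoff: assume
            %% the input was forwarded, but note we have no pid for
            %% the new vnode to monitor for the eventual reply.
            {error, handoff_forwarded};
        {'DOWN', Ref, process, VnodePid, _Abnormal} ->
            %% Abnormal exit: we don't know whether the input made it.
            {error, vnode_down}
    end.
```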
Another option would be to remove the blocking queue from the handoff archive, and remove the `{forward, Status}` return from `riak_pipe_vnode:handle_handoff_command`. Instead, we'd just send errors back to the senders of those inputs, which would cause them to re-queue elsewhere. But, without a new error type (one that would indicate 'handoff' rather than a 'serious' failure), this would fail in the `n_val=1` case, and it doesn't cover the automatic-forwarding behavior of `riak_core_vnode` anyway.
Though, we might be able to extend this idea by essentially disabling `riak_core_vnode`'s automatic forwarding. We might do this by tagging input messages with the node we thought we were sending them to. On reception, if the tag does not match `node()`, the vnode rejects the input. This still requires a new error type to work around `n_val=1`. (This won't work anyway, because the "this was handed off" message still comes from the new vnode, while we're monitoring the old vnode.)
PR #64 takes the strategy of checking the `'DOWN'` reason. Thanks to @jrwest for pointing me at `riak_core_ring:next_owner`. It preserves the `vnode_down` error type for the case where the vnode's exit is abnormal.
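For reference, here is a rough sketch of how the `'DOWN'` reason and `riak_core_ring:next_owner/2` might be combined. This is not the PR #64 diff: it assumes `next_owner/2` returns `{Owner, NextOwner, Status}` for a partition with a pending transfer, and the `classify_down/2` shape and return values are illustrative.

```erlang
%% Rough sketch, not the PR #64 diff. Assumes next_owner/2 returns
%% {Owner, NextOwner, Status} for partitions in the ring's pending
%% transfer list; the return values here are illustrative.
classify_down(Partition, normal) ->
    {ok, Ring} = riak_core_ring_manager:get_my_ring(),
    case riak_core_ring:next_owner(Ring, Partition) of
        {_OldOwner, NewOwner, Status}
          when Status =:= awaiting; Status =:= complete ->
            %% The normal exit was part of handoff, so the input was
            %% (or will be) forwarded to NewOwner's vnode; don't
            %% report vnode_down to the caller.
            {forwarded, NewOwner};
        _NoPendingTransfer ->
            %% No transfer recorded for this partition: treat the
            %% exit as a real failure.
            {error, vnode_down}
    end;
classify_down(_Partition, _AbnormalReason) ->
    {error, vnode_down}.
```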
I believe the web of referenced issues has all been closed/merged, and that `vnode_down` during handoff is solved. I'm therefore closing this issue.
While handoff is happening, there are sometimes `vnode_down` errors. This ticket will track various tests/fixes as we work on them.