basho / riak_pipe

Riak Pipelines
Apache License 2.0
162 stars 60 forks source link

Handle noproc errors correctly in queue_work_wait #106

Closed JeetKunDoug closed 8 years ago

JeetKunDoug commented 8 years ago

Previously, queue_work_wait would handle a normal exit from a vnode which was handing off, but would not actually detect the case that the vnode exited before the erlang:monitor call (which then sends a 'DOWN' message with noproc as its reason instead of normal. Add handling of the noproc case and forward the message as well.

nickelization commented 8 years ago

I'm not sure if I'm convinced this is a good idea. Here's my concern:

If we get {'DOWN', ..., noproc}, then sure, maybe the vnode deliberately shut down after completing handoff: in that case we can treat it the same as {'DOWN', ..., normal}.

But, it's also possible that our work caused it to crash with some non-normal exit status. In that case, I think we would want to log an error and retry on the next preflist (see queue_work_erracc).

I feel like we should probably be starting the monitor before we try to send work to the vnode, instead of waiting until queue_work_wait. That way, if we get noproc we know the vnode had already shut down before we sent it work, and if our work causes the vnode to crash, we'll definitely get a non-normal 'DOWN' message.

JeetKunDoug commented 8 years ago

Will push WIP fixes to this branch, but we're going to hold off on the fix until 2.2.0 given it's been around for 3 years. Will close this PR in the mean time.