Closed JeetKunDoug closed 8 years ago
I'm not sure if I'm convinced this is a good idea. Here's my concern:
If we get {'DOWN', ..., noproc}
, then sure, maybe the vnode deliberately shut down after completing handoff: in that case we can treat it the same as {'DOWN', ..., normal}
.
But, it's also possible that our work caused it to crash with some non-normal exit status. In that case, I think we would want to log an error and retry on the next preflist (see queue_work_erracc
).
I feel like we should probably be starting the monitor before we try to send work to the vnode, instead of waiting until queue_work_wait
. That way, if we get noproc
we know the vnode had already shut down before we sent it work, and if our work causes the vnode to crash, we'll definitely get a non-normal 'DOWN'
message.
Will push WIP fixes to this branch, but we're going to hold off on the fix until 2.2.0 given it's been around for 3 years. Will close this PR in the mean time.
Previously, queue_work_wait would handle a normal exit from a vnode which was handing off, but would not actually detect the case that the vnode exited before the erlang:monitor call (which then sends a 'DOWN' message with
noproc
as its reason instead ofnormal
. Add handling of thenoproc
case and forward the message as well.