During the rewrite of `riak_test`'s `loaded_upgrade` test, I ran into some timeouts from pipe.
I started with a 4-node devrel cluster of 1.2.1 nodes and ran some map/reduce load, which occasionally timed out. @beerriot said this is OK, so I added a catch for these timeouts in my load generator. These timeouts came back from `riakc_pb_socket:mapred/3` as `{error, {timeout, _}}`, and that was great.
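For reference, this is roughly how the load generator handles them; a minimal sketch, assuming a connected `riakc_pb_socket` `Pid` and placeholder `Inputs`/`Query` terms:

```erlang
%% Sketch only: Pid is a connected riakc_pb_socket process,
%% Inputs/Query stand in for whatever the load generator actually runs.
run_mapred(Pid, Inputs, Query) ->
    case riakc_pb_socket:mapred(Pid, Inputs, Query) of
        {ok, Results} ->
            {ok, Results};
        {error, {timeout, _}} ->
            %% occasional pipe timeouts under load are expected; swallow them
            timeout;
        {error, _} = Error ->
            Error
    end.
```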
After taking down the `dev1` node, some other timeouts started rolling in. When that node was taken down, all processes applying load to it were also killed. The timeouts looked like this:
Also, increasing the number of pipe workers masks the problem, but I don't think that fixes it; it just means I can't apply enough load to see it.
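For context, by "increasing the number of pipe workers" I mean raising `riak_pipe`'s per-vnode worker limit in each node's app.config; the value shown is just an illustration, not a recommendation:

```erlang
%% In each node's app.config; riak_pipe's worker_limit defaults to 50.
%% 200 is an arbitrary example value used only to mask the timeouts.
{riak_pipe, [
    {worker_limit, 200}
]}
```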