Open slfritchie opened 11 years ago
@slfritchie To me this would indicate that the timeout is not terminating the request FSM.
Just noting that this is likely to eventually cause someone the same problems as seen by @joedevivo on basho/riak_test#79. Namely, this unhandled message may be received during an operation being handled by riak_kv_pb_bucket
, which does not have a catch-all ignore
case in its process_stream
function.
The timeout doesn't terminate the request FSM, no.
It seems to me that you have to handle it anyway, though, since even if you're terminating it, there's a race there.
PR#20 doesn't address Bryan's comment.
This same exact issue is happening to me; sample message:
[error] <0.20657.797>@riak_api_pb_server:handle_info:170 Unrecognized message {pipe_result,#Ref<0.0.842.47868>,0,[<<"ranktracker_summary_1499">>,<<"2013-01-02_248152">>]}
Followed by hundreds of thousands of lines of:
[error] <0.534.0>@riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup
updating to 1.2.1 should help with the last one, but it's good to know there's another path there. Note that that's typically a symptom of sending too many MR jobs for your cluster to handle.
My whole cluster has been upgraded to 1.2.1 as of three weeks ago...
My bad; the patch that I was thinking of will land in 1.3.
On 1/7/13 5:13 PM, Evan Vigil-McClanahan wrote:
My bad; the patch that I was thinking of will land in 1.3.
— Reply to this email directly or view it on GitHub https://github.com/basho/riak_api/issues/16#issuecomment-11976943.
Okay, thanks for the info.
This can still occur -- we have not addressed the root problem which is incomplete termination (and race-conditions around there) of request FSMs.
This is still occurring. The customer is using Riak 1.4.12 and Java client 1.4.4. Please see Zendesk ticket #11687 for more details. Example log output:
2015-08-18 02:22:01.882 UTC [error] <0.21975.339>@riak_api_pb_server:handle_info:141 Unrecognized message {8131652,ok}
I'm not positive but I may have just experienced this with riak TS 1.3.1:
2016-08-05 21:33:01.492 [error] <0.7481.54>@riak_api_pb_server:handle_info:331 Unrecognized message {ok,{[<<"creation_time">>,<<"start_time">>,<<"sample_period">>,<<"length">>],[ti mestamp,timestamp,sint64,sint64],[[1467746383099,1467752400000,3600,195],[1467746744137,1467752400000,3600,195],[1467747603697,1467756000000,3600,194],[1467747891827,1467756000000,...]}}
We are getting a bunch of these while load testing our cluster to tune our configuration (specifically the value for "timeseries_max_concurrent_queries").
If a prior PB operation times out, a late-arriving answer can trigger default clause
handle_info()
error log spam, e.g.We've a customer who's had intra-cluster communication problems (including
busy_dist_port
warnings) where 40 or more of these messages would be logged per second.