Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111]

basho / riak_api

Riak Client APIs

27 stars 45 forks source link

Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] #16

Open slfritchie opened 11 years ago

slfritchie commented 11 years ago

If a prior PB operation times out, a late-arriving answer can trigger default clause handle_info() error log spam, e.g.

2012-10-21 01:44:14.250 [error] <0.31976.2002>@riak_api_pb_server:handle_info:170 Unrecognized message {51911707,{ok,{r_object,<<"bucket">>, ....

We've a customer who's had intra-cluster communication problems (including busy_dist_port warnings) where 40 or more of these messages would be logged per second.

seancribbs commented 11 years ago

@slfritchie To me this would indicate that the timeout is not terminating the request FSM.

beerriot commented 11 years ago

Just noting that this is likely to eventually cause someone the same problems as seen by @joedevivo on basho/riak_test#79. Namely, this unhandled message may be received during an operation being handled by riak_kv_pb_bucket, which does not have a catch-all ignore case in its process_stream function.

evanmcc commented 11 years ago

The timeout doesn't terminate the request FSM, no.

It seems to me that you have to handle it anyway, though, since even if you're terminating it, there's a race there.

PR#20 doesn't address Bryan's comment.

ixmatus commented 11 years ago

This same exact issue is happening to me; sample message:

[error] <0.20657.797>@riak_api_pb_server:handle_info:170 Unrecognized message {pipe_result,#Ref<0.0.842.47868>,0,[<<"ranktracker_summary_1499">>,<<"2013-01-02_248152">>]}

Followed by hundreds of thousands of lines of:

[error] <0.534.0>@riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup

evanmcc commented 11 years ago

updating to 1.2.1 should help with the last one, but it's good to know there's another path there. Note that that's typically a symptom of sending too many MR jobs for your cluster to handle.

ixmatus commented 11 years ago

My whole cluster has been upgraded to 1.2.1 as of three weeks ago...

evanmcc commented 11 years ago

My bad; the patch that I was thinking of will land in 1.3.

ixmatus commented 11 years ago

On 1/7/13 5:13 PM, Evan Vigil-McClanahan wrote:

My bad; the patch that I was thinking of will land in 1.3.

— Reply to this email directly or view it on GitHub https://github.com/basho/riak_api/issues/16#issuecomment-11976943.

Okay, thanks for the info.

seancribbs commented 10 years ago

This can still occur -- we have not addressed the root problem which is incomplete termination (and race-conditions around there) of request FSMs.

nerophon commented 9 years ago

This is still occurring. The customer is using Riak 1.4.12 and Java client 1.4.4. Please see Zendesk ticket #11687 for more details. Example log output:

2015-08-18 02:22:01.882 UTC [error] <0.21975.339>@riak_api_pb_server:handle_info:141 Unrecognized message {8131652,ok}

point9repeating commented 8 years ago

I'm not positive but I may have just experienced this with riak TS 1.3.1:

2016-08-05 21:33:01.492 [error] <0.7481.54>@riak_api_pb_server:handle_info:331 Unrecognized message {ok,{[<<"creation_time">>,<<"start_time">>,<<"sample_period">>,<<"length">>],[ti mestamp,timestamp,sint64,sint64],[[1467746383099,1467752400000,3600,195],[1467746744137,1467752400000,3600,195],[1467747603697,1467756000000,3600,194],[1467747891827,1467756000000,...]}}

We are getting a bunch of these while load testing our cluster to tune our configuration (specifically the value for "timeseries_max_concurrent_queries").