Open nerophon opened 7 years ago
This kind of timeout error is generally caused by trying to get vnode statuses, which calls out to each vnode to get a few data points:
1) per_key_epoch information (Counter State for the vnode)
2) Backend status
3) Vnode ID
Originally, we thought this call existed only to get the backend status/eleveldb-specific stats, which could have been gathered a different way, but it turns out that more than just the backend status comes out of this call. The problem is that it is very likely to time out on a heavily loaded system, and we don't handle the timeout case in a reasonable way when we call `riak_kv_status:vnode_status/0`, which causes the cascade of failures that ends in the error above. We should handle a timeout on vnode status more gracefully and still return the other stats, omitting only the vnode stats.
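As a hedged illustration of that last suggestion (not the actual riak_kv implementation), the vnode status call could be wrapped so that a timeout degrades to missing vnode stats instead of failing the whole stats request. The wrapper name and the exit pattern caught below are assumptions:

```erlang
%% Hypothetical sketch, not the riak_kv implementation: guard the vnode
%% status call so a timeout on a loaded node means "skip vnode stats"
%% rather than a crash. The function name and the caught exit pattern
%% are assumptions.
maybe_vnode_status() ->
    try
        {ok, riak_kv_status:vnode_status()}
    catch
        exit:{timeout, _} ->
            %% Vnodes did not answer in time; the caller can still return
            %% every other stat.
            {error, vnode_status_timeout}
    end.
```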
[posted via JIRA by Douglas Rohrer]
After making a riak_test to call `/stats` once a second, I was unable to create this overload scenario. Additionally, `riak_kv_wm_stats:get_stats()`, which is called when the cache is too old to handle the stats call, does not call out to the vnode status call. After tracing the code, I'd say we're probably looking at an exometer deadlock which could stall the http_cache waiting for results from the underlying `riak_kv_status:aliases()` call. More investigation needed.
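As a hypothetical diagnostic (not part of the original comment), one way to check the deadlock theory from `riak attach` is to look at what the suspected processes are currently blocked on and how deep their mailboxes are; the registered names below are assumptions about which processes are involved:

```erlang
%% Hypothetical diagnostic, not from the ticket: show the current call,
%% stack, and mailbox depth of the processes suspected to be wedged.
%% The registered names listed here are assumptions.
[{Name, case whereis(Name) of
            undefined -> not_running;
            Pid       -> process_info(Pid, [current_function,
                                            current_stacktrace,
                                            message_queue_len])
        end}
 || Name <- [exometer_admin, riak_kv_stat]].
```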
[posted via JIRA by Brian Sparrow]
Will do some larger testing to reproduce and will send notes to CliServ with instructions for getting a crash dump.
[posted via JIRA by Patricia Brewer]
Original ticket: 14071.
@ramensen please note this with regard to today's discussion
RiakKV stats can become stuck under high load.
Polling persistently fails as follows:
This issue has been reproduced by a customer in RiakKV 2.0.7 and by the client services team in RiakKV 2.2.0. The customer sees it under high production load. We also see it under high test load, specifically a combination of:
During this test we noticed that responses throughout the cluster were slow, and we measured constrained disk I/O. We consider these elements potentially causal rather than consequential.
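For reference, a minimal sketch of the kind of once-per-second `/stats` polling used when trying to reproduce this; the URL assumes the stock HTTP listener on 127.0.0.1:8098 and is not taken from the ticket:

```erlang
%% Rough illustration only: poll /stats once a second and print the HTTP
%% status and latency. The URL assumes the default HTTP listener; it is
%% not taken from the ticket.
ok = application:ensure_started(inets),
Url = "http://127.0.0.1:8098/stats",
Poll = fun Loop(N) ->
           {Micros, Result} = timer:tc(httpc, request, [Url]),
           Status = case Result of
                        {ok, {{_, Code, _}, _, _}} -> Code;
                        {error, Reason}            -> Reason
                    end,
           io:format("poll ~p: ~p in ~p ms~n", [N, Status, Micros div 1000]),
           timer:sleep(1000),
           Loop(N + 1)
       end,
Poll(1).
```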
There is a workaround to "unstick" stats as follows:
This has been used successfully by us and by the customer. It solves the problem temporarily, but continuing high load causes the issue to recur. At the customer, the frequency of recurrence was five times per day on average.
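The exact workaround commands are not reproduced above. As a hedged sketch of the general shape such an "unstick" step often takes (killing the wedged stats server from `riak attach` so its supervisor restarts it), with `riak_kv_stat` as an assumed, not confirmed, process name:

```erlang
%% Hedged sketch only; the workaround recorded in the original ticket is
%% not shown above. Killing a wedged gen_server lets its supervisor
%% restart it with fresh state. The name riak_kv_stat is an assumption.
case whereis(riak_kv_stat) of
    undefined -> ok;
    Pid       -> exit(Pid, kill)
end.
```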