basho / riak_kv

Riak Key/Value Store
Apache License 2.0

riak_kv_http_cache hangs under load [JIRA: RIAK-3074] #1556

Open nerophon opened 7 years ago

nerophon commented 7 years ago

RiakKV stats can become stuck under high load.

Polling persistently fails as follows:

<html><head><title>500 Internal Server Error</title></head><body><h1>Internal Server Error</h1>The server encountered an error while processing this request:<br><pre>{error, 
{exit, 
{timeout,{gen_server,call,[riak_kv_http_cache,get_stats]}}, 
[{gen_server,call,2,[{file,"gen_server.erl"},{line,180}]}, 
{riak_kv_wm_stats,produce_body,2, 
[{file,"src/riak_kv_wm_stats.erl"},{line,77}]}, 
{webmachine_resource,resource_call,3, 
[{file,"src/webmachine_resource.erl"},{line,186}]}, 
{webmachine_resource,do,3, 
[{file,"src/webmachine_resource.erl"},{line,142}]}, 
{webmachine_decision_core,resource_call,1, 
[{file,"src/webmachine_decision_core.erl"},{line,48}]}, 
{webmachine_decision_core,decision,1, 
[{file,"src/webmachine_decision_core.erl"},{line,558}]}, 
{webmachine_decision_core,handle_request,2, 
[{file,"src/webmachine_decision_core.erl"},{line,33}]}, 
{webmachine_mochiweb,loop,2, 
[{file,"src/webmachine_mochiweb.erl"},{line,74}]}]}}</pre><P><HR><ADDRESS>mochiweb+webmachine web server</ADDRESS></body></html>

This issue has been reproduced by a customer in RiakKV 2.0.7 and by the client services team in RiakKV 2.2.0. The customer sees it under high production load. We also see it under high test load, specifically a combination of:

During this test we noticed that responses throughout the cluster were slow, and we measured constrained disk I/O. We consider these factors potentially causal, but not consequential.

There is a workaround to "unstick" stats as follows:

exit(erlang:whereis(riak_kv_http_cache), kill).

Both we and the customer have used this successfully. It solves the problem temporarily, but continued high load causes the issue to recur; for the customer it recurred about five times per day on average.
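For reference, a minimal way to apply the workaround is from an attached console (this assumes a standard install where riak attach is available; the node name below is only an example). The cache process appears to be restarted by its supervisor after the kill, which is why stats come back but can get stuck again later:

$ riak attach
(riak@node1)1> exit(erlang:whereis(riak_kv_http_cache), kill).
true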

Basho-JIRA commented 7 years ago

This kind of timeout error is generally caused by trying to get vnode statuses, which calls out to each vnode to collect a few data points:

1) per_key_epoch information (the counter state for the vnode)
2) Backend status
3) Vnode ID

Originally we thought this was only fetching backend status / eleveldb-specific stats that could have been obtained another way, but it turns out more than just backend status comes out of this call. The problem is that this call is very likely to time out on a heavily loaded system, and we don't handle the timeout case in a reasonable way when we call riak_kv_status:vnode_status/0, which causes the cascade of failures that ends in the error shown above. We should handle a timeout on vnode status more gracefully and still return the other stats, just without the vnode stats.

[posted via JIRA by Douglas Rohrer]
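A rough sketch of the more graceful handling suggested above (not the actual fix; the wrapper name is hypothetical) would catch the gen_server timeout around the vnode status call and let the caller omit the vnode stats rather than failing the whole request:

%% Hypothetical wrapper: if the vnode status call times out, report that
%% instead of letting the exit propagate up through produce_body/2.
safe_vnode_status() ->
    try
        {ok, riak_kv_status:vnode_status()}
    catch
        exit:{timeout, {gen_server, call, _}} ->
            {error, vnode_status_timeout}
    end.

Callers could then merge vnode stats only on {ok, _} and still return the remaining stats on a timeout.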

Basho-JIRA commented 7 years ago

After writing a riak_test that calls /stats once a second, I was unable to reproduce this overload scenario. Additionally, riak_kv_wm_stats:get_stats(), which is called when the cached stats are too old to serve the request, does not make the vnode status call. After tracing the code, I'd say we're probably looking at an exometer deadlock that could stall the http_cache while it waits for results from the underlying riak_kv_status:aliases() call. More investigation is needed.

[posted via JIRA by Brian Sparrow]
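The riak_test mentioned above isn't shown here; a stand-alone poller along the same lines might look like the following (the URL is an assumption, 8098 being Riak's default HTTP listener port):

%% Poll /stats once a second and log anything other than a 200 response.
poll_stats() ->
    inets:start(),
    poll_stats_loop("http://127.0.0.1:8098/stats").

poll_stats_loop(Url) ->
    case httpc:request(get, {Url, []}, [{timeout, 10000}], []) of
        {ok, {{_, 200, _}, _, _}} -> ok;
        Other -> io:format("stats call failed: ~p~n", [Other])
    end,
    timer:sleep(1000),
    poll_stats_loop(Url).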

Basho-JIRA commented 7 years ago

Will do some larger-scale testing to reproduce, and will send notes to CliServ instructing them to capture a crash dump.

[posted via JIRA by Patricia Brewer]

nerophon commented 7 years ago

Original ticket: 14071.

martinsumner commented 5 years ago

@ramensen please note this with regard to today's discussion