Pinging @wbrown, let's focus all inactivity timeout research on this issue now.
So, a few observations now that I've filled up my database and pushed my nodes to the limit.
* `PUT` latency increases when the throttle is eased back, due to my inbound data deduplicating data that's already in Riak. I see the average `PUT` latency go up to about 65ms, 99ms, and 216ms.
* My searches get bottlenecked by all the `PUT`s being indexed in Yokozuna. On the client side, I solve a lot of concurrency issues by spinning up multiple socket connections to Riak, and I can do this on a demand basis -- i.e. searches get their own Riak connections. But Yokozuna indexes and updates are essentially sharing the same connection pool. This is exacerbated by the multithreaded search result paging code that I have. The basic flow goes:
  1. An initial search with `maxRows` of 100. This helps get small searches done quicker, if they have small result sets.
  2. We check the `num_found` key, and if it is more than 100, we divide the remainder into chunks of 1000-item ranges.
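To make the flow concrete, here is a minimal sketch of that paging logic in Python. It is not my actual client code: it assumes a Solr-style HTTP search endpoint fronted by Yokozuna and uses standard Solr paging parameters (`q`, `start`, `rows`, `wt=json`); the URL, query, and worker count are placeholders.

```python
# Minimal sketch of the paging flow described above (illustrative only).
# Assumes Yokozuna proxies a Solr-style query interface, so standard Solr
# parameters apply; SEARCH_URL and QUERY are placeholders.
import concurrent.futures
import requests

SEARCH_URL = "http://localhost:8098/search/obs"   # placeholder endpoint
QUERY = "field_s:some_value"                      # placeholder query
INITIAL_ROWS = 100
CHUNK_ROWS = 1000

def fetch(start, rows):
    """Run one search batch and return Solr's 'response' section."""
    resp = requests.get(SEARCH_URL, params={
        "q": QUERY, "start": start, "rows": rows, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]

def search_all():
    # Step 1: initial search with maxRows of 100 -- small result sets
    # finish here with no extra round trips.
    first = fetch(0, INITIAL_ROWS)
    docs = list(first["docs"])
    num_found = first["numFound"]      # exposed as num_found in the PB client
    if num_found <= INITIAL_ROWS:
        return docs

    # Step 2: divide the remainder into 1000-row chunks and fetch them
    # concurrently, each batch on its own worker thread.
    starts = range(INITIAL_ROWS, num_found, CHUNK_ROWS)
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for batch in pool.map(lambda s: fetch(s, CHUNK_ROWS), starts):
            docs.extend(batch["docs"])
    return docs
```

Each batch runs on its own worker thread, which loosely mirrors giving each search its own connection on the client side.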
Search performance-wise, I get:

* About 13.69ms for a search whose results fit within the initial 100 rows.
* One 100 `maxRows` search and three 1000 `maxRows` searches with an average latency range of 176.90ms to 242.92ms per search, with the three bigger searches running concurrently. This ends up at a total wall clock time of 730ms to 1072ms.
* When the `ibrowse` socket pool gets filled up, this goes up dramatically to about 50s to 60s!

So, in this case, the `ibrowse` load balancing algorithm bites us on both ends -- we don't want more connections in the pool, but when all connections in the pool are busy, it severely impacts actual retrieval of the data via searches.
**Full Dataset from a Standing Start**

Some extra data points -- unloaded, and after the machine has been idle a while, I see:
Searches against the same dataset, different value in one field:
* 2594 results found -- Batches: (1x 100, 3x 1000), Average Latency per Batch: 212ms, Total Time: 937.92ms
* 2572 results found -- Batches: (1x 100, 3x 1000), Average Latency per Batch: 143ms, Total Time: 615.38ms
**Cluster Status**

* `n=2` replication, using the `bitcask` backend
* 200GB of compressed on-disk storage used per node on the filer, for an actual data set size of 400GB; the breakdown is:

```
wbrown@stratus:/internal/riak/altostratus$ du -skhc ./*
27G ./anti_entropy
135G ./bitcask
689K ./cluster_meta
337K ./kv_vnode
17K ./riak_kv_exchange_fsm
114K ./ring
25G ./yz
15G ./yz_anti_entropy
201G total
```

* 59.93mm documents per node.

**Current Performance of Data Import**

I just kicked off another data import job, and wow, my system is pretty hammered, but still performant considering:
* `PUT` latencies are at 28ms to 158ms; the median appears to be around 50ms.
* 1000 keys `PUT` a second, with the following latency figures (see the conversion sketch after this list):

```
node_puts_total : 57504552
node_put_fsm_time_mean : 85170
node_put_fsm_time_median : 1705
node_put_fsm_time_95 : 19494
node_put_fsm_time_99 : 4048624
node_put_fsm_time_100 : 5971479
```
* 10K ZFS read IOPS against the filer over the network.
* 300MB to 600MB per second from the filer, per node over NFS, for an aggregate read throughput of 600MB to 1.2GB per second. (!!!? Infiniband ho!) I think Solr may be doing its rewrite of the indexes in response to a trigger -- or just reloading data, as it appears to shrink while idle.
* `zswap` really, really helps with Bitcask backends, especially if you set the swappiness low and have a small swap partition (1GB in this case).
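Two quick reference sketches for the figures above. First, Riak reports the `node_put_fsm_time_*` stats in microseconds, so the mean above is about 85ms, the median about 1.7ms, the 99th percentile roughly 4s, and the max close to 6s. A small sketch that pulls them from the HTTP `/stats` endpoint and converts them to milliseconds (the host/port is a placeholder for one node):

```python
# Sketch: fetch Riak's put FSM timings from the HTTP /stats endpoint and
# print them in milliseconds (Riak reports them in microseconds).
import requests

STATS_URL = "http://localhost:8098/stats"   # placeholder node address

def put_fsm_times_ms():
    stats = requests.get(STATS_URL, headers={"Accept": "application/json"}).json()
    keys = ["node_put_fsm_time_mean", "node_put_fsm_time_median",
            "node_put_fsm_time_95", "node_put_fsm_time_99",
            "node_put_fsm_time_100"]
    return {k: stats[k] / 1000.0 for k in keys}

if __name__ == "__main__":
    for name, ms in put_fsm_times_ms().items():
        print(f"{name}: {ms:.1f} ms")
```

Second, for the zswap point, a sketch that just prints the relevant kernel knobs via the standard procfs/sysfs paths (nothing Riak-specific; the zswap entries only exist on kernels with zswap available):

```python
# Sketch: report current swappiness and zswap settings referenced above.
from pathlib import Path

def read(path):
    p = Path(path)
    return p.read_text().strip() if p.exists() else "n/a"

print("vm.swappiness          :", read("/proc/sys/vm/swappiness"))
print("zswap enabled          :", read("/sys/module/zswap/parameters/enabled"))
print("zswap compressor       :", read("/sys/module/zswap/parameters/compressor"))
print("zswap max_pool_percent :", read("/sys/module/zswap/parameters/max_pool_percent"))
```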
Some interesting error messages:

```
2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]
```
Failure to index due to timeouts:
```
2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,304}]}]
```
Error in entropy:
```
2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...
```
**New Test**

I will probably redo this multi-day dataset reload, with the following changes to see if it improves things:

* Ease back the `HugePages` allocation a little -- I've never seen it go above about 2GB of HugePages:

```
HugePages_Total: 3072
HugePages_Free: 80
HugePages_Rsvd: 68
HugePages_Surp: 0
```
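For reference, a quick sketch that turns those HugePages counters into gigabytes, reading the page size from `/proc/meminfo` rather than assuming 2MB pages (standard Linux interface, nothing Riak-specific):

```python
# Sketch: summarize huge page usage from /proc/meminfo in GB.
import re

def hugepage_summary(path="/proc/meminfo"):
    info = {}
    with open(path) as f:
        for line in f:
            m = re.match(r"(\w+):\s+(\d+)", line)
            if m:
                info[m.group(1)] = int(m.group(2))
    page_gb = info["Hugepagesize"] / (1024 * 1024)   # Hugepagesize is in kB
    return {
        "total_gb": info["HugePages_Total"] * page_gb,
        "free_gb": info["HugePages_Free"] * page_gb,
        "used_gb": (info["HugePages_Total"] - info["HugePages_Free"]) * page_gb,
    }

print(hugepage_summary())
```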
Although the current allocation is because of memory pressure:

```
2014-04-23 01:56:49.304 [info] <0.515.0>@yz_solr_proc:handle_info:135 solr stdout/err: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007a7000000, 35651584, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
```
Anything you'd like to see changed or tested on my next go-round at this?
@wbrown Sorry for the late comments. Trying to keep pace with you :).
> Some interesting error messages:

```
2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]
```
Wow, those message queues are much too large. There should be a throttle mechanism in the AAE code to prevent this overload, but perhaps it is not working properly. I also remember you saying in another ticket that you increased your overload threshold for the cluster because of Infiniband. I think this is a bad idea. You want to avoid overly large message queues in Erlang. Increasing the overload threshold will just allow more queues to get larger.
> Failure to index due to timeouts:

```
2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_com
```
This simply indicates that Solr isn't keeping up with the load.
> Error in entropy:

```
2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...
```
This is #324; it is an acceptable error, as AAE will just retry later.
I'm closing this issue since it was about the inactivity timeout, which has been raised for 2.0.0 until a better solution can be implemented in 2.x.x.
After digging into issues #320, #330, and #358 it was discovered that ibrowse's default `inactivity_timeout`, combined with its poor load balancing algorithm and the default pool and pipeline sizes, is causing unnecessary socket churn. Even under load the client may not work Yokozuna hard enough to fill all 10 pipelines and keep each connection from timing out. Increasing the pool or pipeline size just makes things worse, as the ibrowse algorithm first wants to fill the pool before using pipelines, causing it to make a new connection for almost every request. My comment on #330 goes into a bit more detail and includes evidence à la DTrace [1].

**Action Items**