Pinging @wbrown, let's focus all inactivity timeout research on this issue now.
So, a few observations now that I've filled up my database and pushed my nodes to the limit.
* `PUT` latency increases when the throttle is eased back, due to my inbound data deduplicating data that's already in Riak. I see the average `PUT` latency go up to about 65ms, 99ms, and 216ms.
* My searches get bottlenecked by all the `PUT`s being indexed in Yokozuna. On the client side, I solve a lot of concurrency issues by spinning up multiple socket connections to Riak, and I can do this on a demand basis -- i.e. searches get their own Riak connections. But Yokozuna indexes and updates are essentially sharing the same connection pool. This is exacerbated by the multithreaded search result paging code that I have. The basic flow goes:
  1. An initial search with `maxRows` of 100. This helps get small searches done quicker, if they have small result sets.
  2. We check the `num_found` key, and if it is more than 100, we divide the remainder into chunks of 1000-item ranges.
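To make the flow concrete, here is a minimal sketch of that paging logic in Python. It is not my actual client code: it assumes a Solr-style HTTP search endpoint fronted by Yokozuna and uses standard Solr paging parameters (`q`, `start`, `rows`, `wt=json`); the URL, query, and worker count are placeholders.

```python
# Minimal sketch of the paging flow described above (illustrative only).
# Assumes Yokozuna proxies a Solr-style query interface, so standard Solr
# parameters apply; SEARCH_URL and QUERY are placeholders.
import concurrent.futures
import requests

SEARCH_URL = "http://localhost:8098/search/obs"   # placeholder endpoint
QUERY = "field_s:some_value"                      # placeholder query
INITIAL_ROWS = 100
CHUNK_ROWS = 1000

def fetch(start, rows):
    """Run one search batch and return Solr's 'response' section."""
    resp = requests.get(SEARCH_URL, params={
        "q": QUERY, "start": start, "rows": rows, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]

def search_all():
    # Step 1: initial search with maxRows of 100 -- small result sets
    # finish here with no extra round trips.
    first = fetch(0, INITIAL_ROWS)
    docs = list(first["docs"])
    num_found = first["numFound"]      # exposed as num_found in the PB client
    if num_found <= INITIAL_ROWS:
        return docs

    # Step 2: divide the remainder into 1000-row chunks and fetch them
    # concurrently, each batch on its own worker thread.
    starts = range(INITIAL_ROWS, num_found, CHUNK_ROWS)
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for batch in pool.map(lambda s: fetch(s, CHUNK_ROWS), starts):
            docs.extend(batch["docs"])
    return docs
```

Each batch runs on its own worker thread, which loosely mirrors giving each search its own connection on the client side.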
Search performance-wise, I get:

* About 13.69ms for a search whose results fit within the initial 100 rows.
* One 100 `maxRows` search and three 1000 `maxRows` searches with an average latency range of 176.90ms to 242.92ms per search, with the three bigger searches running concurrently. This ends up at a total wall clock time of 730ms to 1072ms.
* When the `ibrowse` socket pool gets filled up, this goes up dramatically to about 50s to 60s!

So, in this case, the `ibrowse` load balancing algorithm bites us on both ends -- we don't want more connections in the pool, but when all connections in the pool are busy, it severely impacts actual retrieval of the data via searches.
**Full Dataset from a Standing Start**

Some extra data points -- unloaded, and after the machine has been idle a while, I see:
Searches against the same dataset, different value in one field:
* 2594 results found -- Batches: (1x 100, 3x 1000), Average Latency per Batch: 212ms, Total Time: 937.92ms
* 2572 results found -- Batches: (1x 100, 3x 1000), Average Latency per Batch: 143ms, Total Time: 615.38ms
**Cluster Status**

* `n=2` replication, using the `bitcask` backend
* 200GB of compressed on-disk storage used per node on the filer, for an actual data set size of 400GB; the breakdown is:

```
wbrown@stratus:/internal/riak/altostratus$ du -skhc ./*
27G ./anti_entropy
135G ./bitcask
689K ./cluster_meta
337K ./kv_vnode
17K ./riak_kv_exchange_fsm
114K ./ring
25G ./yz
15G ./yz_anti_entropy
201G total
```

* 59.93mm documents per node.

**Current Performance of Data Import**

I just kicked off another data import job, and wow, my system is pretty hammered, but still performant considering:
* `PUT` latencies are at 28ms to 158ms; the median appears to be around 50ms.
* 1000 keys `PUT` a second, with the following latency figures (see the conversion sketch after this list):

```
node_puts_total : 57504552
node_put_fsm_time_mean : 85170
node_put_fsm_time_median : 1705
node_put_fsm_time_95 : 19494
node_put_fsm_time_99 : 4048624
node_put_fsm_time_100 : 5971479
```
* 10K ZFS read IOPS against the filer over the network.
* 300MB to 600MB per second from the filer, per node over NFS, for an aggregate read throughput of 600MB to 1.2GB per second. (!!!? Infiniband ho!) I think Solr may be doing its rewrite of the indexes in response to a trigger -- or just reloading data, as it appears to shrink while idle.
* `zswap` really, really helps with Bitcask backends, especially if you set the swappiness low and have a small swap partition (1GB in this case).
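Two quick reference sketches for the figures above. First, Riak reports the `node_put_fsm_time_*` stats in microseconds, so the mean above is about 85ms, the median about 1.7ms, the 99th percentile roughly 4s, and the max close to 6s. A small sketch that pulls them from the HTTP `/stats` endpoint and converts them to milliseconds (the host/port is a placeholder for one node):

```python
# Sketch: fetch Riak's put FSM timings from the HTTP /stats endpoint and
# print them in milliseconds (Riak reports them in microseconds).
import requests

STATS_URL = "http://localhost:8098/stats"   # placeholder node address

def put_fsm_times_ms():
    stats = requests.get(STATS_URL, headers={"Accept": "application/json"}).json()
    keys = ["node_put_fsm_time_mean", "node_put_fsm_time_median",
            "node_put_fsm_time_95", "node_put_fsm_time_99",
            "node_put_fsm_time_100"]
    return {k: stats[k] / 1000.0 for k in keys}

if __name__ == "__main__":
    for name, ms in put_fsm_times_ms().items():
        print(f"{name}: {ms:.1f} ms")
```

Second, for the zswap point, a sketch that just prints the relevant kernel knobs via the standard procfs/sysfs paths (nothing Riak-specific; the zswap entries only exist on kernels with zswap available):

```python
# Sketch: report current swappiness and zswap settings referenced above.
from pathlib import Path

def read(path):
    p = Path(path)
    return p.read_text().strip() if p.exists() else "n/a"

print("vm.swappiness          :", read("/proc/sys/vm/swappiness"))
print("zswap enabled          :", read("/sys/module/zswap/parameters/enabled"))
print("zswap compressor       :", read("/sys/module/zswap/parameters/compressor"))
print("zswap max_pool_percent :", read("/sys/module/zswap/parameters/max_pool_percent"))
```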
Some interesting error messages:

```
2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]
```
Failure to index due to timeouts:
```
2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,304}]}]
```
Error in entropy:
```
2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...
```
**New Test**

I will probably redo this multi-day dataset reload, with the following changes to see if it improves things:

* Ease back the `HugePages` allocation a little -- I've never seen it go above about 2GB of HugePages:

```
HugePages_Total: 3072
HugePages_Free: 80
HugePages_Rsvd: 68
HugePages_Surp: 0
```
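For reference, a quick sketch that turns those HugePages counters into gigabytes, reading the page size from `/proc/meminfo` rather than assuming 2MB pages (standard Linux interface, nothing Riak-specific):

```python
# Sketch: summarize huge page usage from /proc/meminfo in GB.
import re

def hugepage_summary(path="/proc/meminfo"):
    info = {}
    with open(path) as f:
        for line in f:
            m = re.match(r"(\w+):\s+(\d+)", line)
            if m:
                info[m.group(1)] = int(m.group(2))
    page_gb = info["Hugepagesize"] / (1024 * 1024)   # Hugepagesize is in kB
    return {
        "total_gb": info["HugePages_Total"] * page_gb,
        "free_gb": info["HugePages_Free"] * page_gb,
        "used_gb": (info["HugePages_Total"] - info["HugePages_Free"]) * page_gb,
    }

print(hugepage_summary())
```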
Although the current allocation is because of memory pressure:

```
2014-04-23 01:56:49.304 [info] <0.515.0>@yz_solr_proc:handle_info:135 solr stdout/err: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007a7000000, 35651584, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
```
Anything you'd like to see changed or tested on my next go-round at this?
@wbrown Sorry for the late comments. Trying to keep pace with you :).
> Some interesting error messages:

```
2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]
```
Wow, those message queues are much too large. There should be a throttle mechanism in the AAE code to prevent this overload, but perhaps it is not working properly. I also remember you saying in another ticket that you increased your overload threshold for the cluster because of Infiniband. I think this is a bad idea. You want to avoid overly large message queues in Erlang. Increasing the overload threshold will just allow more queues to get larger.
> Failure to index due to timeouts:

```
2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_com
```
This simply indicates that Solr isn't keeping up with the load.
> Error in entropy:

```
2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...
```
This is #324; it is an acceptable error, as AAE will just retry later.
I'm closing this issue since it was about the inactivity timeout, which has been raised for 2.0.0 until a better solution can be implemented in 2.x.x.
After digging into issues #320, #330, and #358 it was discovered that ibrowse's default `inactivity_timeout`, combined with its poor load balancing algorithm and the default pool and pipeline sizes, is causing unnecessary socket churn. Even under load the client may not work Yokozuna hard enough to fill all 10 pipelines and keep each connection from timing out. Increasing the pool or pipeline size just makes things worse, as the ibrowse algorithm first wants to fill the pool before using pipelines, causing it to make a new connection for almost every request. My comment on #330 goes into a bit more detail and includes evidence à la DTrace [1].

**Action Items**