basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

node_put_fsm_active maxing out [JIRA: RIAK-2268] #789

Open cread opened 8 years ago

cread commented 8 years ago

We've been building Riak from source for a while, but I've had trouble getting the 2.1.x releases built reliably, so I'd like to revert to using the .deb package. The problem is that in our test environment we always manage to max out node_put_fsm_active under sustained write load, and the count never drops back down, even when disk activity falls to zero.

When running riak 2.0.4 on R16B02-basho5 (our current prod version) everything is working as expected.

Using the .deb package of 2.0.4 moves us to R16B02_basho6, and that is where the problem appears: node_put_fsm_active climbs and never drops back down again, even after the write load stops.

Further testing with the 2.0.6 and 2.1.1 .deb packages (both built on R16B02_basho8) shows the same problem.

As the build that works uses a HiPE-enabled VM and the one that locks up does not, could this be a VM difference?
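(For anyone wanting to watch just this stat over time: the node's HTTP /stats endpoint exposes it as JSON. A minimal sketch in Erlang, assuming mochijson2 is on the code path and using a placeholder node address:)

%% Fetch the stats JSON from Riak's HTTP interface (port 8098 by default)
%% and pull out node_put_fsm_active; 10.1.1.1 is a placeholder address.
inets:start(),
{ok, {_Status, _Headers, Body}} = httpc:request("http://10.1.1.1:8098/stats"),
{struct, Stats} = mochijson2:decode(iolist_to_binary(Body)),
proplists:get_value(<<"node_put_fsm_active">>, Stats).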

Some technical details of the hardware:

Selected riak-admin status output:

System that works:

sys_system_architecture : <<"x86_64-unknown-linux-gnu">>
sys_system_version : <<"Erlang R16B02-basho5 (erts-5.10.3) [source] [64-bit] [smp:24:24] [async-threads:64] [hipe] [kernel-poll:true]">>
riak_auth_mods_version : <<"2.0.0-1-g9ae39fe">>
erlydtl_version : <<"0.7.0">>
riak_control_version : <<"2.0.0-3-ge6f9b10">>
cluster_info_version : <<"2.0.0-0-g1fae829">>
yokozuna_version : <<"2.0.0-34-g122659d">>
ibrowse_version : <<"4.0.1">>
riak_search_version : <<"2.0.0-7-g86bfb89">>
merge_index_version : <<"2.0.0-0-gb701dde">>
riak_kv_version : <<"2.0.2-42-g757111c">>
riak_api_version : <<"2.0.1-6-g618cab1">>
riak_pb_version : <<"2.0.0.16-0-gfc18a9b">>
protobuffs_version : <<"0.8.1p4-0-g5257dfe">>
riak_dt_version : <<"2.0.0-22-g5dd5307">>
sidejob_version : <<"2.0.0-0-gc5aabba">>
riak_pipe_version : <<"2.0.0-10-gc8fc8da">>
riak_core_version : <<"2.0.1-100-gb2103c2">>
exometer_core_version : <<"1.0.0-basho1-0-ga46aa8b">>
poolboy_version : <<"0.8.1p2-0-g84d836a">>
pbkdf2_version : <<"2.0.0-0-g7076584">>
eleveldb_version : <<"2.0.0-3-g10fcb78">>
clique_version : <<"0.2.5-0-g3af4db8">>
bitcask_version : <<"1.7.0">>
basho_stats_version : <<"1.0.3">>
webmachine_version : <<"1.10.5-0-ge5f8233">>
mochiweb_version : <<"1.5.1p6">>
inets_version : <<"5.9.6">>
erlang_js_version : <<"1.3.0-0-g07467d8">>
runtime_tools_version : <<"1.8.12">>
os_mon_version : <<"2.2.13">>
riak_sysmon_version : <<"2.0.0">>
ssl_version : <<"5.3.1">>
public_key_version : <<"0.20">>
crypto_version : <<"3.1">>
asn1_version : <<"2.0.3">>
sasl_version : <<"2.3.3">>
lager_version : <<"2.0.3">>
syslog_version : <<"1.0.2">>
goldrush_version : <<"0.1.6">>
compiler_version : <<"4.9.3">>
syntax_tools_version : <<"1.6.11">>
stdlib_version : <<"1.19.3">>
kernel_version : <<"2.16.3">>

System that locks up:

sys_system_architecture : <<"x86_64-unknown-linux-gnu">>
sys_system_version : <<"Erlang R16B02_basho6 (erts-5.10.3) [source-bcd8abb] [64-bit] [smp:24:24] [async-threads:64] [kernel-poll:true] [frame-pointer]">>
riak_auth_mods_version : <<"2.0.0-1-g9ae39fe">>
erlydtl_version : <<"0.7.0">>
riak_control_version : <<"2.0.0-1-gadc50dc">>
cluster_info_version : <<"2.0.0-0-g1fae829">>
yokozuna_version : <<"2.0.0-24-g1e4dba8">>
ibrowse_version : <<"4.0.1">>
riak_search_version : <<"2.0.0-7-g86bfb89">>
merge_index_version : <<"2.0.0-0-gb701dde">>
riak_kv_version : <<"2.0.2-30-g23de41a">>
riak_api_version : <<"2.0.1-6-g618cab1">>
riak_pb_version : <<"2.0.0.16-0-gfc18a9b">>
protobuffs_version : <<"0.8.1p4-0-g5257dfe">>
riak_dt_version : <<"2.0.0-11-g38797a0">>
sidejob_version : <<"2.0.0-0-gc5aabba">>
riak_pipe_version : <<"2.0.0-8-g4662392">>
riak_core_version : <<"2.0.1-93-g0562e0b">>
exometer_core_version : <<"1.0.0-basho1-0-ga46aa8b">>
poolboy_version : <<"0.8.1p2-0-g84d836a">>
pbkdf2_version : <<"2.0.0-0-g7076584">>
eleveldb_version : <<"2.0.0-3-g10fcb78">>
clique_version : <<"0.2.4-0-g14a0966">>
bitcask_version : <<"1.7.0">>
basho_stats_version : <<"1.0.3">>
webmachine_version : <<"1.10.5-0-ge5f8233">>
mochiweb_version : <<"1.5.1p6">>
inets_version : <<"5.9.6">>
erlang_js_version : <<"1.3.0-0-g07467d8">>
runtime_tools_version : <<"1.8.12">>
os_mon_version : <<"2.2.13">>
riak_sysmon_version : <<"2.0.0">>
ssl_version : <<"5.3.1">>
public_key_version : <<"0.20">>
crypto_version : <<"3.1">>
asn1_version : <<"2.0.3">>
sasl_version : <<"2.3.3">>
lager_version : <<"2.0.3">>
syslog_version : <<"1.0.2">>
goldrush_version : <<"0.1.6">>
compiler_version : <<"4.9.3">>
syntax_tools_version : <<"1.6.11">>
stdlib_version : <<"1.19.3">>
kernel_version : <<"2.16.3">>
macintux commented 8 years ago

Thanks for this report, and the detail. I know we've been a bit wary about HiPE; when I saw it mentioned I fully expected the failing node to be using it.

https://github.com/basho/otp/pull/3 is the only change between basho5 and basho6.

tburghart commented 8 years ago

Empirically, HiPE _seems_ to be ok on 64-bit Linux, but I'm not going to go out on a limb and recommend it - the problem is that if it breaks, it won't generally do it until the system is pretty well loaded, so the results can be catastrophic.

On careful review of the above, there are version differences in several key components, so they can't both be from the 2.0.4 tag. Are these Basho release packages?

cread commented 8 years ago

The packages that are giving me problems are the Basho released .deb packages available from http://docs.basho.com/riak/latest/downloads/

We've been running the HiPE-enabled build in production since late January this year. I built it from the 2.0.4 tarball downloaded from the above URL, following the Installing from Source instructions for 2.0.0.

cread commented 8 years ago

I've just done a test with my own build of 2.1.1 from source. The only difference between this build and the .deb packages is that my build has HiPE enabled. We see the same behavior.

Just in case it matters, here are the properties for the bucket we're writing to:

{
  "props": {
    "young_vclock": 20,
    "write_once": false,
    "w": "quorum",
    "small_vclock": 50,
    "rw": "quorum",
    "r": 1,
    "last_write_wins": false,
    "dw": "quorum",
    "dvv_enabled": false,
    "chash_keyfun": {
      "fun": "chash_std_keyfun",
      "mod": "riak_core_util"
    },
    "big_vclock": 50,
    "basic_quorum": false,
    "allow_mult": false,
    "name": "r",
    "linkfun": {
      "fun": "mapreduce_linkfun",
      "mod": "riak_kv_wm_link_walker"
    },
    "n_val": 3,
    "notfound_ok": false,
    "old_vclock": 86400,
    "postcommit": [],
    "pr": 0,
    "precommit": [],
    "pw": 0
  }
}
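(The JSON above looks like the output of GET /buckets/<name>/props over HTTP; the same effective properties can be read through the Erlang PB client. A minimal sketch, assuming the standard riak-erlang-client (riakc) and a placeholder node address:)

%% Read the effective properties of bucket "r" over protocol buffers;
%% 10.1.1.1 is a placeholder address, 8087 the default PB port.
{ok, Pid} = riakc_pb_socket:start_link("10.1.1.1", 8087),
{ok, Props} = riakc_pb_socket:get_bucket(Pid, <<"r">>),
proplists:get_value(n_val, Props).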

I'm going to see if I can reproduce this on a single node now.

cread commented 8 years ago

I've been able to reproduce the problem using basho_bench on our test cluster. Here's the basho_bench configuration:

{mode, max}.                 % drive the cluster as fast as it will accept work
{duration, 1440}.            % minutes, i.e. a 24-hour run
{report_interval, 1}.        % report stats every second
{concurrent, 192}.           % concurrent worker processes
{driver, basho_bench_driver_riakc_pb}.
{key_generator, {int_to_bin_bigendian, {partitioned_sequential_int, 4294967296}}}.
{value_generator, {uniform_bin, 2048, 1048576}}.   % values between 2 KB and 1 MB
{riakc_pb_ips, [
        {10,1,1,1},
        {10,1,1,2},
        {10,1,1,3},
        {10,1,1,4},
        {10,1,1,5}
]}.
{riakc_pb_replies, 1}.
{operations, [{put, 1}]}.    % write-only workload
{pb_connect_options, [{auto_reconnect, true}]}.
{pb_timeout_general, 30000}.
{pb_timeout_read, 5000}.
{pb_timeout_write, 5000}.
{pb_timeout_listkeys, 50000}.
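(For anyone reproducing this without basho_bench: each {put, 1} operation the riakc_pb driver issues boils down to roughly the following. This is only a sketch against the standard riakc client, with a placeholder address, an illustrative bucket name, and the value size pinned at the generator's upper bound.)

%% One write, roughly equivalent to a single basho_bench {put, 1} op:
%% connect over PB, build an object with a 4-byte big-endian key and a
%% ~1 MB random value, then store it with default options.
{ok, Pid} = riakc_pb_socket:start_link("10.1.1.1", 8087),
Key = <<1:32/big>>,                          %% int_to_bin_bigendian-style key
Value = crypto:strong_rand_bytes(1048576),   %% value_generator upper bound
Obj = riakc_obj:new(<<"test">>, Key, Value), %% bucket name is illustrative
ok = riakc_pb_socket:put(Pid, Obj).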

Within 2 hours or so you'll see node_put_fsm_active start creeping up. After it passes 1000 or so you can stop the load and that value does not drop.

Here's the put_fsm stats from one of the nodes once the load has been removed:

late_put_fsm_coordinator_ack : 941
node_put_fsm_active : 626
node_put_fsm_active_60s : 0
node_put_fsm_counter_time_100 : 0
node_put_fsm_counter_time_95 : 0
node_put_fsm_counter_time_99 : 0
node_put_fsm_counter_time_mean : 0
node_put_fsm_counter_time_median : 0
node_put_fsm_in_rate : 0
node_put_fsm_map_time_100 : 0
node_put_fsm_map_time_95 : 0
node_put_fsm_map_time_99 : 0
node_put_fsm_map_time_mean : 0
node_put_fsm_map_time_median : 0
node_put_fsm_out_rate : 0
node_put_fsm_rejected : 0
node_put_fsm_rejected_60s : 0
node_put_fsm_rejected_total : 0
node_put_fsm_set_time_100 : 0
node_put_fsm_set_time_95 : 0
node_put_fsm_set_time_99 : 0
node_put_fsm_set_time_mean : 0
node_put_fsm_set_time_median : 0
node_put_fsm_time_100 : 0
node_put_fsm_time_95 : 0
node_put_fsm_time_99 : 0
node_put_fsm_time_mean : 0
node_put_fsm_time_median : 0
vnode_put_fsm_time_100 : 0
vnode_put_fsm_time_95 : 0
vnode_put_fsm_time_99 : 0
vnode_put_fsm_time_mean : 0
vnode_put_fsm_time_median : 0
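One way to tell whether those "active" FSMs are real live processes or just a counter that has drifted is to count them directly from riak attach. A minimal sketch, assuming the put FSMs are riak_kv_put_fsm processes started via proc_lib:

%% proc_lib records the real entry point under '$initial_call' in the
%% process dictionary, so this counts live riak_kv_put_fsm instances
%% regardless of the gen_fsm wrapper.
length([P || P <- erlang:processes(),
             case erlang:process_info(P, dictionary) of
                 {dictionary, D} ->
                     proplists:get_value('$initial_call', D)
                         =:= {riak_kv_put_fsm, init, 1};
                 _ ->
                     false
             end]).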

I'm now running this configuration against a single machine to see if I can reproduce it there.

cread commented 8 years ago

Ran the same script against a single node for 8 hours without being able to reproduce the lock-up, which I guess makes this a cluster problem.

cread commented 8 years ago

Bump.

tburghart commented 8 years ago

Hi @cread

I continue to believe that what you're seeing is not directly related to HiPE, but I'm wondering whether you're seeing the same behavior across versions with and without a HiPE-enabled VM. In particular, HiPE (when it works) does optimize some operations, so could this be related to being CPU-bound?

Are you seeing your CPUs maxed out (or nearly so) without HiPE, but with lower loads when HiPE's enabled?

cread commented 8 years ago

Correct, this is not HiPE-related, as I pointed out in my first comment on 11/11. I get the same bad behavior with Riak 2.1.1 whether I use the .deb package for Ubuntu 14.04 as built by Basho (no HiPE) or a build I did from the source tarball with HiPE. The CPUs do not max out with either build.

Have you been able to reproduce the lockup using the basho_bench config I provided earlier?

tburghart commented 8 years ago

@cread, can you confirm whether your kernel is tuned in accordance with the "Virtual Memory" and "Schedulers" sections of our System Performance Tuning docs?

If it's not, can you try those settings? Given the symptoms you describe (which we're working to reproduce), it's clearly not the direct problem, but I'd like to eliminate (or identify) it as a contributor that might be driving the node into the state you're seeing.

cread commented 8 years ago

@tburghart I can confirm we're tuned as per the doc.

tburghart commented 8 years ago

Thanks, just trying to cover all the bases.

cread commented 8 years ago

@tburghart have you guys been able to reproduce this yet?

cread commented 8 years ago

Ping! Anyone there? Anyone looking at this?

macintux commented 8 years ago

One of our engineers attempted to reproduce and failed, but he wanted to try again more rigorously.

Unfortunately he’s been extremely busy. I hoped he’d have time this week, but he’s been out sick.

-John

cread commented 8 years ago

Thanks John! Let me know if you need any more help or info from me in getting to the bottom of this...