basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0
3.95k stars 537 forks source link

Riak 2.0.pre20: high CPU usage on idle #527

Closed oleksiyk closed 10 years ago

oleksiyk commented 10 years ago

I'm running a clean install of 5 node Riak 2.0.pre20 cluster on a physical servers with intel xeon hexacore and experiencing constant load on CPU by beam.smp (on all nodes). The load is 7-25%. Server (and Riak) is absolutely idle.

riak-admin top interval 2:

===============================================================================================================================
 'riak@10.0.1.1'                                                           07:27:40
 Load:  cpu         0               Memory:  total       47946    binary       4826
        procs    1094                        processes   13616    code        12701
        runq        0                        atom          541    ets          6190

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.740.0>        riak_control_session          '-'    8279201     372256          0 gen_server:loop/6
<6315.3.0>          erl_prim_loader               '-'    2876824     142624          0 erl_prim_loader:loop/3
<6315.102.0>        erlang:apply/2                '-'    2158227       2600          0 cpu_sup:measurement_server_loop/1
<6315.188.0>        riak_core_vnode_manager       '-'    1863177     234104          0 gen_server:loop/6
<6315.91.0>         riak_sysmon_filter            '-'     951722       5760          0 gen_server:loop/6
<6315.191.0>        riak_core_capability          '-'     816780      55016          0 gen_server:loop/6
<6315.165.0>        riak_core_ring_manager        '-'     794032     230360          0 gen_server:loop/6
<6315.25.0>         code_server                   '-'     695960     142688          0 code_server:loop/1
<6315.7.0>          application_controller        '-'     529188     143728          0 gen_server:loop/6
<6315.0.0>          init                          '-'     349978       8712          0 init:loop/1

===============================================================================================================================
 'riak@10.0.1.1'                                                           07:27:41
 Load:  cpu         0               Memory:  total       47894    binary       4826
        procs    1094                        processes   13337    code        12701
        runq        0                        atom          541    ets          6190

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.740.0>        riak_control_session          '-'      26194     372256          0 gen_server:loop/6
<6315.102.0>        erlang:apply/2                '-'       4702       2600          0 cpu_sup:measurement_server_loop/1
<6315.91.0>         riak_sysmon_filter            '-'       2221       3888          0 gen_server:loop/6
<6315.7.0>          application_controller        '-'        875      89544          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_cache          '-'        665      55016          0 gen_server:loop/6
<6315.432.0>        riak_kv_put_fsm_sj_stats      '-'        368       5720          0 gen_server:loop/6
<6315.448.0>        riak_kv_get_fsm_sj_stats      '-'        366       5720          0 gen_server:loop/6
<6315.464.0>        riak_kv_stat_sj_stats         '-'        360       5720          0 gen_server:loop/6
<6315.12.0>         rex                           '-'        303      13624          0 gen_server:loop/6
<6315.100.0>        erlang:apply/2                '-'        244       2640          0 memsup:port_idle/1

===============================================================================================================================
 'riak@10.0.1.1'                                                           07:27:43
 Load:  cpu         0               Memory:  total       47923    binary       4826
        procs    1094                        processes   13367    code        12701
        runq        0                        atom          541    ets          6191

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.102.0>        erlang:apply/2                '-'       4702       2600          0 cpu_sup:measurement_server_loop/1
<6315.7.0>          application_controller        '-'        865     143728          0 gen_server:loop/6
<6315.464.0>        riak_kv_stat_sj_stats         '-'        769       8736          0 gen_server:loop/6
<6315.432.0>        riak_kv_put_fsm_sj_stats      '-'        740       5720          0 gen_server:loop/6
<6315.448.0>        riak_kv_get_fsm_sj_stats      '-'        740       5720          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_cache          '-'        522      34320          0 gen_server:loop/6
<6315.212.0>        riak_core_stat_calc_sup       '-'        306       5960          0 gen_server:loop/6
<6315.91.0>         riak_sysmon_filter            '-'        215       3888          0 gen_server:loop/6
<6315.372.0>        riak_api_pb_sup               '-'        142       2704          0 gen_server:loop/6
<6315.101.0>        cpu_sup                       '-'         67       2704          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           07:27:44
 Load:  cpu         0               Memory:  total       47894    binary       4826
        procs    1094                        processes   13336    code        12701
        runq        0                        atom          541    ets          6191

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.740.0>        riak_control_session          '-'      26194     372256          0 gen_server:loop/6
<6315.102.0>        erlang:apply/2                '-'       2351       2600          0 cpu_sup:measurement_server_loop/1
<6315.91.0>         riak_sysmon_filter            '-'       2236       5760          0 gen_server:loop/6
<6315.7.0>          application_controller        '-'        668      89544          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_cache          '-'        461      55016          0 gen_server:loop/6
<6315.448.0>        riak_kv_get_fsm_sj_stats      '-'        383       8736          0 gen_server:loop/6
<6315.432.0>        riak_kv_put_fsm_sj_stats      '-'        368       5720          0 gen_server:loop/6
<6315.464.0>        riak_kv_stat_sj_stats         '-'        359       5720          0 gen_server:loop/6
<6315.12.0>         rex                           '-'        303      13624          0 gen_server:loop/6
<6315.212.0>        riak_core_stat_calc_sup       '-'        243       5960          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           07:27:46
 Load:  cpu         0               Memory:  total       47878    binary       4826
        procs    1094                        processes   13322    code        12701
        runq        0                        atom          541    ets          6191

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.188.0>        riak_core_vnode_manager       '-'      12674     234104          0 gen_server:loop/6
<6315.191.0>        riak_core_capability          '-'       6106      55016          0 gen_server:loop/6
<6315.102.0>        erlang:apply/2                '-'       4702       2600          0 cpu_sup:measurement_server_loop/1
<6315.492.0>        riak_kv_entropy_manager       '-'       1151      36192          0 gen_server:loop/6
<6315.7.0>          application_controller        '-'        882     143728          0 gen_server:loop/6
<6315.192.0>        riak_core_gossip              '-'        823      88504          0 gen_server:loop/6
<6315.464.0>        riak_kv_stat_sj_stats         '-'        770       8736          0 gen_server:loop/6
<6315.448.0>        riak_kv_get_fsm_sj_stats      '-'        747       8736          0 gen_server:loop/6
<6315.432.0>        riak_kv_put_fsm_sj_stats      '-'        740       5720          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_cache          '-'        495      34320          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           07:27:47
 Load:  cpu         0               Memory:  total       47846    binary       4826
        procs    1094                        processes   13287    code        12701
        runq        0                        atom          541    ets          6191

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.740.0>        riak_control_session          '-'      26194     372256          0 gen_server:loop/6
<6315.102.0>        erlang:apply/2                '-'       2351       2600          0 cpu_sup:measurement_server_loop/1
<6315.91.0>         riak_sysmon_filter            '-'       2218       8776          0 gen_server:loop/6
<6315.7.0>          application_controller        '-'        668      89544          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_cache          '-'        478      55016          0 gen_server:loop/6
<6315.432.0>        riak_kv_put_fsm_sj_stats      '-'        368       5720          0 gen_server:loop/6
<6315.448.0>        riak_kv_get_fsm_sj_stats      '-'        365       5720          0 gen_server:loop/6
<6315.464.0>        riak_kv_stat_sj_stats         '-'        360       5720          0 gen_server:loop/6
<6315.12.0>         rex                           '-'        334      13624          0 gen_server:loop/6
<6315.212.0>        riak_core_stat_calc_sup       '-'        243       5960          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           07:27:49
 Load:  cpu         0               Memory:  total       47884    binary       4826
        procs    1094                        processes   13325    code        12701
        runq        0                        atom          541    ets          6192

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.102.0>        erlang:apply/2                '-'       4702       2600          0 cpu_sup:measurement_server_loop/1
<6315.7.0>          application_controller        '-'        875     143728          0 gen_server:loop/6
<6315.464.0>        riak_kv_stat_sj_stats         '-'        769       8736          0 gen_server:loop/6
<6315.432.0>        riak_kv_put_fsm_sj_stats      '-'        740       5720          0 gen_server:loop/6
<6315.448.0>        riak_kv_get_fsm_sj_stats      '-'        724       8736          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_cache          '-'        478      34320          0 gen_server:loop/6
<6315.212.0>        riak_core_stat_calc_sup       '-'        216       5960          0 gen_server:loop/6
<6315.91.0>         riak_sysmon_filter            '-'        206       8776          0 gen_server:loop/6
<6315.372.0>        riak_api_pb_sup               '-'         96       2704          0 gen_server:loop/6
<6315.101.0>        cpu_sup                       '-'         67       2704          0 gen_server:loop/6
^C

There are no errors in console.log or error.log Is that normal?

riak.conf:

log.console = file

log.console.level = info

log.console.file = /var/log/riak/console.log

log.error.file = /var/log/riak/error.log

log.syslog = off

log.crash = on

log.crash.file = /var/log/riak/crash.log

log.crash.maximum_message_size = 64KB

log.crash.size = 10MB

log.crash.rotation = $D0

log.crash.rotation.keep = 5

nodename = riak@10.0.1.1

distributed_cookie = riak

erlang.async_threads = 16

erlang.max_ports = 65536

erlang.distribution.port_range.minimum = 6000
erlang.distribution.port_range.maximum = 7999

erlang.distribution_buffer_size = 67108864

ring_size = 128

ring.state_dir = /var/lib/riak/ring

ssl.certfile = /etc/ssl/STAR__com.crt

ssl.keyfile = /etc/ssl/STAR__com.key

ssl.cacertfile = /etc/ssl/STAR__com.ca-bundle

dtrace = off

listener.http.internal = 127.0.0.1:8098

listener.protobuf.internal = 127.0.0.1:8087

listener.https.internal = 127.0.0.1:8069

anti_entropy = active

storage_backend = leveldb

object.format = 1

metadata_cache_size = off

object.size.warning_threshold = 5MB

object.size.maximum = 50MB

object.siblings.warning_threshold = 25

object.siblings.maximum = 100

bitcask.data_root = /var/lib/riak/bitcask

bitcask.io_mode = erlang

riak_control = on

riak_control.auth.mode = off

riak_control.auth.user.user.password = pass

leveldb.data_root = /var/lib/riak/leveldb

leveldb.maximum_memory.percent = 80

leveldb.compaction.trigger.tombstone_count = 1000

search = off

search.solr.start_timeout = 30s

search.solr.port = 8093

search.solr.jmx_port = 8985

search.solr.jvm_options = -Xms1g -Xmx1g -XX:+UseStringCache -XX:+UseCompressedOops

search.anti_entropy.data_dir = /var/lib/riak/yz_anti_entropy

search.root_dir = /var/lib/riak/yz
oleksiyk commented 10 years ago

Disabling riak_control=off helped a bit, but still the load is constantly 3-20%:

===============================================================================================================================
 'riak@10.0.1.1'                                                           08:35:45
 Load:  cpu         0               Memory:  total       47623    binary       4864
        procs    1093                        processes   13267    code        12629
        runq        0                        atom          541    ets          6226

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.104.0>        erlang:apply/2                '-'    7170561       2600          0 cpu_sup:measurement_server_loop/1
<6315.191.0>        riak_core_vnode_manager       '-'    4730615     234104          0 gen_server:loop/6
<6315.93.0>         riak_sysmon_filter            '-'    3282618       5760          0 gen_server:loop/6
<6315.3.0>          erl_prim_loader               '-'    2781141     142624          0 erl_prim_loader:loop/3
<6315.194.0>        riak_core_capability          '-'    2139023      55016          0 gen_server:loop/6
<6315.165.0>        riak_core_ring_manager        '-'    1666166     230360          0 gen_server:loop/6
<6315.7.0>          application_controller        '-'    1626588      89544          0 gen_server:loop/6
<6315.485.0>        riak_kv_stat_sj_stats         '-'    1153005       8736          0 gen_server:loop/6
<6315.453.0>        riak_kv_put_fsm_sj_stats      '-'    1142398       5720          0 gen_server:loop/6
<6315.469.0>        riak_kv_get_fsm_sj_stats      '-'    1139338       5720          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           08:35:46
 Load:  cpu         0               Memory:  total       47668    binary       4864
        procs    1093                        processes   13089    code        12629
        runq        0                        atom          541    ets          6226

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.104.0>        erlang:apply/2                '-'       2351       2600          0 cpu_sup:measurement_server_loop/1
<6315.93.0>         riak_sysmon_filter            '-'       2284       5760          0 gen_server:loop/6
<6315.7.0>          application_controller        '-'        662     143728          0 gen_server:loop/6
<6315.485.0>        riak_kv_stat_sj_stats         '-'        385       8736          0 gen_server:loop/6
<6315.453.0>        riak_kv_put_fsm_sj_stats      '-'        382       8736          0 gen_server:loop/6
<6315.469.0>        riak_kv_get_fsm_sj_stats      '-'        361       5720          0 gen_server:loop/6
<6315.210.0>        riak_core_stat_cache          '-'        189      55016          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_calc_sup       '-'        153       5960          0 gen_server:loop/6
<6315.390.0>        riak_api_pb_sup               '-'         46       2704          0 gen_server:loop/6
<6315.94.0>         timer_server                  '-'         44       2784          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           08:35:48
 Load:  cpu         0               Memory:  total       47477    binary       4864
        procs    1093                        processes   12897    code        12629
        runq        0                        atom          541    ets          6226

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.191.0>        riak_core_vnode_manager       '-'      13515     146432          0 gen_server:loop/6
<6315.194.0>        riak_core_capability          '-'       6106      55016          0 gen_server:loop/6
<6315.104.0>        erlang:apply/2                '-'       4702       2600          0 cpu_sup:measurement_server_loop/1
<6315.7.0>          application_controller        '-'        871      89544          0 gen_server:loop/6
<6315.195.0>        riak_core_gossip              '-'        827      88504          0 gen_server:loop/6
<6315.469.0>        riak_kv_get_fsm_sj_stats      '-'        775       8736          0 gen_server:loop/6
<6315.210.0>        riak_core_stat_cache          '-'        750      55016          0 gen_server:loop/6
<6315.453.0>        riak_kv_put_fsm_sj_stats      '-'        740       5720          0 gen_server:loop/6
<6315.485.0>        riak_kv_stat_sj_stats         '-'        729       8736          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_calc_sup       '-'        332       5960          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           08:35:49
 Load:  cpu         0               Memory:  total       47523    binary       4865
        procs    1093                        processes   12916    code        12629
        runq        0                        atom          541    ets          6226

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.104.0>        erlang:apply/2                '-'       2351       2600          0 cpu_sup:measurement_server_loop/1
<6315.93.0>         riak_sysmon_filter            '-'       2294       5760          0 gen_server:loop/6
<6315.7.0>          application_controller        '-'        672     143728          0 gen_server:loop/6
<6315.485.0>        riak_kv_stat_sj_stats         '-'        394       8736          0 gen_server:loop/6
<6315.469.0>        riak_kv_get_fsm_sj_stats      '-'        363       5720          0 gen_server:loop/6
<6315.453.0>        riak_kv_put_fsm_sj_stats      '-'        358       8736          0 gen_server:loop/6
<6315.210.0>        riak_core_stat_cache          '-'        189      55016          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_calc_sup       '-'        179       5960          0 gen_server:loop/6
<6315.94.0>         timer_server                  '-'         63       2784          0 gen_server:loop/6
<6315.390.0>        riak_api_pb_sup               '-'         46       2704          0 gen_server:loop/6

===============================================================================================================================
 'riak@10.0.1.1'                                                           08:35:51
 Load:  cpu         0               Memory:  total       47476    binary       4864
        procs    1093                        processes   12896    code        12629
        runq        0                        atom          541    ets          6226

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6315.104.0>        erlang:apply/2                '-'       4702       2600          0 cpu_sup:measurement_server_loop/1
<6315.7.0>          application_controller        '-'        871      89544          0 gen_server:loop/6
<6315.453.0>        riak_kv_put_fsm_sj_stats      '-'        761       8736          0 gen_server:loop/6
<6315.485.0>        riak_kv_stat_sj_stats         '-'        752       5720          0 gen_server:loop/6
<6315.210.0>        riak_core_stat_cache          '-'        750      55016          0 gen_server:loop/6
<6315.469.0>        riak_kv_get_fsm_sj_stats      '-'        748       8736          0 gen_server:loop/6
<6315.211.0>        riak_core_stat_calc_sup       '-'        332       5960          0 gen_server:loop/6
<6315.93.0>         riak_sysmon_filter            '-'        285       5760          0 gen_server:loop/6
<6315.390.0>        riak_api_pb_sup               '-'         96       2704          0 gen_server:loop/6
<6315.103.0>        cpu_sup                       '-'         67       2704          0 gen_server:loop/6
^C
russelldb commented 10 years ago

I wonder if it is stats? Could you turn up the stat calc TTL to something really long?

I can't find the setting in cuttlefish, so I guess it'll need to go in your advanced.config

[{riak_core, [{stat_cache_ttl, $SOME_NUMBER_OF_SECONDS}]}].

Make $SOME_NUMBER_OF_SECONDS big to make infrequent the background calculation of stats, and let me know what difference that makes, please?

Also, until we have a known issue, the mailing list[1] is the best place for this kind of thing, then the community at large benefits, there is general visibility. Assuming we then find an issue as a result, we can open it here. Please?

[1] riak users mailing list - http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

oleksiyk commented 10 years ago

Sorry about using github instead of mailing list, will use it next time!

I've added the following to advanced.config:

[
    {riak_core, [
    {stat_cache_ttl, 30}
    ]}
].

And it helped a bit, however I still see the CPU load spikes at about 20% each 5-10 seconds (not 30 as configured for stats ttl).

russelldb commented 10 years ago

Is the spike frequency and size the same whatever TTL setting you use? For example, change that TTL to 120, and is the behaviour the same?

oleksiyk commented 10 years ago

It seems it has settled by itself. The load by beam.smp now is constant 1% with spikes not over 5%

oleksiyk commented 10 years ago

Is it safe to set stat_cache_ttl to 30 secs in production?

russelldb commented 10 years ago

So that setting solves your problem? This suggests that there is an issue with stat calculation taking up too much resource.

What is a safe setting for the frequency of stat calculation depends on how often you consume them. A 30 second window might be too big for most users. Once a second is probably too frequent for most users too.

Tune it based on what is acceptable to you as a trade-off CPU usage vs. freshness of stats.

I think we should keep this issue open, though, one second frequency of stats calc shouldn't really cause constant CPU load, imo.

oleksiyk commented 10 years ago

Yes, once I set stat_cache_ttle to 1 second the CPU load by beam.smp is back to minimum 4-7% with spikes to 20-30%

russelldb commented 10 years ago

Hmm. Ok. Well, let's leave this open and hopefully we can look at it before 2.0RC.

Thanks.

oleksiyk commented 10 years ago

Yes, and riak-admin top shows cpu_sup:measurement_server_loop/1 at the top of each iteration

russelldb commented 10 years ago

Interesting. I'll look into it…though if you figure it out and send a patch/PR, I'll buy you a beverage of your choice.

michellep-basho commented 10 years ago

Hi Russell,

Do you have any updates on this?

thanks, Michelle

russelldb commented 10 years ago

nope

russelldb commented 10 years ago

@oleksiyk can you let me know what OS you are using, please?

russelldb commented 10 years ago

@oleksiyk I'm not seeing this on OS X. I'm running the latest beta. Would you mind trying the current beta, and reporting back, please?

oleksiyk commented 10 years ago

@russelldb OS is Ubuntu 14.04 I'm running riak_2.0.0beta1-1 stat_cache_ttl is set to 1 second CPU load by beam SMP is 4-7% (idle). I think that's totally acceptable.

russelldb commented 10 years ago

@oleksiyk so I can close this issue?

oleksiyk commented 10 years ago

Well I haven't noticed any slow down when putting real load on Riak so probably you can close this issue.

russelldb commented 10 years ago

Thank you