basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

CentOS 7 Riak 2.9.0p3 crash dump #986

Closed mckenziec closed 4 years ago

mckenziec commented 4 years ago

Version: 2.9.0p3

I wanted to give Riak KV a try, but I haven't been able to start the server/console following the instructions here: https://docs.riak.com/riak/kv/2.2.3/setup/installing/rhel-centos.1.html https://docs.riak.com/riak/kv/2.2.3/setup/installing/verify/#starting-a-riak-node

Steps:

Start with a reasonable CentOS 7:

# in term1
$ curl -s https://packagecloud.io/install/repositories/basho/riak/script.rpm.sh | sudo bash
$ sudo yum install riak
$ sudo riak start

# switch to term2
$ sudo riak console
config is OK
-config /var/lib/riak/generated.configs/app.2019.08.29.13.23.38.config -args_file /var/lib/riak/generated.configs/vm.2019.08.29.13.23.38.args -vm_args /var/lib/riak/generated.configs/vm.2019.08.29.13.23.38.args
!!!!
!!!! WARNING: ulimit -n is 1024; 65536 is the recommended minimum.
!!!!
Exec:  /usr/lib64/riak/erts-5.10.3/bin/erlexec -boot /usr/lib64/riak/releases/2.2.3/riak               -config /var/lib/riak/generated.configs/app.2019.08.29.13.23.38.config -args_file /var/lib/riak/generated.configs/vm.2019.08.29.13.23.38.args -vm_args /var/lib/riak/generated.configs/vm.2019.08.29.13.23.38.args              -pa /usr/lib64/riak/lib/basho-patches -- console
Root: /usr/lib64/riak
{error_logger,{{2019,8,29},{13,23,39}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]}
{error_logger,{{2019,8,29},{13,23,39}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.246>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,27},{reductions,751}],[]]}
{error_logger,{{2019,8,29},{13,23,39}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['riak@127.0.0.1',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2019,8,29},{13,23,39}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2019,8,29},{13,23,39}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0>},{registered_name,[]},{error_info,{exit,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,133}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[<0.8.0>]},{messages,[{'EXIT',<0.10.0>,normal}]},{links,[<0.8.0>,<0.7.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,27},{reductions,164}],[]]}
{error_logger,{{2019,8,29},{13,23,39}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}}"}

Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{k

# back in term1
!!!!
!!!! WARNING: ulimit -n is 1024; 65536 is the recommended minimum.
!!!!
riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.

I have no clue what the problem might be. The crash dump is attached.

erl_crash.dump.txt

martinsumner commented 4 years ago

You need to raise the open files limit (which is where the ulimit warning is coming from) - https://docs.riak.com/riak/kv/2.1.4/using/performance/open-files-limit/
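For reference, on CentOS 7 this is typically done via PAM limits and/or a systemd drop-in. A sketch, assuming the standard paths and that the RPM package runs the node as the riak user (check with ps if unsure):

```shell
# One-off, for the current root shell only:
ulimit -n 65536

# Persist for the riak user via PAM limits:
cat <<'EOF' | sudo tee /etc/security/limits.d/riak.conf
riak soft nofile 65536
riak hard nofile 65536
EOF

# If Riak is started through systemd, PAM limits are bypassed;
# set the limit on the unit instead:
sudo mkdir -p /etc/systemd/system/riak.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/riak.service.d/limits.conf
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart riak
```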

Note that when you run riak console after you have run riak start, it will try to start Riak again (riak console is just a way of starting Riak when troubleshooting: it starts the node and writes logs to the terminal rather than to file). So I think riak console may be failing because you already have Riak started, and the second node can't register the same listener twice - hence the "Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close] error.

Before you start riak (using either start or console), make sure riak isn't already running on that node.
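That check can be scripted. riak ping exits 0 when the local node responds and non-zero otherwise, so a minimal sketch (drop the sudo if your user owns the node) would be:

```shell
# Make sure a node isn't already running before starting another one,
# otherwise the second node fails to register with epmd.
if riak ping >/dev/null 2>&1; then
    riak stop          # stop the running node first
fi
riak console           # logs now go to the terminal instead of /var/log/riak
```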

martinsumner commented 4 years ago

@mckenziec - I've not heard back, is all OK with this? Will close tomorrow if there is nothing further.

mckenziec commented 4 years ago

Sorry about the late reply, Martin. I realize you replied very quickly, but things came up and distracted me, and I ended up moving on and haven't gone back to test your suggestion. I would suggest including the ulimit tuning in the Linux install instructions if it's not merely a performance tweak but actually an installation requirement.

I will come back to Riak and take another look though. My initial look was motivated by etcd's inability to function in a degraded cluster: a failure that leaves only 1 node prevents anyone from accessing KV data from it, because it goes into a Raft election dead spiral. E.g. you need a minimum of 3 nodes to survive losing 1 node, but if it's reversed and 2 nodes go down, the 1 remaining node can't function at all. I ended up implementing my own limited KV forwarding service that doesn't attempt to deal with locking or value versioning; I just wanted to distribute node-unique KV data. I know, TMI. Thanks.
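The quorum arithmetic behind that observation is just the Raft majority rule, which a quick shell loop illustrates (this is the general formula, not etcd-specific code):

```shell
# Raft needs a majority to elect a leader: quorum = floor(n/2) + 1.
# A cluster of n nodes therefore tolerates n - quorum failures,
# so 3 nodes survive 1 loss, but the 1 survivor of 2 losses is stuck.
for n in 1 2 3 4 5; do
    quorum=$(( n / 2 + 1 ))
    echo "n=$n quorum=$quorum tolerates=$(( n - quorum ))"
done
```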