erlware / relx

Sane, simple release creation for Erlang
http://erlware.github.io/relx
Apache License 2.0

error: Failed to create cookie file '/.erlang.cookie': eacces #696

Closed benoitc closed 5 years ago

benoitc commented 5 years ago

As briefly discussed on Slack, here is the issue related to the Erlang cookie file. Upgrading to the latest rebar3 and the latest relx triggered the following error when making a release:

[]
"Failed to create cookie file '/.erlang.cookie': eacces"
{error_logger,error_msg}
#{label=>{proc_lib,crash},report=>[[{initial_call,{auth,init,['Argument__1']}},{pid,<0.58.0>},{registered_name,[]},{error_info,{error,"Failed to create cookie file '/.e
rlang.cookie': eacces",[{auth,init_cookie,0,[{file,"auth.erl"},{line,286}]},{auth,init,1,[{file,"auth.erl"},{line,140}]},{gen_server,init_it,2,[{file,"gen_server.erl"},
{line,374}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}},{ancestors,[net_sup,kernel_s
up,<0.46.0>]},{message_queue_len,0},{messages,[]},{links,[<0.56.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,987},{stack_size,27},{reductions,1584}
],[]]}
#{label=>{supervisor,start_error},report=>[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{"Failed to create cookie file '/.erlang.cookie': eacces",[{a
uth,init_cookie,0,[{file,"auth.erl"},{line,286}]},{auth,init,1,[{file,"auth.erl"},{line,140}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]},{gen_server,i
nit_it,6,[{file,"gen_server.erl"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}},{offender,[{pid,undefined},{id,auth},{mfargs,{auth,sta
rt_link,[]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
#{label=>{supervisor,start_error},report=>[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,auth,{"Failed to create c
ookie file '/.erlang.cookie': eacces",[{auth,init_cookie,0,[{file,"auth.erl"},{line,286}]},{auth,init,1,[{file,"auth.erl"},{line,140}]},{gen_server,init_it,2,[{file,"ge
n_server.erl"},{line,374}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}}}},{offender,[
{pid,undefined},{id,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
#{label=>{proc_lib,crash},report=>[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.45.0>},{registered_name,[
]},{error_info,{exit,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,auth,{"Failed to create cookie file '/.erlang.cookie': eacces",[{auth,in
it_cookie,0,[{file,"auth.erl"},{line,286}]},{auth,init,1,[{file,"auth.erl"},{line,140}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]},{gen_server,init_it
,6,[{file,"gen_server.erl"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}}}}},{kernel,start,[normal,[]]}},[{application_master,init,4,[
{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}},{ancestors,[<0.44.0>]},{message_queue_len,1},{messages,[
{'EXIT',<0.46.0>,normal}]},{links,[<0.44.0>,<0.43.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,987},{stack_size,27},{reductions,184}],[]]}
#{label=>{application_controller,exit},report=>[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,auth,{"Failed to
 create cookie file '/.erlang.cookie': eacces",[{auth,init_cookie,0,[{file,"auth.erl"},{line,286}]},{auth,init,1,[{file,"auth.erl"},{line,140}]},{gen_server,init_it,2,[
{file,"gen_server.erl"},{line,374}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}}}}},{
kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,auth,{\"Fa
iled to create cookie file '/.erlang.cookie': eacces\",[{auth,init_cookie,0,[{file,\"auth.erl\"},{line,286}]},{auth,init,1,[{file,\"auth.erl\"},{line,140}]},{gen_server
,init_it,2,[{file,\"gen_server.erl\"},{line,374}]},{gen_server,init_it,6,[{file,\"gen_server.erl\"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl\"},{l
ine,249}]}]}}}}},{kernel,start,[normal,[]]}}}"}

I believe this is due to edad2b498ad12ee2860a09f80e7862efadf0eff2. vm.args is pretty simple and only sets the cookie. The machines are VMware virtual machines running FreeBSD 11 with Erlang 21.2.

Hope it helps.

tsloughter commented 5 years ago

@benoitc have you verified that this does not happen with the old script when used with Erlang 21.2?

tsloughter commented 5 years ago

@djnym ping

deadtrickster commented 5 years ago

@tsloughter we were forced to update due to a hex.pm incompatibility. It worked before, AFAIK.

tsloughter commented 5 years ago

@deadtrickster ok

Does anyone have any ideas? :)

benoitc commented 5 years ago

@tsloughter IMO this is due to the use of $ROOTDIR instead of a temporary directory. Erlang has no write permission to it during the build, so it fails, or something like that.

tsloughter commented 5 years ago

I'm not sure what you mean. The script isn't writing this file; the change removed file creation that had been added at some point. Before a tmp dir was used to write the new nodetool, nothing was written at all, and this cookie issue did not occur.

Why would a cookie file be created at all when a cookie is provided in the arguments?

benoitc commented 5 years ago

True. The error happens at this step: https://github.com/erlang/otp/blob/master/lib/kernel/src/auth.erl#L286 so I believe the cookie argument is empty.

tsloughter commented 5 years ago

Can either of you add a line to print the vm.args file at the time of running (so after the replacement of OS vars) and see whether the setcookie is still there and correct?
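A minimal sketch of what such a debugging line could look like, assuming the caller passes in the path of the vm.args file after variable substitution. The helper name `show_setcookie` is hypothetical and is not part of the generated relx script.

```shell
# Hypothetical debugging helper (not part of relx): dump the effective
# vm.args to stderr and print any non-empty "-setcookie <value>" line.
# Returns 0 if such a line exists, 1 if not, 2 if the file is unreadable.
show_setcookie() {
    vmargs_file="$1"
    if [ ! -r "$vmargs_file" ]; then
        echo "cannot read $vmargs_file" >&2
        return 2
    fi
    echo "--- $vmargs_file ---" >&2
    cat "$vmargs_file" >&2
    # The pattern requires a value after -setcookie, so an empty
    # cookie argument is reported as missing.
    grep -E '^[[:space:]]*-setcookie[[:space:]]+[^[:space:]]+' "$vmargs_file"
}
```

Calling it right before the node is started would show whether the cookie survived the substitution step.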

tsloughter commented 5 years ago

I am stumped. I can't figure out where the hell this is coming from.

tsloughter commented 5 years ago

I just built a release with rebar3 3.8.0, which is before relx edad2b4, and I'm still getting the cookie file created.

What version of relx are you using?

djnym commented 5 years ago

I've built releases with the newer relx and not had issues, but maybe only with 18.3.x and 20.3.x. Is there a possibility this is a FreeBSD difference of some sort? Is there a docker image, docker file, or virtual machine image which we could use to recreate your issue @benoitc?

tsloughter commented 5 years ago

I get it on Linux as well, but seem to have the same cookie file created when using 3.8.0 as well.

tsloughter commented 5 years ago

And it is only an issue if writing to $HOME/.erlang.cookie fails. @djnym can you check if this file is created for you?

erikdahmen commented 5 years ago

The generated start script contains:

    # run a dummy distributed erlang node just to ensure that a cookie exists
    $ERTS_DIR/bin/erl -sname dummy -boot no_dot_erlang -noshell -eval "halt()"

This was not there when using rebar 3.9.0

There are more differences but this one looks suspicious

tsloughter commented 5 years ago

@erikdahmen yea, I manually modified that in the script and was still getting the file created. As well as tried it on 3.8.0.

Maybe I screwed something up when testing, so I'd like someone else to validate if they still get the file created as well.

djnym commented 5 years ago

Yes, it looks like .erlang.cookie is created for me using rebar3 3.9.0, but it's also created with rebar3 3.6.2 (which is the version we currently use), so I'm not sure what the issue is here. I do control the $HOME directory for my releases and ensure it's owned by the user running my service. I still think this is an issue with the setup and not with relx, but we'll have to wait to hear back from @benoitc more details.

tsloughter commented 5 years ago

@djnym thanks.

And yea, starting to think the same and this is unrelated to your change.

benoitc commented 5 years ago

Like I said on Slack, just downgrading relx and providing our own rebar3 was enough to fix the issue without any changes to the config (I will double-check this later today). Why was the latest change needed, by the way?

deadtrickster commented 5 years ago

Exactly, we are using this branch now: https://github.com/kobil-systems/rebar3/tree/kobil

tsloughter commented 5 years ago

Can you both verify that the .erlang.cookie file is not created by the releases you build with an earlier version?

The recent patch needed to be added because it was rewriting nodetool, which was a bad hack and required being able to write a file.

deadtrickster commented 5 years ago

@erikdahmen ^ could you please do that?

erikdahmen commented 5 years ago

I can confirm that, in general, .erlang.cookie also gets created by earlier versions, in the user's home directory.

However, we run our releases with users that have no home directory. rebar3 versions before 3.9.1 don't seem to care that the cookie can't be created. rebar3 3.9.1 tries to create it in /root which fails.

https://github.com/kobil-systems/rebar3/tree/3.9.1-kobil once again works for us.
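A rough way to check whether a given account is affected, written as a sketch: it only tests whether a cookie file under the home directory already exists and is readable, or could be created. The function name `cookie_file_usable` is hypothetical.

```shell
# Hypothetical diagnostic: succeed if <home>/.erlang.cookie is readable,
# or if it is absent but <home> is an existing, writable directory.
# With no argument it falls back to $HOME; an empty or unset HOME fails.
cookie_file_usable() {
    home_dir="${1:-${HOME:-}}"
    cookie="$home_dir/.erlang.cookie"
    if [ -e "$cookie" ]; then
        [ -r "$cookie" ]
    else
        [ -d "$home_dir" ] && [ -w "$home_dir" ]
    fi
}
```

Running `cookie_file_usable || echo "cookie file cannot be created"` as the service user should show whether the eacces error is to be expected.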

djnym commented 5 years ago

I'm going to agree with @erikdahmen that this commit really seems like the one that causes the issue: https://github.com/erlware/relx/commit/8d947fcadb3770f51c4aae73bc4a55ea979bc640. @erikdahmen, do you think you could try reversing the patch in that ticket and seeing if it causes the issue to go away?

djnym commented 5 years ago

FYI @benoitc here was the reason for the last PR I made https://github.com/erlware/relx/pull/649

djnym commented 5 years ago

Okay, so I ran a few tests and I'm still not sure what's happening. It seems that if you use '-sname' or '-name' without '-setcookie', it will write the cookie file to '$HOME/.erlang.cookie'. If $HOME is unwritable, it will fail with the eacces error.

    HOME=/tmp erl -sname dummy -boot no_dot_erlang -noshell -eval 'halt()'
    ls /tmp/.erlang.cookie
    /tmp/.erlang.cookie

But remove the cookie and run with '-setcookie', and you don't get the file:

    rm -f /tmp/.erlang.cookie
    HOME=/tmp erl -sname dummy -boot no_dot_erlang -noshell -eval 'halt()' -setcookie foo
    ls /tmp/.erlang.cookie
    ls: cannot access /tmp/.erlang.cookie: No such file or directory

The place where the '-sname dummy' call was added comes before the other cookie mangling, so it doesn't pick up anything from vm.args; it just starts and halts. I think it was meant to create the cookie file, but I'm not sure why that is needed. However, just reversing the earlier patch doesn't seem to fix the issue when a user has a non-writable '$HOME', because erl is invoked in other places without the -setcookie arg, such as in the relx_get_nodename function. I'm really not certain how it worked before: I can't make it stop caring about $HOME, and I specifically reset HOME myself in the wrapper around the generated nodetool that gets added to /etc/init.d/. So I'm still a bit stumped.
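For anyone wanting to try the same workaround, here is a sketch of how a wrapper might reset HOME before calling the release script. The function name `pick_writable_home` and the paths in the usage comment are hypothetical, and this is not the actual wrapper described above.

```shell
# Hypothetical wrapper helper: print the candidate directory if it exists
# and is writable, otherwise print the fallback directory.
pick_writable_home() {
    candidate="$1"
    fallback="$2"
    if [ -n "$candidate" ] && [ -d "$candidate" ] && [ -w "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        printf '%s\n' "$fallback"
    fi
}

# Possible use in an init script (paths are placeholders):
#   HOME=$(pick_writable_home "${HOME:-}" /var/lib/myapp)
#   export HOME
#   exec /opt/myapp/bin/myapp start
```

The fallback directory must itself be writable by the service user, which the helper does not verify.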

lrascao commented 5 years ago

My plugin's cron builds, which use rebar3's nightly build, also started failing when 3.9.1 got tagged (7 days ago, https://travis-ci.org/lrascao/rebar3_appup_plugin/builds/506115258), and https://github.com/erlware/relx/commit/8d947fcadb3770f51c4aae73bc4a55ea979bc640 also seems to be at the root of it. When I use a rebar3 with a local relx that has the mentioned commit removed, everything starts working again. I suggest we revert it and continue looking into the causes of this error.

tsloughter commented 5 years ago

ping @uwiger @tolbrino

lrascao commented 5 years ago

This may not be the root cause of what is being discussed here, but what I'm seeing in the cron build is a node failing to start when ./bin/<app> ping is run right after ./bin/<app> start. The node fails to start with the error:

    Protocol 'inet_tcp': the name dummy@<host> seems to be in use by another Erlang node

A small delay between the two and the error goes away.
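As a stopgap in a CI script, the delay can be expressed as a small retry loop. This is only a sketch of that workaround, not a fix for the underlying name collision, and `retry_with_delay` is a hypothetical helper.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failed attempts. Returns 0 on the first success,
# 1 if every attempt fails.
retry_with_delay() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Possible use after starting the node (release name is a placeholder):
#   ./bin/myapp start
#   retry_with_delay 5 1 ./bin/myapp ping
```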

tolbrino commented 5 years ago

> This may not be the root cause of what is being discussed here, but what I'm seeing in the cron build is a node failing to start when ./bin/<app> ping is run right after ./bin/<app> start. The node fails to start with the error: Protocol 'inet_tcp': the name dummy@<host> seems to be in use by another Erlang node. A small delay between the two and the error goes away.

I tried to address this in https://github.com/erlware/relx/pull/690

tolbrino commented 5 years ago

The changes from #678 are indeed what breaks things for @benoitc. However, I do consider trying to create that file a feature of relx, because it enables the creation of a separate class of releases without relying on provisioning tools.

To fix this, I'd move the cookie check code into the pre-start/pre-console phase, where it really matters, and make it optional in the sense that even if it fails, the rest of the procedure continues. I feel that relx can only do so much here, and handling all system error cases is too much.

If you agree I'll provide a PR.
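To illustrate the "optional" part, here is a minimal sketch of a best-effort step that warns and carries on when the cookie-creating command fails. The name `try_ensure_cookie` is hypothetical, and this is not the code from any relx PR.

```shell
# Hypothetical best-effort step: run the given cookie-creating command,
# print a warning on failure, and always return 0 so that the rest of the
# start procedure is not aborted.
try_ensure_cookie() {
    if "$@" >/dev/null 2>&1; then
        return 0
    fi
    echo "warning: could not ensure a cookie file exists; continuing" >&2
    return 0
}

# Possible use in the pre-start phase, with the dummy node command quoted
# earlier in this thread:
#   try_ensure_cookie "$ERTS_DIR/bin/erl" -sname dummy -boot no_dot_erlang \
#       -noshell -eval "halt()"
```

A node started with an explicit -setcookie would then be unaffected by an unwritable home directory, while other setups would still get a default cookie file when possible.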

djnym commented 5 years ago

@tolbrino sounds good to me.

erikdahmen commented 5 years ago

@djnym: I have manually undone the changes of https://github.com/erlware/relx/commit/8d947fcadb3770f51c4aae73bc4a55ea979bc640 in the start script and this solves the problem.

djnym commented 5 years ago

I guess the question for @tsloughter then, is do we revert 8d947fc and release a version, or do we wait for @tolbrino to send in another patch to hopefully fix it? I'd vote for the latter assuming @erikdahmen can work off his branch for a while and would be willing to help test @tolbrino's patch? Any other opinions?

tolbrino commented 5 years ago

@djnym I've adapted the existing PR https://github.com/erlware/relx/pull/690 . Although I'm still testing the Windows parts, you may already verify that the Unix parts are good.

erikdahmen commented 5 years ago

@djnym Yes, we're good for the moment. I'm also happy to retest, but it will be a few weeks before I can do that.

djnym commented 5 years ago

@tolbrino sounds good. @erikdahmen, if you are running on Unix of some form, try out @tolbrino's patch and see if it works. It would be great to catch any issues from the folks who run in the strict way you both seem to.

tolbrino commented 5 years ago

As a heads-up, the CI is still failing on the PR, which I'm looking into. However, that seems to be only indirectly related.