benoitc closed this 5 years ago
@benoitc, have you verified that this does not happen with the old script when used with Erlang 21.2?
@djnym ping
@tsloughter we were forced to update due to a hex.pm incompatibility. It worked before, AFAIK.
@deadtrickster ok
Does anyone have any ideas? :)
@tsloughter IMO this is due to the use of $ROOTDIR instead of a temporary directory. Since Erlang has no write access to it during the build, it fails, or something along those lines.
I'm not sure what you mean. The script isn't writing this file. The change removed file creation that had been added at some point: before a tmp dir was used to write the new nodetool, no file was written at all, and this cookie issue did not exist.
Why would a cookie file be created at all when a cookie is provided in the arguments?
True. The error happens at https://github.com/erlang/otp/blob/master/lib/kernel/src/auth.erl#L286, so I believe the cookie argument is empty.
Can either of you add a line that prints the vm.args file at run time (i.e., after the OS variables have been replaced) and check whether -setcookie is still there and correct?
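One way to run that check without depending on any particular variable inside the generated start script is a small helper that prints whatever follows -setcookie in a given vm.args file. This is a minimal sketch: vmargs_cookie is a hypothetical name (not part of relx), and it assumes one option per line, as in a typical vm.args. An empty result would be consistent with the suspicion that the cookie argument is empty.

```shell
# Hypothetical diagnostic helper (not part of relx): print the value passed to
# -setcookie in a vm.args file, or nothing if the option is missing.
# Assumes one option per line, as in a typical vm.args.
vmargs_cookie() {
    # Comment lines never match, because their first field is not exactly
    # "-setcookie"; awk also ignores leading whitespace when splitting fields.
    awk '$1 == "-setcookie" { print $2; exit }' "$1"
}

# Example (path is illustrative): point it at the vm.args the start script
# actually uses, after OS variables have been replaced.
# vmargs_cookie ./releases/<vsn>/vm.args
```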
I am stumped. I can't figure out where the hell this is coming from.
I just built a release with rebar3 3.8.0, which is before relx edad2b4, and I'm still getting the cookie file created.
What version of relx are you using?
I've built releases with the newer relx and not had issues, but maybe only with 18.3.x and 20.3.x. Is there a possibility this is a FreeBSD difference of some sort? Is there a docker image, docker file, or virtual machine image which we could use to recreate your issue @benoitc?
I get it on Linux as well, but seem to have the same cookie file created when using 3.8.0 as well.
And it is only an issue if writing to $HOME/.erlang.cookie fails. @djnym can you check if this file is created for you?
The generated start script contains
# run a dummy distributed erlang node just to ensure that a cookie exists
$ERTS_DIR/bin/erl -sname dummy -boot no_dot_erlang -noshell -eval "halt()"
This was not there when using rebar3 3.9.0.
There are more differences, but this one looks suspicious.
@erikdahmen yeah, I manually modified that in the script and was still getting the file created. I also tried it on 3.8.0.
Maybe I screwed something up when testing, so I'd like someone else to validate if they still get the file created as well.
Yes, it looks like .erlang.cookie is created for me using rebar3 3.9.0, but it's also created with rebar3 3.6.2 (which is the version we currently use), so I'm not sure what the issue is here. I do control the $HOME directory for my releases and ensure it's owned by the user running my service. I still think this is an issue with the setup and not with relx, but we'll have to wait to hear more details from @benoitc.
@djnym thanks.
And yeah, I'm starting to think the same, and that this is unrelated to your change.
Like I said on Slack, just downgrading relx and providing our own rebar3 was enough to fix the issue without any config changes (I will double-check this later today). Why was the latest change needed, btw?
Exactly, we're using this branch now: https://github.com/kobil-systems/rebar3/tree/kobil
Can you both verify that the .erlang.cookie file is not created by the releases you build with an earlier version?
The recent patch needed to be added because it was rewriting nodetool, which was a bad hack and required being able to write a file.
@erikdahmen ^ could you please do that?
I can confirm that, in general, .erlang.cookie also gets created by earlier versions, in the user's home directory.
However, we run our releases with users that have no home directory. rebar3 versions before 3.9.1 don't seem to care that the cookie can't be created. rebar3 3.9.1 tries to create it in /root, which fails.
https://github.com/kobil-systems/rebar3/tree/3.9.1-kobil once again works for us.
I'm going to agree with @erikdahmen that this commit really seems to be the one that causes the issue: https://github.com/erlware/relx/commit/8d947fcadb3770f51c4aae73bc4a55ea979bc640
@erikdahmen, do you think you could try reverting that patch and see whether the issue goes away?
FYI @benoitc here was the reason for the last PR I made https://github.com/erlware/relx/pull/649
Okay, so after a few tests I'm still not sure what's happening. It seems that if you use '-sname' or '-name' without '-setcookie', erl writes the cookie file to '$HOME/.erlang.cookie'. If $HOME is unwritable, it fails with eacces.
$ HOME=/tmp erl -sname dummy -boot no_dot_erlang -noshell -eval 'halt()'
$ ls /tmp/.erlang.cookie
/tmp/.erlang.cookie
But if you remove the cookie and run with '-setcookie', the file is not created:
$ rm -f /tmp/.erlang.cookie
$ HOME=/tmp erl -sname dummy -boot no_dot_erlang -noshell -eval 'halt()' -setcookie foo
$ ls /tmp/.erlang.cookie
ls: cannot access /tmp/.erlang.cookie: No such file or directory
The place where '-sname dummy' was added comes before the other cookie handling, so it doesn't pick up anything from vm.args; it just starts and halts. I think it was meant to create the cookie file, but I'm not sure why that is needed. However, just reverting the earlier patch doesn't seem to fix the issue when a user has a non-writable '$HOME', because erl is invoked elsewhere without the setcookie argument, for example in the relx_get_nodename function. I'm really not certain how it worked before, as I can't seem to make it ignore $HOME, and I specifically reset HOME myself in my wrapper around the generated nodetool, which gets added to /etc/init.d/. So I'm still a bit stumped.
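Resetting HOME in a wrapper, as described above, could look like the guard below. It is a minimal sketch under stated assumptions: ensure_writable_home and the fallback directory are hypothetical, and it only matters for erl invocations that lack -setcookie and therefore fall back to $HOME/.erlang.cookie. The fallback is a fixed directory rather than a fresh temp dir so that the cookie written there stays the same between runs.

```shell
# Hypothetical wrapper helper: if HOME is unset, missing, or not writable,
# point it at a fixed fallback directory so erl can create .erlang.cookie there.
# Usage: ensure_writable_home /var/lib/myapp   (path is illustrative)
ensure_writable_home() {
    _fallback=$1
    if [ -z "${HOME:-}" ] || [ ! -d "$HOME" ] || [ ! -w "$HOME" ]; then
        # Create the fallback once; later runs reuse the same cookie file.
        mkdir -p "$_fallback" || return 1
        HOME=$_fallback
        export HOME
    fi
}
```

Calling it near the top of the init.d wrapper, before running the generated script, means every child erl process inherits the adjusted HOME.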
My plugin's cron builds using rebar3's nightly build also started failing when 3.9.1 was tagged (7 days ago, https://travis-ci.org/lrascao/rebar3_appup_plugin/builds/506115258), and https://github.com/erlware/relx/commit/8d947fcadb3770f51c4aae73bc4a55ea979bc640 also seems to be at the root of it. With a rebar3 that uses a local relx with that commit removed, everything starts working again. I suggest we revert it and keep looking into the causes of this error.
ping @uwiger @tolbrino
This may not be the root cause of what is being discussed here, but what I'm seeing in the cron build is a node failing to start when ./bin/<app> ping is run right after ./bin/<app> start. The node fails to start with the error:
Protocol 'inet_tcp': the name dummy@<host> seems to be in use by another Erlang node
Adding a small delay between the two makes the error go away.
I tried to address this in https://github.com/erlware/relx/pull/690
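Independently of that PR, the "small delay" workaround from the cron build can be made less timing-dependent by polling instead of sleeping a fixed amount. A minimal sketch, assuming a POSIX shell; retry is a hypothetical helper, not something relx ships, and it only papers over the race rather than fixing it in the start script.

```shell
# Hypothetical helper: run a command until it succeeds, at most $1 times,
# sleeping $2 seconds between attempts. Returns 1 if every attempt failed.
retry() {
    _attempts=$1
    _delay=$2
    shift 2
    _n=1
    while ! "$@"; do
        if [ "$_n" -ge "$_attempts" ]; then
            return 1
        fi
        _n=$((_n + 1))
        sleep "$_delay"
    done
    return 0
}

# Example (replace <app> with the release name):
# ./bin/<app> start && retry 10 1 ./bin/<app> ping
```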
The changes from #678 are indeed what breaks things for @benoitc. However, I do consider trying to create that file a feature of relx, because it enables a separate class of releases that don't rely on provisioning tools.
To fix this, I'd move the cookie check code into the pre-start/pre-console phase, where it really matters, and make it optional, in the sense that even if it fails, the rest of the procedure continues. I feel like relx can only do so much here, and handling every system error case is too much.
If you agree I'll provide a PR.
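A non-fatal pre-start cookie check along those lines might look roughly like the following. This is a sketch, not the actual change in PR 690: maybe_ensure_cookie is a hypothetical name, the cookie-creating command is passed in as arguments (for example the dummy erl run the generated start script already uses), and skipping the check when vm.args sets -setcookie is an added assumption based on the earlier observation that erl does not write .erlang.cookie in that case.

```shell
# Hypothetical pre-start step: try to make sure a cookie exists, but never abort.
# $1 is the vm.args path; the remaining arguments are the command that creates
# the cookie, e.g. the dummy node the generated start script runs:
#   maybe_ensure_cookie "$VMARGS" "$ERTS_DIR/bin/erl" -sname dummy \
#       -boot no_dot_erlang -noshell -eval "halt()"
maybe_ensure_cookie() {
    _vmargs=$1
    shift
    # If vm.args already passes -setcookie, erl does not need ~/.erlang.cookie.
    if awk '$1 == "-setcookie" { found = 1 } END { exit !found }' "$_vmargs"; then
        return 0
    fi
    # Otherwise attempt the check, but only warn on failure (e.g. unwritable $HOME).
    if ! "$@"; then
        echo "warning: could not ensure an Erlang cookie exists; continuing" >&2
    fi
    return 0
}
```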
@tolbrino sounds good to me.
@djnym: I manually reverted the changes from https://github.com/erlware/relx/commit/8d947fcadb3770f51c4aae73bc4a55ea979bc640 in the start script, and this solves the problem.
I guess the question for @tsloughter, then, is whether we revert 8d947fc and release a version, or wait for @tolbrino to send in another patch that hopefully fixes it. I'd vote for the latter, assuming @erikdahmen can work off his branch for a while and would be willing to help test @tolbrino's patch. Any other opinions?
@djnym I've adapted the existing PR https://github.com/erlware/relx/pull/690. I'm still testing the Windows parts, but you can already verify that the Unix parts work.
@djnym Yes, we're good for the moment. I'm also happy to retest, but it will be a few weeks before I can do that.
@tolbrino sounds good. @erikdahmen, if you are running on some form of Unix, try out @tolbrino's patch and see if it works. It would be great to catch any issues from the folks who run in the strict way you both seem to.
As a heads-up, CI is still failing on the PR, which I'm looking into. However, that seems to be only indirectly related.
As briefly discussed on Slack, here is the issue related to the Erlang cookie file. Upgrading to the latest rebar3 and the latest relx triggered the following issue when making a release:
I believe this is due to edad2b498ad12ee2860a09f80e7862efadf0eff2.
vm.args is pretty simple and only sets the cookie. The machines are VMware virtual machines running FreeBSD 11 with Erlang 21.2. Hope it helps.