erlang / rebar3

Erlang build tool that makes it easy to compile and test Erlang applications and releases.
http://www.rebar3.org
Apache License 2.0

Cookie eacces with dist_node and no write permissions in $HOME #2183

Closed: Kuroneer closed this issue 4 years ago

Kuroneer commented 4 years ago

Rationale

When using Docker (with an uncommon setup) to run dist_node suites, the following error is generated:

=ERROR REPORT==== 28-Nov-2019::20:06:59.543616 ===
Failed to create cookie file '/.erlang.cookie': eacces
=SUPERVISOR REPORT==== 28-Nov-2019::20:06:59.543803 ===
    supervisor: {local,net_sup}
    errorContext: start_error
    reason: {"Failed to create cookie file '/.erlang.cookie': eacces",
             [{auth,init_cookie,0,[{file,"auth.erl"},{line,286}]},
              {auth,init,1,[{file,"auth.erl"},{line,140}]},
              {gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]},
              {gen_server,init_it,6,[{file,"gen_server.erl"},{line,342}]},
              {proc_lib,init_p_do_apply,3,
                        [{file,"proc_lib.erl"},{line,249}]}]}
    offender: [{pid,undefined},
               {id,auth},
               {mfargs,{auth,start_link,[]}},
               {restart_type,permanent},
               {shutdown,2000},
               {child_type,worker}]

The cause of this error is that the user running the tests inside Docker "does not exist", so its HOME has been artificially set to /.

Environment

Rebar3 report
 version 3.12.0
 generated at 2019-11-28T20:18:32+00:00
=================
Please submit this along with your issue at https://github.com/erlang/rebar3/issues (and feel free to edit out private information, if any)
-----------------
Task: ct
Entered as:
  ct
-----------------
Operating System: x86_64-pc-linux-gnu
ERTS: Erlang/OTP 22 [erts-10.5.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Root Directory: /usr/local/lib/erlang
Library directory: /usr/local/lib/erlang/lib
-----------------
Loaded Applications:
bbmustache: 1.6.1
certifi: 2.5.1
cf: 0.2.2
common_test: 1.18
compiler: 7.4.6
crypto: 4.6
cth_readable: 1.4.5
dialyzer: 4.1
edoc: 0.11
erlware_commons: 1.3.1
eunit: 2.3.8
eunit_formatters: 0.5.0
getopt: 1.0.1
hipe: 3.19.1
inets: 7.1
kernel: 6.5
providers: 1.8.1
public_key: 1.7
relx: 3.33.0
sasl: 3.4.1
snmp: 5.4.1
ssl_verify_fun: 1.1.5
stdlib: 3.10
syntax_tools: 2.2.1
tools: 3.2.1

-----------------
Escript path: /usr/local/bin/rebar3
Providers:
  app_discovery as clean compile compile cover ct deps dialyzer do edoc escriptize eunit get-deps help install install_deps list lock new path pkgs release relup report repos shell state tar tree unlock unlock_platform update upgrade upgrade upgrade upgrade_platform version warn_outdated_deps warn_outdated_deps_abort xref 

Unfortunately, I cannot provide the actual code I'm testing, but if requested, I'm sure I'll be able to create a project where this is seen.

Current behaviour

As explained in the first section, the issue occurs when creating the Erlang cookie:

(Relevant lines)

===> Provider: {default,lock}
===> Provider: {default,ct}
=ERROR REPORT==== 28-Nov-2019::20:27:08.108006 ===
Failed to create cookie file '/.erlang.cookie': eacces
=SUPERVISOR REPORT==== 28-Nov-2019::20:27:08.108237 ===
    supervisor: {local,net_sup}
    errorContext: start_error
    reason: {"Failed to create cookie file '/.erlang.cookie': eacces",
             [{auth,init_cookie,0,[{file,"auth.erl"},{line,286}]},
              {auth,init,1,[{file,"auth.erl"},{line,140}]},
              {gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]},
              {gen_server,init_it,6,[{file,"gen_server.erl"},{line,342}]},
              {proc_lib,init_p_do_apply,3,
                        [{file,"proc_lib.erl"},{line,249}]}]}
    offender: [{pid,undefined},
               {id,auth},
               {mfargs,{auth,start_link,[]}},
               {restart_type,permanent},
               {shutdown,2000},
               {child_type,worker}]
=CRASH REPORT==== 28-Nov-2019::20:27:08.108117 ===
  crasher:
    initial call: auth:init/1
    pid: <0.418.0>
    registered_name: []
    exception error: "Failed to create cookie file '/.erlang.cookie': eacces"
      in function  auth:init_cookie/0 (auth.erl, line 286)
      in call from auth:init/1 (auth.erl, line 140)
      in call from gen_server:init_it/2 (gen_server.erl, line 374)
      in call from gen_server:init_it/6 (gen_server.erl, line 342)
    ancestors: [net_sup,kernel_sup,<0.46.0>]
    message_queue_len: 0
    messages: []
    links: [<0.416.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 2735
  neighbours:

Expected behaviour

My expectation would be that, having included

        {dist_node, [ % To run distributed tests
            {setcookie, cookie},
            {sname, 'master@localhost'}
        ]},

there would be no need to write the cookie file, and rebar3 wouldn't fail to start in distributed mode.

Triage

When erl starts the distributed mode, if no cookie configuration is passed as CLI arg, erl creates the cookie file.

When rebar3 runs without these CLI options but with a dist_node configuration, it first starts distributed mode (which creates a cookie file with a random cookie) and only then sets the cookie specified in the config: https://github.com/erlang/rebar3/blob/b8f8f3e5d6047feb86d755cadfdbb03c6e0512a0/src/rebar_dist_utils.erl#L53-L61

When rebar3 ct is run with a HOME without write permissions, it first tries to create the cookie file (and fails), regardless of the dist_node config.
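
A rough pre-flight check for this condition, as a minimal shell sketch. The function name `cookie_preflight` and its three output labels are made up for illustration. It only approximates the behaviour described above (an existing cookie file is reused, otherwise one has to be created in $HOME) and is not how erl or rebar3 decide anything. Keep in mind that a root user passes the `-w` test almost everywhere.

```shell
#!/bin/sh
# cookie_preflight: hypothetical helper, not part of erl or rebar3.
# Prints one of: existing | creatable | not-creatable
cookie_preflight() {
    if [ -f "$HOME/.erlang.cookie" ]; then
        echo existing          # a cookie file is already there to be read
    elif [ -d "$HOME" ] && [ -w "$HOME" ]; then
        echo creatable         # the file could be created in $HOME
    else
        echo not-creatable     # creating it would likely fail (e.g. eacces)
    fi
}

cookie_preflight
```

Running it as the same user and with the same HOME as the test run shows which of the three cases applies before distributed mode is started.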

I can easily work around this for my flow by creating a fake HOME or allowing anyone to write anything inside the Docker container, but I was wondering if it would be worth the trouble creating a PR to fix this issue in rebar3 itself:

One solution would be to make the rebar3 escript run with the -nocookie flag. This would avoid creating any cookie file on net_kernel:start, allowing the cookie to be set later on. If I read the code correctly, this solution would require allowing custom emulator args in rebar_prv_escriptize, and adding the flag when escriptizing/running rebar3. It would also mean that no cookie file would ever be created by rebar3, with all the implications this has.

I'm not sure this is a big enough issue. What do you think?

Kuroneer commented 4 years ago

As an example of different commands (erl vs rebar3 shell):

$ HOME=/ erl -sname test@localhost -setcookie cookie
Erlang/OTP 22 [erts-10.5] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Eshell V10.5  (abort with ^G)
(test@localhost)1> 
User switch command
 --> q
$ HOME=/ erl -sname test@localhost                  
2019-11-29 02:06:21.155797 
    args: []
    format: "Failed to create cookie file '/.erlang.cookie': eacces"
<omitted>
Crash dump is being written to: erl_crash.dump...done
$ cd erlang-project
$ HOME=/ rebar3 shell --sname test@localhost --setcookie cookie
===> Verifying dependencies...                                                                                   
===> Compiling erlang-project
=ERROR REPORT==== 29-Nov-2019::02:06:33.480090 ===
Failed to create cookie file '/.erlang.cookie': eacces
<omitted>
Erlang/OTP 22 [erts-10.5] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Eshell V10.5  (abort with ^G)
1> 
User switch command
 --> q
$ cd erlang-project
$ HOME=/ ERL_FLAGS=' -nocookie' rebar3 shell --sname test@localhost --setcookie cookie
===> Verifying dependencies...
===> Compiling erlang-project
Erlang/OTP 22 [erts-10.5] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Eshell V10.5  (abort with ^G)
(test@localhost)1> 
User switch command
 --> q

ferd commented 4 years ago

I guess this is tangentially related to https://bugs.erlang.org/browse/ERL-476?jql=text%20~%20%22HOME%22 for the artificial / path being set, but this isn't the main issue here. I would probably point out to people reading the issue that this is likely only a problem for test nodes running distributed, as prod nodes would generally use the vm.args file to set a node name and a cookie explicitly (particularly if they use the templates) and so that would all be avoided.

Setting no cookie by default would probably be a problem for everyone who currently uses dist mode without specifying a cookie and relies on the global one instead: their setup would break, with nocookie being returned when trying to establish a connection. So I don't think defaulting to that is a safe option for now; it's probably going to break more builds than it fixes. Unless we had a way to modify start arguments after the VM has started, we can't do this conditionally either.

So your workaround is probably one of the few acceptable ways to go about this.
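
The "fake HOME" workaround can be sketched in a few lines of shell. This is only an illustration under assumptions: the `with_fake_home` wrapper and the `fake_home` variable are hypothetical names, and the scratch directory comes from `mktemp -d` rather than from anything rebar3 provides.

```shell
#!/bin/sh
# Sketch of the "fake HOME" workaround: run a command with HOME pointing at
# a throwaway writable directory so the VM can create .erlang.cookie there.
set -eu

fake_home="$(mktemp -d)"          # hypothetical scratch directory
trap 'rm -rf "$fake_home"' EXIT   # clean up when the script exits

# with_fake_home CMD [ARGS...]: hypothetical wrapper, runs CMD with HOME overridden.
with_fake_home() {
    HOME="$fake_home" "$@"
}

# In the scenario from this issue the call would be: with_fake_home rebar3 ct
with_fake_home sh -c 'echo "HOME is $HOME"'
```

Because the directory is removed on exit, every run gets a freshly generated cookie, which is fine for a single test run but not for nodes that need to share a cookie across runs.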

An alternative is possibly to have the cookie written to /tmp, which would require no special permissions; anyone in a container should be able to write there and by doing so you'd skip the eacces error. It requires setting HOME to point there, but would bypass most of the problem directly as far as I can tell. Another option would be to just pre-generate a cookie file in the container image. Cookies are not security anyway -- they're more to prevent accidentally connecting to a node, and most production builds I run just hardcode the name of the release as a cookie.
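
Pre-generating the cookie file could look roughly like the sketch below, for example as an image build step. The directory and the cookie value `cookie` are placeholders, not taken from any real setup. The mode is set to 400 on the assumption that the VM insists on the cookie file being accessible by its owner only.

```shell
#!/bin/sh
# Sketch: pre-generate an .erlang.cookie so the VM never has to create one.
set -eu

cookie_home="$(mktemp -d)"                 # placeholder; in an image this would be the user's HOME
cookie_file="$cookie_home/.erlang.cookie"

printf '%s' 'cookie' > "$cookie_file"      # placeholder cookie value
chmod 400 "$cookie_file"                   # owner read-only, no access for group/others

ls -l "$cookie_file"
```

The file has to be owned by the user that later runs the node, and HOME for that user has to point at the directory containing it.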

I'm a bit annoyed that there's no way to call set_cookie/2 on a node before the network is up. Since the call is based on init args (rather than some intermediary state) and is made at boot by the auth server, we can't easily work around it without patching OTP (https://github.com/erlang/otp/blob/d6285b0a347b9489ce939511ee9a979acd868f71/lib/kernel/src/auth.erl#L271-L274).


Actually right before posting my comment, I realized that we could maybe do something a bit devious here, but I think it would be relatively safe:

  1. start the rebar3 escript with -nocookie as an argument
  2. when starting dist mode, if a cookie is submitted through the config, write it in
  3. when starting dist mode, if no cookie is submitted, manually go read the .erlang.cookie file and insert it
  4. if the .erlang.cookie file does not exist, generate a random cookie name, still set it for the node, but try to write to the .erlang.cookie file while silently ignoring failures

This would, for all intents and purposes, preserve the existing semantics with the distinction that we do not hard-crash if $HOME is not writeable. It risks causing confusion in the case where someone is on a system where $HOME is read-only, the whole thing still boots, but all the individual nodes can't talk because they don't have matching cookies whereas if the filesystem was writeable, they would. There are no errors to explain why that is.

I'm not sure if it's that good of a failure mode since it's so subtle!
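
Steps 2 to 4 of the list above can be modelled in shell to make the fallback order concrete. This is a sketch of the decision logic only, not rebar3 code: `resolve_cookie` is a hypothetical helper, the 32-character hex cookie is an arbitrary choice, and step 1 (starting the escript with -nocookie) is assumed to have happened already.

```shell
#!/bin/sh
# resolve_cookie [CONFIGURED]: hypothetical helper sketching steps 2-4.
# Prints the cookie value a node would end up using.
resolve_cookie() {
    configured="${1:-}"
    cookie_file="$HOME/.erlang.cookie"
    if [ -n "$configured" ]; then
        # Step 2: a cookie given in the config wins.
        printf '%s\n' "$configured"
    elif [ -r "$cookie_file" ]; then
        # Step 3: otherwise reuse the existing cookie file.
        cat "$cookie_file"
    else
        # Step 4: generate a random cookie (32 hex characters here)...
        generated="$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')"
        # ...and try to persist it, silently ignoring any failure.
        ( umask 077; printf '%s' "$generated" > "$cookie_file" ) 2>/dev/null || true
        printf '%s\n' "$generated"
    fi
}

resolve_cookie cookie
```

The subtle failure mode shows up directly in this model: with a read-only or missing HOME and no configured cookie, every call takes the step 4 branch and returns a different value, so two nodes started that way would never share a cookie.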

Kuroneer commented 4 years ago

I'd like to point out that rebar3 does not stop running the tests; it just fails to set up the distributed mode.

I think that your steps are correct, but I'd remove step 2: if the cookie is provided, no file is read or written by erl (so the cookie would be expected to be set for the slave nodes manually).

If $HOME is not writable and no cookie is provided, a warning could be issued with something like "Could not write auto-generated cookie, either use '-nocookie' flag or explicitly set the cookie in the slave nodes".

What I'm wondering is whether this is worth maintaining, or whether it's better to just acknowledge the issue and provide workarounds.

ferd commented 4 years ago

Yeah I'm wondering the same thing. I think it would make sense to keep the existing failure mode if only because it's a better representation of what might happen in prod (if you don't provide CLI args, it will die for this reason when Rebar3 is out of the picture), but it makes for a more confusing experience for our tool itself.

Kuroneer commented 4 years ago

Ok, I'll close this issue then. Thank you!