erlang / rebar3

Erlang build tool that makes it easy to compile and test Erlang applications and releases.
http://www.rebar3.org
Apache License 2.0

epmd not starting properly on Windows 10 #2113

Closed KarlKeiser closed 4 years ago

KarlKeiser commented 5 years ago

Hi

When I run rebar3 eunit for the first time since starting the computer, I get the following output:

Attempting to start epmd...

=INFO REPORT==== 4-Jul-2019::14:26:59 ===
Protocol 'inet_tcp': register/listen error: econnrefused

I then have to kill erlang in the task manager. However, any time after that, there are no errors. My guess is that the epmd is the issue, because if I quit it with epmd.exe -kill the error reappears the next time I run rebar3. Likewise, if I launch the epmd manually before running rebar3 for the first time, the error doesn't happen. Perhaps the epmd isn't ready yet when it is used immediately after starting?

Here's the rebar3 report:

$ rebar3 report eunit
Rebar3 report
 version 3.11.1+build.4404.ref0d9e5bb3
 generated at 2019-07-04T12:30:27+00:00
=================
Please submit this along with your issue at https://github.com/erlang/rebar3/issues (and feel free to edit out private information, if any)
-----------------
Task: eunit
Entered as:
  eunit
-----------------
Operating System: win32
ERTS: Erlang/OTP 20 [erts-9.2] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]
Root Directory: c:/Program Files/erl9.2
Library directory: c:/Program Files/erl9.2/lib
-----------------
Loaded Applications:
bbmustache: 1.6.1
certifi: 2.5.1
cf: 0.2.2
common_test: 1.15.3
compiler: 7.1.4
crypto: 4.2
cth_readable: 1.4.4
dialyzer: 3.2.3
edoc: 0.9.2
erlware_commons: 1.3.1
eunit: 2.3.5
eunit_formatters: 0.5.0
getopt: 1.0.1
hipe: 3.17
inets: 6.4.5
kernel: 5.4.1
providers: 1.8.1
public_key: 1.5.2
relx: 3.32.1
sasl: 3.1.1
snmp: 5.2.9
ssl_verify_fun: 1.1.5
stdlib: 3.4.3
syntax_tools: 2.1.4
tools: 2.11.1

-----------------
Escript path: c:/Program Files/rebar3/rebar3
Providers:
  app_discovery as clean compile compile cover ct deps dialyzer do edoc escriptize eunit get-deps help install install_deps list lock new path pkgs release relup report repos shell state tar tree unlock update upgrade upgrade upgrade version xref

The error occurs in this project, but I don't think the issue is anything project specific. I can reproduce the error by running rebar3 eunit with the following test file:

-module(odpi_eunit).
-include_lib("eunit/include/eunit.hrl").
eunit_test_() -> ?_assert(true).

The rebar config file is here. It's not a very big issue as I'm losing just a few seconds a day, and on travis we just start the epmd manually, but we wouldn't mind a more elegant solution. ;)

Thank you for your time.

ferd commented 5 years ago

Hm, never seen this before on Windows or otherwise. What you're saying is that epmd is already running when the error happens, hence the need to kill it and restart it? The econnrefused makes me think it's one of a few things:

Next time it happens, try calling epmd -names in the terminal before killing it, and see if there's anything odd out there: similar nodes with the same name hanging around, and if so, whether the process is actually there or not. Usually epmd -kill wouldn't work there if the list wasn't empty though, so it would be surprising if it returned anything.

The second point is possible, but it would be interesting to start another node right next to the stuck one with erl directly and see what shows up in the list instead. If it works, we can blame something odd in rebar3; if it fails, we can blame EPMD itself and that's likely an issue to open up in the OTP project.

The last point would be something I just guess happens when someone has denied the permission for the program to connect to the network, and I can't really imagine it taking place, but my understanding of windows permissions isn't great so I'm leaving that possibility open.

KarlKeiser commented 5 years ago

Hm, never seen this before on Windows or otherwise. What you're saying is that epmd is already running when the error happens, hence the need to kill it and restart it?

Well, it only happens the first time (which is when rebar3 starts the epmd). So it should be running, but it just started running. Perhaps it's not ready yet immediately after being started. Is it possible that starting the epmd is asynchronous and it is not done starting yet when it is needed immediately afterwards?
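
If the "not ready yet" hypothesis were the cause, one way to probe it would be to poll epmd until it answers before relying on it. The following is a minimal sketch of that idea, not anything rebar3 does: the module and function names (epmd_wait, wait_until/3, wait_for_epmd/2) are hypothetical, and only net_adm:names/0 and timer:sleep/1 are standard OTP calls.

```erlang
-module(epmd_wait).
-export([wait_until/3, wait_for_epmd/2]).

%% Call Check() until it returns true. After a failed check, retry at most
%% Retries more times, sleeping DelayMs milliseconds between attempts.
wait_until(Check, Retries, DelayMs) when is_function(Check, 0) ->
    case Check() of
        true ->
            ok;
        _ when Retries > 0 ->
            timer:sleep(DelayMs),
            wait_until(Check, Retries - 1, DelayMs);
        _ ->
            {error, timeout}
    end.

%% Hypothetical helper: treat epmd as ready once net_adm:names/0 can query it.
%% net_adm:names/0 returns {ok, Names} when epmd answers, {error, Reason} otherwise.
wait_for_epmd(Retries, DelayMs) ->
    wait_until(fun() ->
                       case net_adm:names() of
                           {ok, _} -> true;
                           {error, _} -> false
                       end
               end, Retries, DelayMs).
```

For example, epmd_wait:wait_for_epmd(20, 100) would wait roughly two seconds at most before giving up with {error, timeout}.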

Next time it happens, try calling epmd -names in the terminal before killing it, and see if there's anything odd out there:

I get the following output:

$  '/c/Program Files/erl9.2/erts-9.2/bin/epmd.exe' -names
epmd: up and running on port 4369 with data:
name a at port 51231

The second point is possible, but it would be interesting to start another node right next to the stuck one with erl directly and see what shows up in the list instead.

This works for me. While it is stuck in one shell, I can open a new one and run the tests there. The node gets added to the list and then disappears again when the tests are done. If I run this command

erl -rsh ssh -sname foo -setcookie mycookie

Then I get this for the list:

$  '/c/Program Files/erl9.2/erts-9.2/bin/epmd.exe' -names
epmd: up and running on port 4369 with data:
name foo at port 52744
name a at port 56632

ferd commented 5 years ago

And to be sure, name a is the stuck node? That's kind of odd that epmd stores its name and whatnot, but the node itself fails.

c-bik commented 5 years ago

And to be sure, name a is the stuck node? That's kind of odd that epmd stores its name and whatnot, but the node itself fails.

Here is the mysterious a https://github.com/erlang/rebar3/blob/416176290b20e1e68c5901f83cccef71ec2bc322/src/rebar_dist_utils.erl#L70

ferd commented 5 years ago

odd, so that one was starting but is kinda stuck in a stopped mode. Were you trying to also start node a per chance? Maybe that was a bit racy and caused a name clash. We should make the name random instead.
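
A rough sketch of what a non-fixed node name could look like. This is an illustration only, not the rebar3 patch: the module name random_name and the function short_name/1 are made up for the example, and the name is unique per call rather than strictly random.

```erlang
-module(random_name).
-export([short_name/1]).

%% Build a short node name (as a string) from a prefix plus a unique suffix.
%% erlang:unique_integer/1 is only unique within the calling runtime, so the
%% OS process id is mixed in to keep concurrent runs apart.
short_name(Prefix) when is_list(Prefix) ->
    Suffix = integer_to_list(erlang:unique_integer([positive])),
    Prefix ++ "_" ++ os:getpid() ++ "_" ++ Suffix.
```

The result could then be spliced into the command in place of the fixed a, for instance "erl -sname " ++ random_name:short_name("rebar") ++ " -eval 'halt(0).'".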

c-bik commented 5 years ago

Did a quick test which shows the same start hang of epmd

[screenshot]

It doesn't seem like rebar's fault, except perhaps for assuming that os:cmd("erl -sname a -eval 'halt(0).'"). returns on Windows with erts-9.2!

Could it be that those two hidden CMD shells are waiting for user interaction?

Perhaps -detached might help: [screenshot]

os:cmd("erl -sname a -detached -eval 'halt(0).'"). 

ferd commented 5 years ago

Interesting. The idea is that the -eval 'halt(0).' should make the node quit and exit, and almost immediately return. If you want a quick check, replace erl with werl.exe and a whole window should open along with it.

KarlKeiser commented 5 years ago

I just tried compiling rebar3 with -detached put in and it works! I still get the econnrefused error message when the epmd was not started yet, but it correctly runs the tests nevertheless:

=INFO REPORT==== 5-Jul-2019::16:18:38 ===
Protocol 'inet_tcp': register/listen error: econnrefused
======================== EUnit ========================
file "oranif.app"
  application 'oranif'
    module 'dpi'
    module 'dpi_transform'

Also, when I list the names in epmd the node a is not to be seen anymore.

ferd commented 5 years ago

Alright. As long as the -eval remains it could make sense. It sounds a bit tricky because removing it and going detached means we no longer have a synchronous load mechanism and risk being even more race-condition prone I think. If you can try the werl.exe thing quickly we might get some more info, and in the worst case, if we get nothing, we'll have to go for -detached (albeit only on Windows)

c-bik commented 5 years ago

... If you can try the werl.exe thing quickly we might get some more info...

[screenshot] But, interestingly... [screenshot]

The error with werl.exe is probably a local problem in my setup with paths!

Anyways, @ferd, you are right werl.exe exits immediately, but so does erl.exe!

ferd commented 5 years ago

curious. Then I'm a bit stumped about the hanging. It might just make sense to special case windows and go with -detached in this case.
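
A sketch of what such a special case might look like, assuming the command strings quoted earlier in this thread. The module name epmd_boot and both function names are hypothetical, and this is not the code that ended up in rebar3.

```erlang
-module(epmd_boot).
-export([boot_cmd/1, start_epmd/0]).

%% Command for a throwaway node whose only purpose is to get epmd started.
%% The argument has the shape returned by os:type/0, e.g. {win32, nt}.
boot_cmd({win32, _}) ->
    %% Windows only: -detached so that os:cmd/1 does not block
    "erl -sname a -detached -eval 'halt(0).'";
boot_cmd(_) ->
    %% Elsewhere: keep the synchronous variant
    "erl -sname a -eval 'halt(0).'".

%% Run the command that matches the current operating system.
start_epmd() ->
    os:cmd(boot_cmd(os:type())).
```

Keeping the OS check in a pure function makes the choice easy to test without spawning a node, and easy to delete again if the underlying behaviour gets fixed.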

c-bik commented 5 years ago

curious. Then I'm a bit stumped about the hanging.

Here is how to cause the hang (Windows only)! [screenshot]

On CentOS 6, by contrast, halt(0). lets os:cmd/1 return:

[bikram@WKS015 ~]$ epmd -names
epmd: Cannot connect to local epmd
[bikram@WKS015 ~]$ erl
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V9.2  (abort with ^G)
% `epmd -names` is still `epmd: Cannot connect to local epmd` at this point
1> os:cmd("erl -sname a -eval 'halt(0).'").
[]
2>

ferd commented 5 years ago

Well that's sad. Will add a workaround for windows.

ferd commented 5 years ago

Oh quick check if you can: does it work better if you add in arguments such as -noinput or -nouser rather than -detached ? If it's some stdio handling that causes this, those would fix it without involving async returns.

c-bik commented 5 years ago

@karl was compiling rebar3 and trying things out. He left for the day. I will see what I can do quickly!

c-bik commented 5 years ago

does it work better if you add in arguments such as -noinput or -nouser rather than -detached?

OK, tried it, and a few more revelations 😮

Test:

Microsoft Windows [Version 10.0.17763.557]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\Bikram>erl
Eshell V9.2  (abort with ^G)
% check if the following blocks with different flags
1> os:cmd("erl -sname a *** -eval 'halt(0).'").
flags               epmd already running   blocks
-noinput            yes                    yes
-noinput            no                     yes
-nouser             yes                    yes
-nouser             no                     yes
-noinput -nouser    yes                    yes
-noinput -nouser    no                     yes
-detached           yes                    no
-detached           no                     no

In all the cases above EPMD was started successfully if it wasn't running before

If it's some stdio handling that causes this, those would fix it without involving async returns.

Wondering if open_port behaves differently than os:cmd. At least with a port, stdio can be better controlled (I guess).

Can try port on Monday.

Have a nice weekend!

ferd commented 5 years ago

Alright, I'll wait for the result on Monday before writing a patch, it's not a super long one, and not a blocking bug either.

c-bik commented 5 years ago

@ferd Couldn't wait till Monday 😉.

Test with open_port instead of os:cmd

When no epmd is running

Make sure epmd isn't running already

> epmd -names
epmd: Cannot connect to local epmd

Start an erlang node as port (with exit_status option) without -eval 'halt(0).' parameter

> f(Port).
> Port = open_port({spawn, "erl -sname a"}, [exit_status]).

% flush to show that port is up and running (can be done with receive too)
> flush().
Shell got {#Port<0.426>,{data,"Eshell V9.2  (abort with ^G)\n"}}
Shell got {#Port<0.426>,{data,"(a@WKS006)1> "}}

epmd is now started and node a is in the list

> epmd -names
epmd: up and running on port 4369 with data:
name a at port 1796

Send halt(0) to the node and make sure port exits

> true = erlang:port_command(Port, "halt(0).\r\n").
> receive {Port, {exit_status, 0}} -> ok end.
ok

epmd is still running but node a is gone

> epmd -names
epmd: up and running on port 4369 with data:

When epmd is already running

Make sure epmd is running and there isn't any node named a

> epmd -names
epmd: up and running on port 4369 with data:

Start an erlang node as port (with exit_status option) without -eval 'halt(0).' parameter

> f(Port).
> Port = open_port({spawn, "erl -sname a"}, [exit_status]).

% flush to show that port is up and running (can be done with receive too)
> flush().
Shell got {#Port<0.426>,{data,"Eshell V9.2  (abort with ^G)\n"}}
Shell got {#Port<0.426>,{data,"(a@WKS006)1> "}}

epmd now shows node a is in the list

> epmd -names
epmd: up and running on port 4369 with data:
name a at port 1941

Send halt(0) to the node and make sure port exits

> true = erlang:port_command(Port, "halt(0).\r\n").
> receive {Port, {exit_status, 0}} -> ok end.
ok

epmd is still running but node a is gone

> epmd -names
epmd: up and running on port 4369 with data:

In Conclusion

It seems that a port, though a bit more elaborate, gives better visibility into the running state (of erl -sname a) than os:cmd.

I also noticed while doing this test, that -eval 'halt(0).' is ignored in windows (sometimes) - possibly a race between node start and eval!

The above test can be converted into an implementation and can replace https://github.com/erlang/rebar3/blob/416176290b20e1e68c5901f83cccef71ec2bc322/src/rebar_dist_utils.erl#L70 with a few more lines of code but in an OS independent way!

I can also PR this quickly if you like.
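
The manual steps above could be condensed into something like the following. It is a minimal sketch under stated assumptions, not the eventual patch: the module name epmd_port and the function start_and_halt/2 are hypothetical, the command string is passed in by the caller, and the "\r\n" line ending is simply the one used in the test above.

```erlang
-module(epmd_port).
-export([start_and_halt/2]).

%% Spawn Cmd as a port, send it halt(0). on stdin, then wait for it to exit.
%% Returns {ok, ExitStatus} or {error, timeout}.
start_and_halt(Cmd, TimeoutMs) ->
    Port = open_port({spawn, Cmd}, [exit_status]),
    true = erlang:port_command(Port, "halt(0).\r\n"),
    wait_exit(Port, TimeoutMs).

wait_exit(Port, TimeoutMs) ->
    receive
        {Port, {exit_status, Status}} ->
            {ok, Status};
        {Port, {data, _}} ->
            %% Discard shell banner and prompt output; note that the
            %% timeout window restarts after every data message.
            wait_exit(Port, TimeoutMs)
    after TimeoutMs ->
        %% Give up: close the port so the child loses its stdin.
        catch erlang:port_close(Port),
        {error, timeout}
    end.
```

Following the test above, a call might look like epmd_port:start_and_halt("erl -sname a", 5000), after which epmd should still be running while node a is gone.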

ferd commented 5 years ago

I also noticed while doing this test, that -eval 'halt(0).' is ignored in windows (sometimes) - possibly a race between node start and eval!

I figure this is the core of the bug right there. I'll try to use the port program as a workaround for Windows only: the os:cmd/1 version is cleaner and easier to maintain, so if things eventually get fixed we can revert to that.

garazdawi commented 5 years ago

I've not really been following the details but if you think this is something that we can fix in erlang/otp, please do file a ticket at bugs.erlang.org.

ferd commented 5 years ago

@garazdawi from the latest post, it seems that in some odd cases, when erl -sname a -eval 'halt(0).' is run on Windows and EPMD needs to be started, the process will either not die (nor return) or could die without having started EPMD properly?

garazdawi commented 5 years ago

I'm on vacation right now, so I don't have time to look closer. If you think it is a bug, open an issue and I or someone else will take a look at it.

ferd commented 5 years ago

Opened at https://bugs.erlang.org/browse/ERL-994

ferd commented 4 years ago

Fixed by @garazdawi