Closed KarlKeiser closed 4 years ago
Hm, never seen this before on Windows or otherwise. What you're saying is that epmd is already running when the error happens, hence the need to kill it and restart it? The `econnrefused` makes me think it's one of a few things:
Next time it happens, try calling `epmd -names` in the terminal before killing it, and see if there's anything odd out there: similar nodes with the same name hanging around, and if so, whether the process is actually there or not. Usually `epmd -kill` wouldn't work there if the list wasn't empty though, so it would be surprising if it returned anything.
The second point is possible, but it would be interesting to start another node right next to the stuck one with `erl` directly and see what shows up in the list instead. If it works, we can blame something odd in rebar3; if it fails, we can blame EPMD itself and that's likely an issue to open up in the OTP project.
The last point is just a guess at what happens when someone has denied the program permission to connect to the network. I can't really imagine it taking place, but my understanding of Windows permissions isn't great, so I'm leaving that possibility open.
Hm, never seen this before on Windows or otherwise. What you're saying is that epmd is already running when the error happens, hence the need to kill it and restart it?
Well, it only happens the first time (which is when `rebar3` starts the `epmd`). So it should be running, but it just started running. Perhaps it's not ready yet immediately after being started. Is it possible that starting the `epmd` is asynchronous and it is not done starting yet when it is needed immediately afterwards?
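One way to probe the "not ready yet" theory is to poll the EPMD port until it accepts a TCP connection. The following is only a sketch under assumptions: the module name `epmd_wait`, the function `wait_for_epmd/2` and its retry parameters are hypothetical, and port 4369 is taken from the `epmd -names` output in this thread. It is not what rebar3 does.

```erlang
-module(epmd_wait).
-export([wait_for_epmd/2]).

%% Hypothetical helper: try to connect to the local EPMD port up to
%% Retries times, sleeping DelayMs between attempts.
%% Returns ok once a connection succeeds.
wait_for_epmd(0, _DelayMs) ->
    {error, epmd_not_reachable};
wait_for_epmd(Retries, DelayMs) when Retries > 0 ->
    %% 4369 is the default EPMD port reported by `epmd -names`.
    case gen_tcp:connect({127,0,0,1}, 4369, [binary, {active, false}], 1000) of
        {ok, Socket} ->
            ok = gen_tcp:close(Socket),
            ok;
        {error, _Reason} ->
            timer:sleep(DelayMs),
            wait_for_epmd(Retries - 1, DelayMs)
    end.
```

Calling `epmd_wait:wait_for_epmd(10, 200)` right after the first `rebar3` run would show whether EPMD becomes reachable within a couple of seconds.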
Next time it happens, try calling epmd -names in the terminal before killing it, and see if there's anything odd out there:
I get the following output:
$ '/c/Program Files/erl9.2/erts-9.2/bin/epmd.exe' -names
epmd: up and running on port 4369 with data:
name a at port 51231
The second point is possible, but it would be interesting to start another node right next to the stuck one with erl directly and see what shows up in the list instead.
This works for me. While it is stuck in one shell, I can open a new one and run the tests there. The node gets added to the list and then disappears again when the tests are done. If I run this command:
erl -rsh ssh -sname foo -setcookie mycookie
Then I get this for the list:
$ '/c/Program Files/erl9.2/erts-9.2/bin/epmd.exe' -names
epmd: up and running on port 4369 with data:
name foo at port 52744
name a at port 56632
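The same check can be done from an Erlang shell with `net_adm:names/0`, which asks the local EPMD for its registered names. This is a minimal sketch; the module name `epmd_names` and its two functions are hypothetical helpers, not part of rebar3.

```erlang
-module(epmd_names).
-export([registered/1, is_registered/2]).

%% Names has the shape returned inside {ok, Names} by net_adm:names/0,
%% i.e. a list of {NameString, Port} tuples.
is_registered(Name, Names) ->
    lists:keymember(Name, 1, Names).

%% Ask the local EPMD whether a node with this short name is registered.
registered(Name) ->
    case net_adm:names() of
        {ok, Names} -> is_registered(Name, Names);
        {error, _Reason} -> false   % EPMD is not reachable on this host
    end.
```

With the list shown above, `epmd_names:registered("a")` would be expected to return `true` while the stuck node is still registered.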
And to be sure, name `a` is the stuck node? That's kind of odd that epmd stores its name and whatnot, but the node itself fails.
And to be sure, name `a` is the stuck node? That's kind of odd that epmd stores its name and whatnot, but the node itself fails.
Here is the mysterious `a`:
https://github.com/erlang/rebar3/blob/416176290b20e1e68c5901f83cccef71ec2bc322/src/rebar_dist_utils.erl#L70
Odd, so that one was starting but is kinda stuck in a stopped mode. Were you also trying to start node `a`, perchance? Maybe that was a bit racy and caused a name clash. We should make the name random instead.
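As an illustration of the "random name" idea, a short node name can be built from a prefix plus a random hex suffix. This is a sketch only: the module `rand_name`, the function `random_sname/1` and the suffix format are assumptions made for the example, not the change that went into rebar3.

```erlang
-module(rand_name).
-export([random_sname/1]).

%% Hypothetical helper: append a zero-padded, 6-digit random hex suffix
%% to Prefix so that two concurrent runs are unlikely to pick the same name.
random_sname(Prefix) ->
    Suffix = rand:uniform(16#FFFFFF),   % 1..16#FFFFFF, auto-seeded
    lists:flatten(io_lib:format("~s_~6.16.0b", [Prefix, Suffix])).
```

The resulting string could then be passed to `-sname` in place of the fixed `a`.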
Did a quick test which shows the same start hang of `epmd`.
It doesn't seem like rebar's fault, except perhaps for assuming that `os:cmd("erl -sname a -eval 'halt(0).'").` returns on Windows with erts-9.2!
Could it be that those two hidden CMD shells are waiting for user interaction?
Perhaps `-detached` might help:
os:cmd("erl -sname a -detached -eval 'halt(0).'").
Interesting. The idea is that the `-eval 'halt(0).'` should make the node quit and exit, and almost immediately return. If you want a quick check, replace `erl` with `werl.exe` and a whole window should open along with it.
I just tried compiling `rebar3` with `-detached` put in and it works! I still get the `econnrefused` error message when the `epmd` was not started yet, but it correctly runs the tests nevertheless:
=INFO REPORT==== 5-Jul-2019::16:18:38 ===
Protocol 'inet_tcp': register/listen error: econnrefused
======================== EUnit ========================
file "oranif.app"
application 'oranif'
module 'dpi'
module 'dpi_transform'
Also, when I list the names in `epmd`, the node `a` is not to be seen anymore.
Alright. As long as the `-eval` remains it could make sense. It sounds a bit tricky because removing it and going detached means we no longer have a synchronous load mechanism and risk being even more race-condition prone, I think. If you can try the `werl.exe` thing quickly we might get some more info, and in the worst case, if we get nothing, we'll have to go for `-detached` (albeit only on Windows).
... If you can try the `werl.exe` thing quickly we might get some more info...
But, interestingly...
The error with `werl.exe` is probably a local problem in my setup with paths!
Anyway, @ferd, you are right: `werl.exe` exits immediately, but so does `erl.exe`!
Curious. Then I'm a bit stumped about the hanging. It might just make sense to special-case Windows and go with `-detached` in this case.
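Special-casing Windows could look roughly like the sketch below, which picks the command string based on `os:type/0`. The module `epmd_cmd` and the function `start_command/1` are hypothetical; the two command strings are the ones quoted in this thread.

```erlang
-module(epmd_cmd).
-export([start_command/0, start_command/1]).

%% Choose the command for the current OS.
start_command() ->
    start_command(os:type()).

%% On Windows, add -detached so the caller does not block (see the
%% discussion above); elsewhere keep the synchronous variant.
start_command({win32, _OsName}) ->
    "erl -sname a -detached -eval 'halt(0).'";
start_command(_Other) ->
    "erl -sname a -eval 'halt(0).'".
```

The result would be handed to `os:cmd/1` as before, so only the Windows path gives up the synchronous return.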
curious. Then I'm a bit stumped about the hanging.
Here is how to cause the hang (Windows only)! On CentOS 6, by contrast, `halt(0).` makes `os:cmd/1` return:
[bikram@WKS015 ~]$ epmd -names
epmd: Cannot connect to local epmd
[bikram@WKS015 ~]$ erl
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V9.2 (abort with ^G)
% `epmd -names` is still `epmd: Cannot connect to local epmd` at this point
1> os:cmd("erl -sname a -eval 'halt(0).'").
[]
2>
Well that's sad. Will add a workaround for windows.
Oh, quick check if you can: does it work better if you add in arguments such as `-noinput` or `-nouser` rather than `-detached`? If it's some stdio handling that causes this, those would fix it without involving async returns.
@karl was compiling `rebar3` and trying things out. He left for the day. I will see what I can do quickly!
does it work better if you add in arguments such as `-noinput` or `-nouser` rather than `-detached`?
OK, tried it, and a few more revelations 😮
Test:
Microsoft Windows [Version 10.0.17763.557]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\Bikram>erl
Eshell V9.2 (abort with ^G)
% check if the following blocks with different flags
1> os:cmd("erl -sname a *** -eval 'halt(0).'").
| flags | epmd already running | blocks |
|---|---|---|
| `-noinput` | yes | yes |
| `-noinput` | no | yes |
| `-nouser` | yes | yes |
| `-nouser` | no | yes |
| `-noinput -nouser` | yes | yes |
| `-noinput -nouser` | no | yes |
| `-detached` | yes | no |
| `-detached` | no | no |
In all the cases above, EPMD was started successfully if it wasn't running before.
If it's some stdio handling that causes this, those would fix it without involving async returns.
Wondering if `open_port` behaves differently than `os:cmd`. At least with a port, stdio can be better controlled (I guess) :suspect:
Can try port on Monday.
Have a nice weekend!
Alright, I'll wait for the result on Monday before writing a patch, it's not a super long one, and not a blocking bug either.
@ferd Couldn't wait till Monday 😉.
Testing `open_port` instead of `os:cmd`.

Case 1: `epmd` is not running

`epmd` isn't running already:
> epmd --names
epmd: Cannot connect to local epmd

Open the port (with the `exit_status` option) without the `-eval 'halt(0).'` parameter:
> f(Port).
> Port = open_port({spawn, "erl -sname a"}, [exit_status]).
% flush to show that port is up and running (can be done with receive too)
> flush().
Shell got {#Port<0.426>,{data,"Eshell V9.2 (abort with ^G)\n"}}
Shell got {#Port<0.426>,{data,"(a@WKS006)1> "}}

`epmd` is now started and node `a` is in the list:
> epmd --names
epmd: up and running on port 4369 with data:
name a at port 1796

Send `halt(0)` to the node and make sure the port exits:
> true = erlang:port_command(Port, "halt(0).\r\n").
> receive {Port, {exit_status, 0}} -> ok end.
ok

`epmd` is still running but node `a` is gone:
> epmd --names
epmd: up and running on port 4369 with data:

Case 2: `epmd` is already running

`epmd` is running and there isn't any node named `a`:
> epmd --names
epmd: up and running on port 4369 with data:

Open the port (with the `exit_status` option) without the `-eval 'halt(0).'` parameter:
> f(Port).
> Port = open_port({spawn, "erl -sname a"}, [exit_status]).
% flush to show that port is up and running (can be done with receive too)
> flush().
Shell got {#Port<0.426>,{data,"Eshell V9.2 (abort with ^G)\n"}}
Shell got {#Port<0.426>,{data,"(a@WKS006)1> "}}

`epmd` now shows node `a` in the list:
> epmd --names
epmd: up and running on port 4369 with data:
name a at port 1941

Send `halt(0)` to the node and make sure the port exits:
> true = erlang:port_command(Port, "halt(0).\r\n").
> receive {Port, {exit_status, 0}} -> ok end.
ok

`epmd` is still running but node `a` is gone:
> epmd --names
epmd: up and running on port 4369 with data:
It seems that using a port, though a bit more elaborate, gives better visibility into the running state of `erl -sname a` than `os:cmd` does.
I also noticed while doing this test that `-eval 'halt(0).'` is ignored on Windows (sometimes), possibly a race between node start and `eval`!
The above test can be converted into an implementation and can replace https://github.com/erlang/rebar3/blob/416176290b20e1e68c5901f83cccef71ec2bc322/src/rebar_dist_utils.erl#L70 with a few more lines of code but in an OS independent way!
I can also PR this quickly if you like.
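A rough sketch of how the shell session above might be turned into a function follows. It is an illustration, not the eventual patch: the module name `epmd_port`, both function names and the timeout handling are assumptions. Only `open_port/2`, `erlang:port_command/2`, `erlang:port_close/1` and the `exit_status` option come from the test itself or from standard OTP.

```erlang
-module(epmd_port).
-export([start_and_stop_node/1, wait_exit/2]).

%% Start `erl -sname a` through a port (which starts EPMD if needed),
%% ask the node to halt, then wait for its exit status.
start_and_stop_node(TimeoutMs) ->
    Port = open_port({spawn, "erl -sname a"}, [exit_status]),
    true = erlang:port_command(Port, "halt(0).\r\n"),
    wait_exit(Port, TimeoutMs).

%% Discard shell output from the node and return its exit status.
%% The timeout restarts after every data message; on timeout the port
%% is closed so the caller is never blocked indefinitely.
wait_exit(Port, TimeoutMs) ->
    receive
        {Port, {data, _Output}} ->
            wait_exit(Port, TimeoutMs);
        {Port, {exit_status, Status}} ->
            {ok, Status}
    after TimeoutMs ->
        catch erlang:port_close(Port),
        {error, timeout}
    end.
```

Unlike `os:cmd/1`, this variant has an explicit timeout, so a node that ignores `halt(0).` would produce `{error, timeout}` rather than a hang.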
I also noticed while doing this test, that -eval 'halt(0).' is ignored in windows (sometimes) - possibly a race between node start and eval!
I figure this is the core of the bug right there. I'll try to use the port program as a workaround for Windows, since the `os:cmd/1` version is cleaner and easier to maintain in all cases, so if things eventually get fixed we can revert to that.
I've not really been following the details but if you think this is something that we can fix in erlang/otp, please do file a ticket at bugs.erlang.org.
@garazdawi from the latest post, it seems that in some odd cases, when `erl -sname a -eval 'halt(0).'` is run on Windows and EPMD needs to be started, the process will either not die (nor return) or could die without having started EPMD properly?
I'm on vacation right now, so I don't have time to look closer. If you think it is a bug, open an issue and I or someone else will take a look at it.
Opened at https://bugs.erlang.org/browse/ERL-994
Fixed by @garazdawi
Hi

When I run `rebar3 eunit` for the first time since starting the computer, I get the following output:

I then have to kill erlang in the task manager. However, any time after that, there are no errors. My guess is that the epmd is the issue, because if I quit it with `epmd.exe -kill` the error reappears the next time I run rebar3. Likewise, if I launch the epmd manually before running rebar3 for the first time, the error doesn't happen. Perhaps the epmd isn't ready yet when it is used immediately after starting?

Here's the `rebar3 report`:

The error occurs in this project, but I don't think the issue is anything project-specific. I can reproduce the error by running `rebar3 eunit` with the following test file:

The rebar config file is here. It's not a very big issue as I'm losing just a few seconds a day, and on travis we just start the epmd manually, but we wouldn't mind a more elegant solution. ;)

Thank you for your time.