dmtcp / dmtcp

DMTCP: Distributed MultiThreaded CheckPointing
http://dmtcp.sourceforge.net/

--new-coordinator breaks dmtcp_launch path search with misleading message #239

Open loveshack opened 8 years ago

loveshack commented 8 years ago

Consider the following attempts to launch. With --new-coordinator, I have to give the full path to the command; otherwise the launch fails with a message that seems misleading. Without --new-coordinator, the bare command name works.

This is with version 2.4.2.

$ dmtcp_launch --new-coordinator --port-file coord-port -p 0 sleep 1
[8157] ERROR at coordinatorapi.cpp:418 in startNewCoordinator; REASON='JASSERT(coordinatorListenerSocket.isValid()) failed'
     coordinatorListenerSocket.port() = -1
     (strerror((*__errno_location ()))) = Permission denied
Message: Failed to create listen socket.
If msg is "Address already in use", this may be an old coordinator.
Kill other coordinators and try again in a minute or so.
dmtcp_launch (8157): Terminating...
$ dmtcp_launch --new-coordinator --port-file coord-port -p 0 /bin/sleep 1
$ dmtcp_launch --port-file coord-port -p 0 sleep 1
$
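
One possible workaround for the failure shown above is to resolve the command name to a full path before handing it to dmtcp_launch. The following is a minimal Python sketch, not part of DMTCP: the helper names resolve_command and build_launch_command are hypothetical, and the flags are simply copied from the commands in this report. It does not explain why the bare name fails; it only avoids relying on the path search.

```python
import os
import shutil


def resolve_command(argv):
    """Return argv with argv[0] replaced by its absolute path found on PATH."""
    found = shutil.which(argv[0])
    if found is None:
        raise FileNotFoundError(f"{argv[0]!r} not found on PATH")
    return [os.path.abspath(found)] + list(argv[1:])


def build_launch_command(argv, port_file="coord-port"):
    """Build a dmtcp_launch command line that always uses a full command path.

    The flags mirror the ones used in this report; port_file is an
    illustrative default, not a DMTCP convention.
    """
    prefix = ["dmtcp_launch", "--new-coordinator",
              "--port-file", port_file, "-p", "0"]
    return prefix + resolve_command(argv)
```

The resulting list could be passed to subprocess.run(build_launch_command(["sleep", "1"]), check=True) on a system where dmtcp_launch is installed.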
gc00 commented 8 years ago

Hi Dave,

I tested this, and I can't reproduce the bug. Is your phenomenon reproducible? If you wait a little while, does:

    dmtcp_launch --new-coordinator --port-file coord-port -p 0 sleep 1

still fail? If so, could you let us know the environment in which you're executing:

    cd DMTCP_ROOT && make display-build-env

Thanks,

----- Original Message ----- From: Dave Love notifications@github.com To: dmtcp/dmtcp dmtcp@noreply.github.com Sent: Mon, 23 Nov 2015 09:01:52 -0500 (EST) Subject: [dmtcp] --new-coordinator breaks dmtcp_launch path search with misleading message (#239)


loveshack commented 8 years ago

Gene Cooperman notifications@github.com writes:

> Hi Dave, I tested this, and I can't reproduce the bug. Is your phenomenon reproducible? (If you wait a little while, does: dmtcp_launch --new-coordinator --port-file coord-port -p 0 sleep 1 still fail?) If so, could you let us know the environment in which you're executing: cd DMTCP_ROOT && make display-build-env
> Thanks,
> - Gene

It's reproducible on the system I was using, but not on another which is very similar. It's not a question of timing on the system on which it fails, but it's clearly something local which I'll have to try to debug. Apologies.

For what it's worth, this is with a simple rebuild of the Fedora packaging of 2.4.2 to add IB support, running on current RHEL6 with dmtcp on both systems above installed from the same local repo. I don't have the exact build directory for the rpm now.

It may be worth keeping this open for a while, and I'll add any more information I get in case anyone else trips over it.

gc00 commented 8 years ago

Yes, let's keep this issue open until we can track down the cause.

Some systems are very slow to release an old port that is no longer in use. On the other hand, you're using --port 0, which should select a new random port, so this seems rather mysterious.
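
To illustrate the point about port 0, here is a minimal Python sketch (not DMTCP's code; the function name listen_on_ephemeral_port is hypothetical). Binding a TCP socket to port 0 asks the kernel to pick an unused port, so an "Address already in use" failure is unlikely in that case. The error in the report is "Permission denied", which is a different errno and is not explained by a lingering old port.

```python
import socket


def listen_on_ephemeral_port(host="127.0.0.1"):
    """Open a listening TCP socket on a kernel-chosen port.

    Returns (sock, port). The caller is responsible for closing sock.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))   # port 0: let the kernel choose a free port
        sock.listen(1)
    except OSError:
        sock.close()           # do not leak the descriptor on failure
        raise
    return sock, sock.getsockname()[1]
```

Running this on the failing machine, under the same account, is one way to check whether plain socket creation and listening also fail there, independently of DMTCP.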

I tested on nmi.cs.wisc.edu with RHEL6 (RedHatEnterpriseServer), and it worked for me there. Here is the build environment from that site, in case it helps to compare.

bash-4.1$ cd dmtcp
bash-4.1$ make display-build-env
DMTCP version: 2.5.0
Date built:    Tue Nov 24 12:31:57 CST 2015
config.log:    ./configure 
Description:    Red Hat Enterprise Linux Server release 6.7 (Santiago)
Codename:       Santiago
Linux exec-13.batlab.org 2.6.32-573.3.1.el6.x86_64 #1 SMP Mon Aug 10 09:44:54 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
Compiler:  gcc
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) 
CFLAGS: -g -O2
CXXFLAGS: -g -O2
CPPFLAGS: 
LDFLAGS: 
java version "1.6.0_36"
OpenJDK Runtime Environment (IcedTea6 1.13.8) (rhel-1.13.8.1.el6_7-x86_64)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
lrwxrwxrwx 1 root root 12 Oct 22 17:45 /lib64/libc.so.6 -> libc-2.12.so
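
For comparing this output with the output from another machine, a small script can pick out the "Key: value" lines and report the ones that differ. This is a rough Python sketch under stated assumptions: the key list below is taken from the labels visible in the output above, and the function names parse_build_env and diff_build_env are hypothetical. Note that the output above is from DMTCP 2.5.0, while the report is against 2.4.2, so the version line is already expected to differ.

```python
# Labels taken from the display-build-env output shown in this thread.
KEYS = ("DMTCP version", "Description", "Compiler",
        "Thread model", "CFLAGS", "CXXFLAGS")


def parse_build_env(text):
    """Extract the known 'Key: value' fields from display-build-env output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in KEYS:
            fields[key] = value.strip()
    return fields


def diff_build_env(a_text, b_text):
    """Return {key: (value_in_a, value_in_b)} for fields that differ."""
    a, b = parse_build_env(a_text), parse_build_env(b_text)
    return {k: (a.get(k), b.get(k)) for k in KEYS if a.get(k) != b.get(k)}
```

Given the two outputs saved as strings, diff_build_env(working, failing) returns only the fields worth a closer look; lines without a recognized label (such as the uname line) are ignored.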

Would it be convenient for you to give us a temporary guest account (but don't send any information via GitHub, of course :-) ), so that we can observe this? (Alternatively, if you know of a VM where this is reproducible, we could look there.)