Hostlist file fix added in PR #18 can help with this environment as well.
I pushed a commit into my fork to start assisting James Southern (jsouthern@sgi.com) with porting STAT/LaunchMON to Intel Hydra for AWE: the commit is here
As you can see from here, the LaunchMON backend API expects its options to appear at the end of the command line. So if mpiexec.hydra appends other things to the backend launch string, LaunchMON will not proceed.
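To illustrate the constraint (a hypothetical sketch of the kind of check involved, not the actual LaunchMON parsing code; the function name is made up):

```cpp
#include <cstring>

// The back end looks for its own options (--lmonsharedsec / --lmonsecchk) in the
// trailing command-line arguments, so anything the launcher appends after them
// breaks the parse. Illustration only.
bool lmon_opts_at_end(int argc, char **argv) {
  if (argc < 2)
    return false;
  return std::strncmp(argv[argc - 2], "--lmonsharedsec=", 16) == 0 &&
         std::strncmp(argv[argc - 1], "--lmonsecchk=", 13) == 0;
}
```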
I guess that's sort of the case from your email:
I realised that I can get Intel MPI to print its own command line arguments via its “-v” flag, so I can make some progress with debugging what is going on. At the moment, I set the following lines in rm_intel_hydra.conf:
RM=intel_hydra
RM_MPIR=STD
RM_launcher=mpiexec.hydra
RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
RM_jobid=RM_launcher|sym|totalview_jobid|string
RM_launch_helper=mpiexec.hydra
RM_signal_for_kill=SIGINT|SIGINT
RM_fail_detection=true
RM_launch_str=-v -f %l -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c
This results in the following command line for mpiexec.hydra when running LaunchMON:
mpiexec.hydra -f /nas/store/jsouthern/STAT/hostnamefn.30456 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161
I can see that the hostnamefn. file is created in src/linux/sdbg_linux_launchmon.cxx from the proctable, so I guess those are the places where I need to insert my nodelist. However, there does seem to be an error with the command line. For one, it appears to specify the STATD executable twice. Should this be the case? When I try to run the mpiexec command myself, the command line as specified above results in errors (see below). When I remove the second call to STATD, however, there are no errors (although I can’t tell whether or not the daemons attach successfully, since the call just waits – presumably for the next part of the LaunchMON code).
jsouthern@r2i7n11 ~/STAT $ mpiexec.hydra -hosts r2i7n11 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161
<May 06 06:06:46> (ERROR): LaunchMON-specific arguments have not been passed to the daemon through the command-line arguments.
<May 06 06:06:46> (ERROR): the command line that the user provided could have been truncated.
^C[mpiexec@r2i7n11] Sending Ctrl-C to processes as requested
[mpiexec@r2i7n11] Press Ctrl-C again to force abort
jsouthern@r2i7n11 ~/STAT $ mpiexec.hydra -hosts r2i7n11 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3
^C[mpiexec@r2i7n11] Sending Ctrl-C to processes as requested
[mpiexec@r2i7n11] Press Ctrl-C again to force abort
@jsthrn: testing for your GitHub id.
The test of my ID worked. I got the email and the link points to my profile.
James
I checked out the intel_hydra_prelim branch. Unfortunately I can't get it to build. After updating autotools, I now see the following output:
jsouthern@cy001 ~/launchmon $ CPP="gcc -E -P" CPPFLAGS="-I/store/jsouthern/tmp/install/include -I/store/jsouthern/packages/boost/1.60.0/include" LDFLAGS="-L/store/jsouthern/tmp/install/lib" ./configure --prefix=/store/jsouthern/tmp/install --with-myboost=/store/jsouthern/packages/boost/1.60.0
configure: WARNING: unrecognized options: --with-myboost
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for pkg-config... /store/jsouthern/packages/pkg-config/0.29.1/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking whether to enable maintainer-specific portions of Makefiles... no
checking whether to turn on a workaround for slurm's MPIR_partitial_attach_ok bug... no
checking whether to enable debug codes... no
checking whether to enable verbose codes... no
./configure: line 3950: syntax error near unexpected token `1.2.0,'
./configure: line 3950: `AM_PATH_LIBGCRYPT(1.2.0,'
Is this something that you have seen before? I can see that there was a version of libgcrypt in the tools/ directory previously, but now that is missing. Do I need to install a version elsewhere (and then provide a way for automake to see it)?
Regarding mpiexec.hydra appending its own flags to the backend, I can certainly see that could be possible (the "--exec-<>" ones). However, there are also two copies of "/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161" in the command line. One of these is the very last thing, so that would suggest that things should actually be ok.
The full daemon command line (copied from above, but a bit more readable here!) is:
mpiexec.hydra -f /nas/store/jsouthern/STAT/hostnamefn.30456 -n 1 \
/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 \
--exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 \
--exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD \
--lmonsharedsec=2082992184 --lmonsecchk=548371161
So, this does have the Launchmon options right at the end as required.
Note that for another application I get the following (which also has two copies of the executable - again with one at the end, so maybe that is correct?):
mpiexec.hydra -v -n 4 ./simple --exec --exec-appnum 0 --exec-proc-count 4 \
--exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 1 ./simple
Is this something that you have seen before? I can see that there was a version of libgcrypt in the tools/ directory previously, but now that is missing. Do I need to install a version elsewhere (and then provide a way for automake to see it)?
The bundled gcrypt has been deprecated, as the bundled version was getting old and has caused problems for various packaging systems. As long as you have a decent gcrypt package installed on your system, this should be okay.
CPP="gcc -E -P" CPPFLAGS="-I/store/jsouthern/tmp/install/include -I/store/jsouthern/packages/boost/1.60.0/include" LDFLAGS="-L/store/jsouthern/tmp/install/lib" ./configure --prefix=/store/jsouthern/tmp/install --with-myboost=/store/jsouthern/packages/boost/1.60.0 configure: WARNING: unrecognized options: --with-myboost
--with-myboost has also been deprecated, and a version of Boost is now a requirement to build LaunchMON. Can you make sure the following packages are installed on your system? (What Linux distribution are you using?)
What happens if you just run the following once these requirements are satisfied?
% bootstrap
% CPP="gcc -E -P" --prefix=/store/jsouthern/tmp/install
./configure: line 3950: syntax error near unexpected token `1.2.0,'
./configure: line 3950: `AM_PATH_LIBGCRYPT(1.2.0,'
Did bootstrap give you any error message about AM_PATH_LIBGCRYPT?
Regarding mpiexec.hydra appending its own flags to the backend, I can certainly see that could be possible (the "--exec-<>" ones). However, there are also two copies of "/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161" in the command line. One of these is the very last thing, so that would suggest that things should actually be ok.
OK. Thanks. Once you get to the point where you can reproduce the original problem using LaunchMON's own simple tests with the new version, let's tease apart this problem as well.
So, after building various packages and updating the Launchmon build, it looks like I can now reproduce the original problem. Output (with "-V" switched off in mpiexec.hydra) is:
jsouthern@r1i3n22 ~/STAT $ ps -u jsouthern
PID TTY TIME CMD
51257 pts/0 00:00:00 bash
51258 pts/0 00:00:00 pbs_demux
51317 pts/0 00:00:00 mpirun
51322 pts/0 00:00:00 mpiexec.hydra
51323 pts/0 00:00:00 pmi_proxy
51327 pts/0 00:00:19 simple
51328 pts/0 00:00:19 simple
51329 pts/0 00:00:19 simple
51330 pts/0 00:00:19 simple
51333 ? 00:00:00 sshd
51334 pts/1 00:00:00 bash
51389 pts/1 00:00:00 ps
jsouthern@r1i3n22 ~/STAT $
jsouthern@r1i3n22 ~/STAT $ stat-cl 51322
STAT started at 2016-05-11-06:56:20
Attaching to job launcher (null):51322 and launching tool daemons...
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 51398 RUNNING AT r1i3n22.ib0.smc-default.americas.sgi.com
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 51398 RUNNING AT r1i3n22.ib0.smc-default.americas.sgi.com
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Aborted
Full output (with "-V" enabled) is shown in this file
So, after building various packages and updating the Launchmon build, it looks like I can now reproduce the original problem.
@jsthrn: Progress!
It is kind of difficult to see where the backend daemons die or whether they have even been launched.
Could you quickly run the configure again with the following config option and rebuild?
--enable-verbose=<log_dir>
If this works (and daemons are indeed launched and failed), running your test should dump some output files into <log_dir>. Could you please post them here.
Also kind of curious who's returning 6 as the exit code.
I ran with --enable-verbose. The stdout file is attached.
It looks to me like there is a problem with my munge install (which presumably isn't what we saw with the release version of Launchmon, as that doesn't use munge!). I will have a look at this and see whether I can work out why the munge.socket.2 file is missing on my system.
By the way, @dongahn we are making progress with getting you access to a test system with our software stack enabled (it will be very old hardware, but that shouldn't be an issue).
So, it turned out that I hadn't started the munge daemon, which explains why that didn't work! Once I do that I get more output - and no exit code 6.
Here are the updated be.stdout and be.stderr files.
These now look more like the errors I was seeing previously, with "proc control initialization failed" error messages.
@dongahn, I am requesting an account for you on a system now. I've already verified that Launchmon (and the rest of the STAT toolchain) builds and runs on the system.
Please let me know your preferred shell (bash, csh, tcsh, ksh, zsh) and I will submit the request.
great. tcsh should work.
Thanks. I submitted the request. Hopefully they should come back to you direct with the logon details. If not then I will forward them to you when I have them.
OK. I looked at the trace and you are much farther along with the munge fix.
Apparently the error is coming out here. And this is because of an error percolating up from the backend's procctl layer here.
Procctl is the layer responsible for normalizing resource manager (RM)-specific synchronization mechanisms between the target MPI job and the tools. RMs implement the MPIR debug interface for this purpose, but how they implement it differs across RMs, so LaunchMON introduced the procctl layer.
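For context, the de facto MPIR Process Acquisition Interface that launchers such as mpiexec.hydra expose and that tools consume looks roughly like this (a simplified sketch, not LaunchMON or Hydra source):

```cpp
// The launcher fills MPIR_proctable after spawning ranks; the tool attaches,
// reads the table, and plants a breakpoint in MPIR_Breakpoint() to be notified
// of state changes.
extern "C" {

typedef struct {
  char *host_name;        // host of this rank; must match a name the back-end
                          // daemon can resolve locally (see the alias issue below)
  char *executable_name;  // path to the MPI application binary
  int   pid;              // pid of the rank process on host_name
} MPIR_PROCDESC;

MPIR_PROCDESC *MPIR_proctable      = 0;
int            MPIR_proctable_size = 0;
volatile int   MPIR_debug_state    = 0;  // e.g. "spawned" once the table is valid
int            MPIR_being_debugged = 0;  // set by the tool while attaching

void MPIR_Breakpoint(void) { /* launcher calls this on every debug-state change */ }

}  // extern "C"
```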
Two things:
I will take a wild guess and add the case statements to help you address 1 first. Once you get past that, you may want to check the feasibility of STAT attaching to a hung job.
Then, let's discuss what needs to be done for 2. This could be as simple as you educating me about hydra's MPI-tool synchronization mechanisms and me choosing the right procctl primitives to adjust LaunchMON to hydra.
@jsthrn: By the way, once this port is all done, it would be nice if you could provide us with your environment. As part of #25, @mcfadden8 wants to investigate how much RM-specific stuff we can integrate into Travis CI (as a separate testing instance), and ideally we want to be able to do this for as many of the RMs that LaunchMON supports as possible.
Does Intel MPI require a license to use?
@dongahn Intel MPI does not require a license to run, just to install. FYI, we do have it locally on LC systems (use impi-5.1.3 or peruse /usr/local/tools/impi-5.1.3).
Cool!
@jsthrn: OK. I pushed the changes to the intel_hydra_prelim branch of my fork. Please fetch and rebase. Let me know if this helps you get past the current failure.
Drat... somehow Travis doesn't like my changes. Let me look.
I need to rebase intel_hydra_prelim onto the current upstream master to pick up .travis.yml.
OK. Travis is happy now.
@dongahn, we have set up an account for you on one of our development machines. I will send the details by email (don't want the password to be visible on the web!).
@dongahn, I just tested your latest version of the code on the test system. Looks like things have moved forward. On a single-node job, STAT daemons attached to the application, obtained its samples and detached successfully.
The stdout file (from --enable-verbose) is here (the stderr file is empty).
For a multi-node job, however, there still seem to be issues. For this, STAT seems to hang just after reporting a completed server handshake (although I don't know whether that is on both nodes or just the local one). The stdout file for that run is here (stderr was empty again).
@dongahn, we have set up an account for you on one of our development machines. I will send the details by email (don't want the password to be visible on the web!).
Great! Thanks.
@dongahn, I just tested your latest version of the code on the test system. Looks like things have moved forward. On a single-node job, STAT daemons attached to the application, obtained its samples and detached successfully.
More progress!
For a multi-node job, however, there still seem to be issues. For this, STAT seems to hang just after reporting a completed server handshake (although I don't know whether that is on both nodes or just the local one).
If the remote one also launched, there should be two stdout files. Do you see both?
In that case, the other one was empty. I thought that I'd run it twice by mistake and that was why there were two files.
BTW, I see lots of
couldn't find an entry with an alias r01n01... trying the next alias
I see these error messages on systems where the launcher-filled (mpiexec.hydra in this case) MPIR_Proctable hostname doesn't match what comes out of gethostname() on a back-end node.
I will have to check, but I think I have logic that parses /etc/hosts to test the match against all of the aliases. In the end, though, we need to see the message "found an entry with an alias", meaning MPIR_Proctable's hostname matches at least one of the aliases, which is a requirement for the back end to succeed.
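To make that matching requirement concrete, this is roughly the kind of alias collection a back end can do before comparing against the MPIR_Proctable entry (a sketch using standard resolver calls, not the actual LaunchMON code):

```cpp
#include <netdb.h>
#include <unistd.h>
#include <cstring>
#include <string>
#include <vector>

// Collect candidate hostname aliases for the local node: the raw gethostname()
// result plus any canonical names the resolver returns for it. A fuller version
// would also scan /etc/hosts for every alias on the matching line(s), which is
// where entries such as "service1" in the log above come from.
std::vector<std::string> collect_local_aliases() {
  std::vector<std::string> aliases;

  char name[256] = {0};
  if (gethostname(name, sizeof(name) - 1) != 0)
    return aliases;
  aliases.push_back(name);                       // e.g. "r01n01"

  struct addrinfo hints, *res = nullptr;
  std::memset(&hints, 0, sizeof(hints));
  hints.ai_flags = AI_CANONNAME;
  if (getaddrinfo(name, nullptr, &hints, &res) == 0) {
    for (struct addrinfo *p = res; p != nullptr; p = p->ai_next)
      if (p->ai_canonname)
        aliases.push_back(p->ai_canonname);      // e.g. "r01n01.smc-default.sgi.com"
    freeaddrinfo(res);
  }
  return aliases;
}
```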
We are probably not out of the woods yet.
@jsthrn:
So, I poked around your system a bit, and I now believe that you can produce a reasonable port for your environment. However, I discovered that there is a system issue you will have to address and that you will need to add some new code to complete an Intel hydra port.
As I suspected above, this system has hostname consistency issues. As you can see from here, the launchmon backend API runtime tries hard to collect as many hostname aliases as possible for the host where it is running.
Despite this, it turns out mpiexec.hydra generates unmatchable backend hostnames for MPIR_Proctable -- they don't match any of these aliases. For example, on the first node, Hydra generates r01n01.ib0.smc-default.sgi.com as the hostname, but the back-end-collected hostname aliases don't include this. The aliases that the backend tried to match are captured in a log file:
couldn't find an entry with an alias r01n01... trying the next alias
couldn't find an entry with an alias 10.148.0.2... trying the next alias
couldn't find an entry with an alias r01n01.smc-default.sgi.com... trying the next alias
couldn't find an entry with an alias service1... trying the next alias
It has r01n01.smc-default.sgi.com but not r01n01.ib0.smc-default.sgi.com.
I have to think this is fixable... I am not sure if you can fix this issue by adding the ib0 alias to /etc/hosts on each remote node, but it seems worth trying. Nevertheless, this is a system issue as opposed to a LaunchMON issue.
In addition, it appears that you will also need to augment the bulk launching string within LaunchMON to adapt it to Hydra's launching options.
As is, the daemon launch string is expanded into something like:
mpiexec.hydra -v -f \
/nas/store/dahn/workspace/launchmon-72933d7/build/test/src/hostnamefn.8839 \
-n 2 /store/dahn/workspace/launchmon-72933d7/build/test/src/be_kicker 10 \
--lmonsharedsec=705078152 --lmonsecchk=22873882
But because of how Hydra works, this will launch both of the tool daemon processes onto the first node specified in hostnamefn.8839. I believe you can overcome this by using the -machine option instead, which takes an explicit machine-to-process-count mapping. But this format isn't something that LaunchMON already supports.
mpiexec.hydra -v -machine \
/nas/store/dahn/workspace/launchmon-72933d7/build/test/src/hostnamefn.8839 \
-n 2 /store/dahn/workspace/launchmon-72933d7/build/test/src/be_kicker 10 \
--lmonsharedsec=705078152 --lmonsecchk=22873882
cat hostnamefn.8839
r01n01:1
r01n02:1
This will require a new launch string option beyond %l, say %m, which would then get expanded into the filename that contains that machine-to-process mapping info.
Some of the relevant code can be found here and here.
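To illustrate what such a %m expansion could look like (purely hypothetical helper names and one-daemon-per-host assumed; the real change belongs in the launch-string expansion code linked above):

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Derive a Hydra-style machinefile ("host:count" per line) from the hosts in
// MPIR_proctable, then substitute its path for a new %m token in RM_launch_str.
std::string write_machinefile(const std::vector<std::string> &proctable_hosts,
                              const std::string &path) {
  std::map<std::string, int> daemons_per_host;
  for (const std::string &h : proctable_hosts)
    daemons_per_host[h] = 1;                     // one back-end daemon per host

  if (std::FILE *fp = std::fopen(path.c_str(), "w")) {
    for (const auto &entry : daemons_per_host)
      std::fprintf(fp, "%s:%d\n", entry.first.c_str(), entry.second);
    std::fclose(fp);
  }
  return path;
}

std::string expand_token(std::string launch_str, const std::string &token,
                         const std::string &value) {
  for (std::string::size_type pos = launch_str.find(token);
       pos != std::string::npos;
       pos = launch_str.find(token, pos + value.size()))
    launch_str.replace(pos, token.size(), value);
  return launch_str;
}

int main() {
  std::vector<std::string> hosts = {"r01n01", "r01n01", "r01n02"};  // one entry per rank
  std::string mf = write_machinefile(hosts, "/tmp/hostnamefn.example");
  // With RM_launch_str="-v -machine %m -n %n %d %o ...", %m expands to the file path:
  std::printf("%s\n",
              expand_token("mpiexec.hydra -v -machine %m -n 2 be_kicker", "%m", mf).c_str());
  return 0;
}
```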
If you create a patch and submit a PR, I will review and merge it.
There will also be miscellaneous work items, like adding Intel Hydra-specific code to the test codes to complete the port. An example is test/src/test.attach_1, which I manually modified:
RM_TYPE=RC_intel_hydra
NUMNODES=1
if test "x$RM_TYPE" = "xRC_bglrm" -o "x$RM_TYPE" = "xRC_bgprm"; then
rm -f nohup.out
fi
NUMTASKS=`expr $NUMNODES \* 16`
WAITAMOUNT=$NUMNODES
if test $NUMNODES -lt 20 ; then
WAITAMOUNT=20
fi
SIGNUM=10
MPI_JOB_LAUNCHER_PATH=/sw/sdev/intel/parallel_studio_xe_2016_update2/impi/5.1.3.181/intel64/bin/mpiexec.hydra
export LMON_LAUNCHMON_ENGINE_PATH=/store/dahn/workspace/stage/bin/launchmon
if test "x/store/dahn/workspace/launchmon-1c5c420/build/workspace/stage" != "x0"; then
export LMON_PREFIX=/store/dahn/workspace/stage
else
export LMON_RM_CONFIG_DIR=0
export LMON_COLOC_UTIL_DIR=0
fi
if test "x$RM_TYPE" = "xRC_slurm" ; then
WAITAMOUNT=`expr $WAITAMOUNT`
$MPI_JOB_LAUNCHER_PATH -n$NUMTASKS -N$NUMNODES -ppdebug `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_bglrm" -o "x$RM_TYPE" = "xRC_bgprm"; then
WAITAMOUNT=`expr $WAITAMOUNT`
nohup $MPI_JOB_LAUNCHER_PATH -verbose 1 -np $NUMTASKS -exe `pwd`/hang_on_SIGUSR1 -cwd `pwd` &
elif test "x$RM_TYPE" = "xRC_bgqrm"; then
WAITAMOUNT=`expr $WAITAMOUNT`
$MPI_JOB_LAUNCHER_PATH --verbose 4 --np $NUMTASKS --exe `pwd`/hang_on_SIGUSR1 --cwd `pwd` --env-all &
elif test "x$RM_TYPE" = "xRC_bgq_slurm"; then
WAITAMOUNT=`expr $WAITAMOUNT`
$MPI_JOB_LAUNCHER_PATH -N$NUMNODES -n $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_alps" ; then
WAITAMOUNT=`expr $WAITAMOUNT`
$MPI_JOB_LAUNCHER_PATH -n $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_orte" ; then
WAITAMOUNT=`expr $WAITAMOUNT`
$MPI_JOB_LAUNCHER_PATH -mca debugger mpirx -np $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_intel_hydra" ; then
WAITAMOUNT=`expr $WAITAMOUNT`
$MPI_JOB_LAUNCHER_PATH -np $NUMTASKS `pwd`/hang_on_SIGUSR1 &
else
echo "This RM is not supported yet"
fi
PID=`echo $!`
sleep $WAITAMOUNT #wait until the job gets stalled
./fe_attach_smoketest $PID `pwd`/be_kicker $SIGNUM
sleep $WAITAMOUNT
Finally, you will also need to add some config m4 scripts to be able to configure and build the test codes for Intel hydra. Please look at m4 files like here and here.
Hope this helps!
Thanks for the very comprehensive instructions! I will try to give this a go, but it might take me a while to get something working.
On the hostname issue, at least for SGI systems the "correct" name (or at least one that will be valid) is always the part before the first dot (e.g. r1i0n0). Would it be possible to trim the hostname returned by Hydra and then use that? I just worry that it will be difficult for individual users to change files like /etc/hostname to include the full alias. Alternatively, since the .ib0. part of the hostname comes from the PBS nodefile, maybe we can parse that before running the application?
There is always a danger if you do the match test based only on the first name component: two different hosts could be matched as identical.
It feels to me that we probably don't want to introduce that as the default match test. But it seems OK if you add this as an additional partial test and only do it if the fully qualified tests all fail?
It would also be nice if we could make such a test a config-time option through platform_compat.
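If it helps frame the discussion, the fallback being suggested might look roughly like this (illustrative only; the function name and the configure-time flag are hypothetical):

```cpp
#include <string>
#include <vector>

// Try exact matches against every locally collected alias first; only if they
// all fail, and only when the (configure-time) partial-match option is enabled,
// compare the part before the first dot, e.g. "r01n03" from
// "r01n03.ib0.smc-default.sgi.com".
static std::string short_name(const std::string &host) {
  return host.substr(0, host.find('.'));
}

bool host_matches(const std::string &proctable_host,
                  const std::vector<std::string> &local_aliases,
                  bool allow_partial_fallback) {
  for (const std::string &alias : local_aliases)   // fully qualified tests
    if (alias == proctable_host)
      return true;

  if (!allow_partial_fallback)
    return false;

  for (const std::string &alias : local_aliases)   // partial fallback
    if (short_name(alias) == short_name(proctable_host))
      return true;
  return false;
}
```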
So, it looks like the hostname issue can be "fixed" by modifying the nodefile created by PBS. While not ideal, this can be done by the user, whereas /etc/hosts is auto-generated for each node on the system by SGI Management Center. This also would not require changes to use (potentially dangerous) partial matches for hostnames in Launchmon.
If I run like this:
cat $PBS_NODEFILE | sed 's/.ib0//g' > nodefile
export PBS_NODEFILE=${PWD}/nodefile
mpirun -n 4 ./simple
Then the log file entries change to:
couldn't find an entry with an alias r01n03... trying the next alias
couldn't find an entry with an alias 10.148.0.4... trying the next alias
found an entry with an alias r01n03.smc-default.sgi.com.
To launch the daemons on the correct nodes, I think that -ppn 1 can be used instead of -machine. This specifies that one process should be launched on each host in the hostfile, which I think is what is required.
By altering the rm_intel_hydra.conf file to use this option I can see the daemons launching on the correct nodes. However, the daemon launched on the remote node does not seem to run properly. The output looks like this:
[proxy:0:0@r01n03] Start PMI_proxy 0
[proxy:0:0@r01n03] STDIN will be redirected to 1 fd(s): 25
[handshake.c:186] - Starting handshake from client
[handshake.c:1125] - Looking up server and client addresses for socket 7
[handshake.c:1156] - Sending sig 845d96c1 on network
[handshake.c:1163] - Receiving sig from network
[handshake.c:308] - Creating outgoing packet for handshake
[handshake.c:319] - Encoded packet: server_port = 34126, client_port = 64470, uid = 48837, gid = 100, session_id = 10, signature = 9b1cc028
[handshake.c:324] - Encrypting outgoing packet
[handshake.c:461] - Server encrypting packet with munge
[handshake.c:548] - Munge encoded packet successfully
[handshake.c:331] - Encrypted packet to buffer of size 212
[handshake.c:1182] - Sending packet size on network
[handshake.c:1190] - Sending packet on network
[handshake.c:1205] - Receiving packet size from network
[handshake.c:1211] - Received packet size 212
[handshake.c:1224] - Received packet from network
[handshake.c:358] - Creating an expected packet
[handshake.c:371] - Decrypting and checking packet
[handshake.c:825] - Decrypting and checking packet with munge
[handshake.c:1071] - Packets compared equal.
[handshake.c:379] - Successfully completed initial handshake
[handshake.c:1094] - Sharing handshake result 0 with peer
[handshake.c:1102] - Reading peer result
[handshake.c:1108] - Peer reported result of 0
[handshake.c:277] - Completed server handshake. Result = 0
[proxy:0:1@r01n04] Start PMI_proxy 1
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 25233 RUNNING AT r01n04.smc-default.sgi.com
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
<May 26 08:26:02> <LMON FE API> (ERROR): Received an invalid LMONP msg: Front-end back-end protocol mismatch? or back-end disconnected?
<May 26 08:26:02> <LMON FE API> (ERROR): A proper msg of {Class(lmonp_febe_security_chk),Type(32767),LMON_payload_size()} is expected.lmonp_fetobe
<May 26 08:26:02> <LMON FE API> (ERROR): A msg of {Class((null)),Type((null)),LMON_payload_size(6361488)} has been received.
<May 26 08:26:02> <STAT_FrontEnd.C: 586> STAT returned error type STAT_LMON_ERROR: Failed to attach to job launcher and spawn daemons
<May 26 08:26:02> <STAT_FrontEnd.C: 442> STAT returned error type STAT_LMON_ERROR: Failed to attach and spawn daemons
<May 26 08:26:02> <STAT.C: 152> STAT returned error type STAT_LMON_ERROR: Failed to launch MRNet tree()
<May 26 08:26:02> <STAT_FrontEnd.C: 3294> STAT returned error type STAT_FILE_ERROR: Output directory not created. Performance results not written.
<May 26 08:26:02> <STAT_FrontEnd.C: 3417> STAT returned error type STAT_FILE_ERROR: Failed to dump performance results
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 25233 RUNNING AT r01n04.smc-default.sgi.com
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Looking at the output file from the remote node, it looks like the problem is with munge:
[handshake.c:180] - Starting handshake from server
[handshake.c:1125] - Looking up server and client addresses for socket 6
[handshake.c:1156] - Sending sig 845d96c1 on network
[handshake.c:1163] - Receiving sig from network
[handshake.c:308] - Creating outgoing packet for handshake
[handshake.c:319] - Encoded packet: server_port = 34126, client_port = 48772, uid = 48837, gid = 100, session_id = 10, signature = 67ad047e
[handshake.c:324] - Encrypting outgoing packet
[handshake.c:461] - Server encrypting packet with munge
ERROR: [handshake.c:541] - Munge failed to encrypt packet with error: Failed to connect to "/store/jsouthern/packages/munge/0.5.12/var/run/munge/munge.socket.2": Connection refused
[handshake.c:327] - Error in server encrypting outgoing packet[handshake.c:1094] - Sharing handshake result -2 with peer
[handshake.c:1102] - Reading peer result
[handshake.c:1108] - Peer reported result of 212
[handshake.c:277] - Completed server handshake. Result = -1
It seems that I can't start munge on more than one node as I get errors like:
jsouthern@r01n04:~/STAT $ /store/jsouthern/packages/munge/0.5.12/etc/init.d/munge start
redirecting to systemctl start .service
Starting MUNGE: munged failed
munged: Error: Found inconsistent state for lock "/store/jsouthern/packages/munge/0.5.12/var/run/munge/munge.socket.2.lock"
Is this something that you have seen before @dongahn? Is there a way to start munged across all nodes at the same time?
So, it looks like the hostname issue can be "fixed" by modifying the nodefile created by PBS. While not ideal, this can be done by the user, whereas /etc/hosts is auto-generated for each node on the system by SGI Management Center. This also would not require changes to use (potentially dangerous) partial matches for hostnames in Launchmon.
Is there any way to make this transparent for the users? Users having to remember this seems like a usability problem.
Is this something that you have seen before @dongahn? Is there a way to start munged across all nodes at the same time?
I actually removed the secure handshake from tools/handshake for my quick validation on your system, so I haven't seen this. You can see the #if 0 macros in the source file under that directory if you have access to my local copy on your system.
Actually, the --enable-sec-none config option should disable the secure handshake for quick testing. But somehow I wasn't able to get this option to work on your system; I only tried it once and didn't spend time looking at what was wrong. It was implemented by @mplegendre, so if you see issues with that option, please send them along.
For quick testing/progress, though, I recommend you manually disable the secure handshake like I did in my local copy.
Yeah, having users make manual alterations to PBS_NODEFILE does seem to be a bit fragile. Long term, I think the solution will be to get the hostname including ib0 into /etc/hosts. But I can see that being a slow process in terms of rolling out the software to do that, especially for existing customers who probably don't update very often. So maybe I do need to go back and look at falling back to a partial match.
I will have a look at --enable-sec-none to disable the secure handshake and get back to you with any progress.
@jsthrn: Thanks James!
It looks like the modified code runs to completion when configured with --enable-sec-none. And I get plots that look like this:
So, I think that is successful... :-)
Very nice, that STAT output looks correct. Good job!
Ditto!
BTW, when you say the modified code, did you mean my local copy with some sections in the handshake src commented out? In theory --enable-sec-none should not require code mods. Did you try this w/o the mods?
The modified code is my local copy, so no sections in the handshake are commented out. The only code modification I have made is to add the -ppn option to etc/rm_intel_hydra.conf (I also run with the modified nodefile as discussed above).
jsouthern@cy013:~/launchmon $ git status
On branch intel_hydra_prelim
Your branch is up-to-date with 'origin/intel_hydra_prelim'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: etc/rm_intel_hydra.conf
no changes added to commit (use "git add" and/or "git commit -a")
jsouthern@cy013:~/launchmon $ git --no-pager diff
diff --git a/etc/rm_intel_hydra.conf b/etc/rm_intel_hydra.conf
index 8fe5248..653f509 100644
--- a/etc/rm_intel_hydra.conf
+++ b/etc/rm_intel_hydra.conf
@@ -51,4 +51,4 @@ RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
RM_launch_helper=mpiexec.hydra
RM_signal_for_kill=SIGINT|SIGINT
RM_fail_detection=true
-RM_launch_str=-v -f %l -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c
+RM_launch_str=-v -f %l -n %n -ppn 1 %d %o --lmonsharedsec=%s --lmonsecchk=%c
@dongahn, I have some commits on the intel_hydra_prelim branch that implement adding SGI hostnames and enabling the use of these via a configure flag.
This completes the port (I think), although not the miscellaneous tests. I'm not sure how to go about submitting a pull request? I'd like to be able to do it by pushing my commits on the branch and then selecting the "Pull Request" option above with the relevant branches. However, I don't seem to have permissions to push to the repository. Is it possible to enable that for me please?
@dongahn, I have been looking at modifying the tests for use with Intel MPI today. It seems like the tests of attaching to a running process work - although I am not 100% sure what the expected output is in some cases - but there is still an error when launching an application via Launchmon (so, e.g. test.launch_1 fails).
The launch tests fail with errors like:
[mpiexec@r01n01] HYDU_parse_hostfile (../../utils/args/args.c:535): unable to open host file: nodelist
So, it looks like mpiexec.hydra is looking for a nodelist (command line argument -f) which is not present.
All my previous work has been looking at attaching to a running process. Is there something obvious in etc/rm_intel_hydra.conf that I can change in order to cause a launch via LaunchMON to not use a nodelist, while still using one when attaching.
jsouthern@cy013:~/launchmon $ cat etc/rm_intel_hydra.conf
## $Header: $
##
## rm_intel_hydra.conf
##
##--------------------------------------------------------------------------------
## Copyright (c) 2008, Lawrence Livermore National Security, LLC. Produced at
## the Lawrence Livermore National Laboratory. Written by Dong H. Ahn <ahn1@llnl.gov>.
## LLNL-CODE-409469. All rights reserved.
##
## This file is part of LaunchMON. For details, see
## https://computing.llnl.gov/?set=resources&page=os_projects
##
## Please also read LICENSE -- Our Notice and GNU Lesser General Public License.
##
##
## This program is free software; you can redistribute it and/or modify it under the
## terms of the GNU General Public License (as published by the Free Software
## Foundation) version 2.1 dated February 1999.
##
## This program is distributed in the hope that it will be useful, but WITHOUT ANY
## WARRANTY; without even the IMPLIED WARRANTY OF MERCHANTABILITY or
## FITNESS FOR A PARTICULAR PURPOSE. See the terms and conditions of the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU Lesser General Public License along
## with this program; if not, write to the Free Software Foundation, Inc., 59 Temple
## Place, Suite 330, Boston, MA 02111-1307 USA
##--------------------------------------------------------------------------------
##
## Update Log:
## May 05 2016 DHA: Created file.
##
##
## RM: the name of Resource Manager
## RM_launcher: the name of the launcher command
## RM_launcher_id: the rule to get the launcher id
## (e.g., RM_launcher|sym|srun says the launcher is identify by testing
## RM_launcher's symbol by the name of srun)
## RM_jobid: the rule to get the target jobid
## (e.g., RM_jobid=RM_launcher|sym|totalview_jobid|string says
## jobid can be obtained from the launcher's symbol, totalview_jobid,
## interpreting that as the string type.
## RM_launcher_helper= method or command to launch daemons
## RM_launch_str= options and arguements used for RM_launch_mth.
##
RM=intel_hydra
RM_MPIR=STD
RM_launcher=mpiexec.hydra
RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
RM_launch_helper=mpiexec.hydra
RM_signal_for_kill=SIGINT|SIGINT
RM_fail_detection=true
RM_launch_str=-f %l -n %n -ppn 1 %d %o --lmonsharedsec=%s --lmonsecchk=%c
@dongahn, I have some commits on the intel_hydra_prelim branch that implement adding SGI hostnames and enabling the use of these via a configure flag.
This completes the port (I think), although not the miscellaneous tests. I'm not sure how to go about submitting a pull request? I'd like to be able to do it by pushing my commits on the branch and then selecting the "Pull Request" option above with the relevant branches. However, I don't seem to have permissions to push to the repository. Is it possible to enable that for me please?
@jsthrn: Sorry for the late response. So I sent you a collaborator request. Upon accepting it, you should have push privileges, I think.
@dongahn, I have been looking at modifying the tests for use with Intel MPI today. It seems like the tests of attaching to a running process work - although I am not 100% sure what the expected output is in some cases - but there is still an error when launching an application via Launchmon (so, e.g. test.launch_1 fails).
The launch tests fail with errors like:
So, it looks like mpiexec.hydra is looking for a nodelist (command line argument -f) which is not present.
All my previous work has been looking at attaching to a running process. Is there something obvious in etc/rm_intel_hydra.conf that I can change in order to cause a launch via LaunchMON to not use a nodelist, while still using one when attaching.
The rm configuration file looks reasonable to me, although you will probably want to test whether sending two consecutive SIGINTs is the right sequence to kill the target job cleanly in Hydra. Different RMs can have different ways to "cleanly" kill the job, and you have to adjust your configuration for Hydra.
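As a concrete reading of that setting (an assumption on my part about what RM_signal_for_kill=SIGINT|SIGINT amounts to, not a statement about LaunchMON internals), killing the Hydra launcher cleanly would look roughly like sending it two consecutive SIGINTs, mirroring pressing Ctrl-C twice:

```cpp
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>

// Illustration only: deliver SIGINT twice to the mpiexec.hydra process, with a
// short pause so the launcher can begin its cleanup before the second signal
// forces the abort ("Press Ctrl-C again to force abort" in Hydra's output).
void kill_hydra_launcher(pid_t launcher_pid) {
  kill(launcher_pid, SIGINT);
  sleep(1);
  kill(launcher_pid, SIGINT);
}
```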
In addition, test.launch_6_engine_failure should allow you to manually test the various failure semantics. The semantics are documented here.
Now, when I tested launch mode for feasibility on your system, I was able to get test.launch_1 to work, so I don't think there is anything fundamentally wrong. At the point where this test is ready to launch the tool daemons, the hostname file should have been generated and -f %l should be expanded into a valid string.
If the complaint about -f comes from the launching string of the target application itself, IOW the MPI application, that's a different story.
The front-end test code (test/src/fe_launch_smoketest.cxx) I used for testing actually used -f nodelist to test whether mpiexec.hydra knows how to launch a job using the manually written nodelist.
Your port shouldn't use that flag. Instead, whatever set of flags you would use to launch an MPI application under an interactive batch allocation should be the ones you type into the front-end test code. Hope this helps...
Thanks. I changed test/src/fe_launch_smoketest.cxx to launch with mpiexec.hydra -n <numprocs>. I think that this is the correct set of flags under an interactive batch allocation (it works for me). The only slight issue might be with cases like test.launch_2_half, where all the MPI processes run on the first (of two) nodes. I am not sure if they are supposed to be split equally between the two.
Pressing <Ctrl-C> twice does seem to be the correct sequence to kill the target cleanly.
I have submitted a pull request containing my changes. I am not sure exactly what the correct behaviour for all of the tests is, but I think that most pass. Issues that I am aware of include:
- test.attach_1_pdebugmax: Runs (and passes the test), but does not terminate (basically keeps printing "APP (INFO): stall for 3 secs" indefinitely).
- test.launch_mw_1_hostlist and test.launch_mw_5_mixall: Complete the initial handshake, but then respond with "cy013.ib0.smc-default.sgi.com: Connection refused" and the tests do not appear to continue (although the application does resume). cy013 is the cluster head node (where compilation occurs, but no MPI processes run). This behaviour is not seen for test.launch_mw_2_coloc, which does pass.
- test.attach_3_*: All fail with output including "<LMON FE API> (ERROR): the launchmon engine encountered an error while parsing its command line." and "<LMON FE API> (ERROR): LMON_fe_acceptEngine failed". However, these look like they may be expected failures.
- test.launch_3_invalid_dmonpath: Also may be an expected failure. The test outputs "<OptionParser> (ERROR): the path[/invalid/be_kicker] does not exit." and then fails.
There is an out-of-band communication effort to port LaunchMON to Intel MPI with the Hydra environment. I created this ticket to capture any significant issues that may arise from that effort.