LLNL / LaunchMON

LaunchMON is a software infrastructure that enables HPC run-time tools to co-locate tool daemons with a parallel job. Its API allows a tool to identify all the remote processes of a job and to scalably launch daemons into the relevant nodes.

Assist SGI to port to Intel MPI with Hydra launcher #14

Closed: dongahn closed this issue 8 years ago

dongahn commented 8 years ago

There is an out-of-band communication about porting LaunchMON to Intel MPI in the Hydra environment. I created this ticket to capture any significant issues that may arise from that effort.

dongahn commented 8 years ago

The hostlist file fix added in PR #18 can help with this environment as well.

dongahn commented 8 years ago

I pushed a commit to my fork to start assisting James Southern (jsouthern@sgi.com) with porting STAT/LaunchMON to Intel Hydra for AWE: the commit is here

dongahn commented 8 years ago

As you can see from here, the LaunchMON backend API expects its options to appear at the end of the command line. So if mpiexec.hydra appends anything else to the backend launch string, LaunchMON will not proceed.
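To make the failure mode concrete, here is a minimal sketch (not LaunchMON's actual parser) of the kind of tail-of-argv check involved:

    #include <cstring>

    // Hypothetical helper: the backend expects the options appended via
    // "--lmonsharedsec=%s --lmonsecchk=%c" to be the last two entries of argv.
    static bool lmon_opts_at_tail(int argc, char **argv) {
      if (argc < 3)
        return false;
      return std::strncmp(argv[argc - 2], "--lmonsharedsec=", 16) == 0
          && std::strncmp(argv[argc - 1], "--lmonsecchk=", 13) == 0;
    }
    // If mpiexec.hydra appends its own --exec-* arguments after these two,
    // a check like this fails and the daemon bails out with the
    // "LaunchMON-specific arguments have not been passed" error shown below.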

I guess that's sort of the case from your email:

I realised that I can get Intel MPI to print its own command line arguments via its “-v” flag, so I can make some progress with debugging what is going on. At the moment, I set the following lines in rm_intel_hydra.conf:

RM=intel_hydra
RM_MPIR=STD
RM_launcher=mpiexec.hydra
RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
RM_jobid=RM_launcher|sym|totalview_jobid|string
RM_launch_helper=mpiexec.hydra
RM_signal_for_kill=SIGINT|SIGINT
RM_fail_detection=true
RM_launch_str=-v -f %l -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c

This results in the following command line for mpiexec.hydra when running LaunchMON:

mpiexec.hydra -f /nas/store/jsouthern/STAT/hostnamefn.30456 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161

I can see that the file hostnamefn. is created in src/linux/sdbg_linux_launchmon.cxx from the proctable, so I guess that these are the places where I need to insert my nodelist. However, there does seem to be an error with the command line. For one, it appears to specify the STATD executable twice. Should this be the case?

When I try to run the mpiexec command myself, it appears that the command line as specified above results in errors (see below). When I remove the second call to STATD, however, there are no errors (although I can’t tell whether or not the daemons attach successfully since the call just waits – presumably for the next part of the LaunchMON code).

jsouthern@r2i7n11 ~/STAT $ mpiexec.hydra -hosts r2i7n11 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161
<May 06 06:06:46> (ERROR): LaunchMON-specific arguments have not been passed to the daemon through the command-line arguments.
<May 06 06:06:46> (ERROR): the command line that the user provided could have been truncated.
^C[mpiexec@r2i7n11] Sending Ctrl-C to processes as requested
[mpiexec@r2i7n11] Press Ctrl-C again to force abort
jsouthern@r2i7n11 ~/STAT $
jsouthern@r2i7n11 ~/STAT $ mpiexec.hydra -hosts r2i7n11 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3
^C[mpiexec@r2i7n11] Sending Ctrl-C to processes as requested
[mpiexec@r2i7n11] Press Ctrl-C again to force abort

stonydon commented 8 years ago

@jsthrn: testing for your GitHub id.

jsthrn commented 8 years ago

The test of my ID worked. I got the email and the link points to my profile.

James

jsthrn commented 8 years ago

I checked out the intel_hydra_prelim branch. Unfortunately I can't get it to build. After updating autotools, I now see the following output:

jsouthern@cy001 ~/launchmon $ CPP="gcc -E -P" CPPFLAGS="-I/store/jsouthern/tmp/install/include -I/store/jsouthern/packages/boost/1.60.0/include" LDFLAGS="-L/store/jsouthern/tmp/install/lib" ./configure --prefix=/store/jsouthern/tmp/install --with-myboost=/store/jsouthern/packages/boost/1.60.0
configure: WARNING: unrecognized options: --with-myboost
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for pkg-config... /store/jsouthern/packages/pkg-config/0.29.1/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking whether to enable maintainer-specific portions of Makefiles... no
checking whether to turn on a workaround for slurm's MPIR_partitial_attach_ok bug... no
checking whether to enable debug codes... no
checking whether to enable verbose codes... no
./configure: line 3950: syntax error near unexpected token `1.2.0,'
./configure: line 3950: `AM_PATH_LIBGCRYPT(1.2.0,'

Is this something that you have seen before? I can see that there was a version of libgcrypt in the tools/ directory previously, but now that is missing. Do I need to install a version elsewhere (and then provide a way for automake to see it)?

jsthrn commented 8 years ago

Regarding mpiexec.hydra appending its own flags to the backend, I can certainly see that could be possible (the "--exec-<>" ones). However, there are also two copies of "/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161" on the command line. One of these is the very last thing, so that would suggest that things should actually be OK.

The full daemon command line (copied from above, but a bit more readable here!) is:

mpiexec.hydra -f /nas/store/jsouthern/STAT/hostnamefn.30456 -n 1 \
    /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 \
    --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 \
    --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD \
    --lmonsharedsec=2082992184 --lmonsecchk=548371161

So, this does have the Launchmon options right at the end as required.

Note that for another application I get the following (which also has two copies of the executable - again with one at the end, so maybe that is correct?):

mpiexec.hydra -v -n 4 ./simple  --exec --exec-appnum 0 --exec-proc-count 4 \
    --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 1 ./simple

dongahn commented 8 years ago

Is this something that you have seen before? I can see that there was a version of libgcrypt in the tools/ directory previously, but now that is missing. Do I need to install a version elsewhere (and then provide a way for automake to see it)?

The bundled gcrypt has been deprecated, as the bundled version was getting old and has given problems to various packaging systems. As long as you have a decent gcrypt package installed on your system, this should be okay.

CPP="gcc -E -P" CPPFLAGS="-I/store/jsouthern/tmp/install/include -I/store/jsouthern/packages/boost/1.60.0/include" LDFLAGS="-L/store/jsouthern/tmp/install/lib" ./configure --prefix=/store/jsouthern/tmp/install --with-myboost=/store/jsouthern/packages/boost/1.60.0 configure: WARNING: unrecognized options: --with-myboost

--with-myboost has also been deprecated, and a version of Boost is now a requirement to build LaunchMON. Can you make sure the following packages are installed on your system? (What Linux distribution are you using?)

What happens if you just run the following once these requirements are satisfied?

% bootstrap
% CPP="gcc -E -P" ./configure --prefix=/store/jsouthern/tmp/install

dongahn commented 8 years ago

./configure: line 3950: syntax error near unexpected token `1.2.0,'
./configure: line 3950: `AM_PATH_LIBGCRYPT(1.2.0,'

Did bootstrap give you any error message about AM_PATH_LIBGCRYPT?

dongahn commented 8 years ago

Regarding mpiexec.hydra appending its own flags to the backend, I can certainly see that could be possible (the "--exec-<>" ones). However, there are also two copies of "/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161" on the command line. One of these is the very last thing, so that would suggest that things should actually be OK.

OK. Thanks. Once you get to the point where you can reproduce the original problem using LaunchMON's own simple test on the new version, let's tease apart this problem as well.

jsthrn commented 8 years ago

So, after building various packages and updating the Launchmon build, it looks like I can now reproduce the original problem. Output (with "-V" switched off in mpiexec.hydra) is:

jsouthern@r1i3n22 ~/STAT $ ps -u jsouthern
  PID TTY          TIME CMD
51257 pts/0    00:00:00 bash
51258 pts/0    00:00:00 pbs_demux
51317 pts/0    00:00:00 mpirun
51322 pts/0    00:00:00 mpiexec.hydra
51323 pts/0    00:00:00 pmi_proxy
51327 pts/0    00:00:19 simple
51328 pts/0    00:00:19 simple
51329 pts/0    00:00:19 simple
51330 pts/0    00:00:19 simple
51333 ?        00:00:00 sshd
51334 pts/1    00:00:00 bash
51389 pts/1    00:00:00 ps
jsouthern@r1i3n22 ~/STAT $
jsouthern@r1i3n22 ~/STAT $ stat-cl 51322
STAT started at 2016-05-11-06:56:20
Attaching to job launcher (null):51322 and launching tool daemons...

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 51398 RUNNING AT r1i3n22.ib0.smc-default.americas.sgi.com
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 51398 RUNNING AT r1i3n22.ib0.smc-default.americas.sgi.com
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Aborted

Full output (with "-V" enabled) is shown in this file

dongahn commented 8 years ago

So, after building various packages and updating the Launchmon build, it looks like I can now reproduce the original problem.

@jsthrn: Progress!

It is kind of difficult to see where the backend daemons die or whether they have even been launched.

Could you quickly run the configure again with the following config option and rebuild?

--enable-verbose=<log_dir>
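For example, appended to the configure invocation used earlier (the log directory path here is just an illustration):

    CPP="gcc -E -P" ./configure --prefix=/store/jsouthern/tmp/install \
        --enable-verbose=/store/jsouthern/tmp/lmon-logs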

If this works (and the daemons are indeed launched and fail), running your test should dump some output files into <log_dir>. Could you please post them here?

I'm also kind of curious who's returning 6 as the exit code.

jsthrn commented 8 years ago

I ran with --enable-verbose. The stdout file is attached.

It looks to me like there is a problem with my munge install (which presumably isn't what we saw with the release version of LaunchMON, as that doesn't use munge!). I will have a look at this and see whether I can work out why the munge.socket.2 file is missing on my system.

By the way, @dongahn we are making progress with getting you access to a test system with our software stack enabled (it will be very old hardware, but that shouldn't be an issue).

jsthrn commented 8 years ago

So, it turned out that I hadn't started the munge daemon, which explains why that didn't work! Once I do that I get more output - and no exit code 6.

Here are the updated be.stdout and be.stderr files.

These now look more like the errors I was seeing previously, with "proc control initialization failed" error messages.

jsthrn commented 8 years ago

@dongahn, I am requesting an account for you on a system now. I've already verified that Launchmon (and the rest of the STAT toolchain) builds and runs on the system.

Please let me know your preferred shell (bash, csh, tcsh, ksh, zsh) and I will submit the request.

dongahn commented 8 years ago

Great. tcsh should work.

jsthrn commented 8 years ago

Thanks. I submitted the request. Hopefully they will come back to you directly with the logon details. If not, I will forward them to you when I have them.

dongahn commented 8 years ago

OK. I looked at the trace and you are much farther along with the munge fix.

Apparently the error is coming out here, and this is because of an error percolating up from the backend's procctl layer here.

Procctl is the layer responsible for normalizing resource manager (RM)-specific synchronization mechanisms between the target MPI job and the tools. RMs implement the MPIR debug interface for this purpose, but how they implement it differs across RMs, so LaunchMON introduced the procctl layer.
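For context, the MPIR Process Acquisition Interface that MPIR-compliant launchers such as mpiexec.hydra export looks roughly like this (a sketch of the standard symbols, not LaunchMON code):

    /* Standard MPIR symbols defined in the launcher process; tools read
       and write them through the debugger interface. */
    typedef struct {
      char *host_name;        /* node on which the MPI process runs */
      char *executable_name;  /* image of the MPI process */
      int   pid;              /* pid of the MPI process */
    } MPIR_PROCDESC;

    MPIR_PROCDESC *MPIR_proctable = 0;     /* filled in by the launcher */
    int MPIR_proctable_size = 0;
    volatile int MPIR_being_debugged = 0;  /* set by the attaching tool */
    volatile int MPIR_debug_state = 0;

    void MPIR_Breakpoint(void) {}  /* tool plants a breakpoint here; the
                                      launcher calls it on state changes */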

Two things:

  1. Your current test case is STAT's attach mode. This is the simpler case to handle in terms of MPI-tool synchronization. In fact, I think the error you are seeing can be addressed by adding a Hydra-specific case to the switch statements across the procctl functions (see the sketch after this list).
  2. To complete the port, however, we will need to address launch mode, which is a bit more complex.
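A sketch of the kind of Hydra-specific case I have in mind for 1 (the enum values and function shape here are illustrative, not the actual LaunchMON code):

    // Illustrative only: each procctl function switches on the RM type,
    // so the port needs an Intel Hydra case in each of them.
    enum rm_type_e { RC_slurm, RC_orte, RC_intel_hydra };

    bool procctl_prepare_attach(rm_type_e rm) {
      switch (rm) {
        case RC_slurm:
          // SLURM-specific MPI-tool synchronization
          return true;
        case RC_intel_hydra:
          // Assumption: attach mode under Hydra can follow the generic
          // MPIR path, so this case mirrors the standard handling.
          return true;
        default:
          return false;  // unsupported RM: the error percolates up from here
      }
    }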

I will take a wild guess and add the case statements to help you address 1 first. Once you get past that, you may want to check the feasibility of STAT attaching to a hung job.

Then let's discuss what needs to be done for 2. This could be as simple as you educating me about Hydra's MPI-tool synchronization mechanisms and me choosing the right procctl primitives to adapt LaunchMON to Hydra.

dongahn commented 8 years ago

@jsthrn: By the way, once this port is all done, it would be nice if you could provide us with your environment. As part of #25, @mcfadden8 wants to investigate how much RM-specific stuff we can integrate into Travis CI (as a separate testing instance), and ideally we want to be able to do this for as many of the RMs that LaunchMON supports as possible.

Does Intel MPI require a license to use?

lee218llnl commented 8 years ago

@dongahn Intel MPI does not require a license to run, just install. FYI, we do have it locally on LC systems (use impi-5.1.3 or peruse /usr/local/tools/impi-5.1.3).

dongahn commented 8 years ago

Cool!

dongahn commented 8 years ago

@jsthrn: OK. I pushed the changes to the intel_hydra_prelim branch of my fork. Please fetch and rebase. Let me know if this helps you pass the current failure.

dongahn commented 8 years ago

Drat... somehow Travis doesn't like my changes. Let me look.

dongahn commented 8 years ago

I need to rebase intel_hydra_prelim onto the current upstream master to pick up .travis.yml.

dongahn commented 8 years ago

OK. Travis is happy now.

jsthrn commented 8 years ago

@dongahn, we have set up an account for you on one of our development machines. I will send the details by email (don't want the password to be visible on the web!).

jsthrn commented 8 years ago

@dongahn, I just tested your latest version of the code on the test system. Looks like things have moved forward. On a single-node job, STAT daemons attached to the application, obtained its samples and detached successfully.

The stdout file (from --enable-verbose) is here (the stderr file is empty).

For a multi-node job, however, there still seem to be issues. For this, STAT seems to hang just after reporting a completed server handshake (although I don't know whether that is on both nodes or just the local one). The stdout file for that run is here (stderr was empty again).

dongahn commented 8 years ago

@dongahn, we have set up an account for you on one of our development machines. I will send the details by email (don't want the password to be visible on the web!).

Great! Thanks.

dongahn commented 8 years ago

@dongahn, I just tested your latest version of the code on the test system. Looks like things have moved forward. On a single-node job, STAT daemons attached to the application, obtained its samples and detached successfully.

More progress!

For a multi-node job, however, there still seem to be issues. For this, STAT seems to hang just after reporting a completed server handshake (although I don't know whether that is on both nodes or just the local one).

If the remote one also launched, there should be two stdout files. Do you see both?

jsthrn commented 8 years ago

In that case, the other one was empty. I thought that I'd run it twice by mistake and that was why there were two files.

dongahn commented 8 years ago

BTW, I see lots of

couldn't find an entry with an alias r01n01... trying the next alias

I see these error messages on a system where the launcher-filled (mpiexec.hydra in this case) MPIR_Proctable hostname doesn't match what comes out of gethostname() on a backend node.

I will have to check, but I think I have logic that parses /etc/hosts to test the match against all of the aliases. In the end, though, we need to see the message

found an entry with an alias

indicating that MPIR_Proctable's hostname matched at least one of the aliases, which is a requirement for the BE to be successful.

We are probably not out of the woods yet.

dongahn commented 8 years ago

@jsthrn:

So, I poked around your system a bit, and I now believe that you can produce a reasonable port for your environment. However, I discovered a system issue that you will have to address, and you will need to add some new code to complete the Intel Hydra port.

As I suspected above, this system has hostname consistency issues. As you can see here, the LaunchMON backend API runtime tries hard to collect as many hostname aliases as possible for the host where it is running.
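Something along these lines (a sketch of the idea, not the actual backend code):

    /* Collect candidate aliases for the local host: start from
       gethostname() and gather the names the resolver (and hence
       /etc/hosts) associates with it. */
    #include <netdb.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      char host[256];
      if (gethostname(host, sizeof(host)) != 0)  /* e.g., "r01n01" */
        return 1;
      struct addrinfo hints = {}, *res = nullptr;
      hints.ai_flags = AI_CANONNAME;
      if (getaddrinfo(host, nullptr, &hints, &res) == 0) {
        for (struct addrinfo *p = res; p != nullptr; p = p->ai_next)
          if (p->ai_canonname)  /* candidate alias to match against
                                   the MPIR_Proctable hostname */
            std::printf("alias: %s\n", p->ai_canonname);
        freeaddrinfo(res);
      }
      return 0;
    }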

Despite this, it turns out that mpiexec.hydra generates unmatchable backend hostnames in MPIR_Proctable: they don't match any of these aliases. For example, on the first node, Hydra generates r01n01.ib0.smc-default.sgi.com as the hostname, but the backend-collected hostname aliases don't include it. The aliases that the backend tried to match are captured in a log file:

couldn't find an entry with an alias r01n01... trying the next alias
couldn't find an entry with an alias 10.148.0.2... trying the next alias
couldn't find an entry with an alias r01n01.smc-default.sgi.com... trying the next alias
couldn't find an entry with an alias service1... trying the next alias

It has r01n01.smc-default.sgi.com but not r01n01.ib0.smc-default.sgi.com.

I have to think this is fixable... I am not sure whether you can fix this issue by adding the ib0 alias to /etc/hosts on each remote node, but it seems worth trying. Nevertheless, this is a system issue as opposed to a LaunchMON issue.
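For example, a hypothetical /etc/hosts entry on each compute node (the address shown is simply the one from the log above; the real ib0 address may differ):

    10.148.0.2   r01n01.ib0.smc-default.sgi.com  r01n01.smc-default.sgi.com  r01n01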

In addition, it appears that you will also need to augment the bulk launch string within LaunchMON to adapt it to Hydra's launching options.

As is, the daemon launch string is expanded into something like:

mpiexec.hydra -v -f \
/nas/store/dahn/workspace/launchmon-72933d7/build/test/src/hostnamefn.8839 \
-n 2 /store/dahn/workspace/launchmon-72933d7/build/test/src/be_kicker 10 \
--lmonsharedsec=705078152 --lmonsecchk=22873882

But because of how Hydra works, this will launch both tool daemon processes onto the first node specified in hostnamefn.8839. I believe you can overcome this by using the -machine option instead, which takes an explicit machine-to-process-count mapping. But that format isn't something LaunchMON already supports.

mpiexec.hydra -v -machine \
/nas/store/dahn/workspace/launchmon-72933d7/build/test/src/hostnamefn.8839 \
-n 2 /store/dahn/workspace/launchmon-72933d7/build/test/src/be_kicker 10 \
--lmonsharedsec=705078152 --lmonsecchk=22873882

cat hostnamefn.8839
r01n01:1
r01n02:1

This will require a new launch string option beyond %l, say %m, which would be expanded into the name of the file containing that machine-to-process mapping info.
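With that in place, the launch string in etc/rm_intel_hydra.conf might look something like this (hypothetical, since %m does not exist yet):

    RM_launch_str=-v -machine %m -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c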

Some of the relevant code can be found here and here.

If you create a patch and submit a PR, I will review and merge it.

There will also be miscellaneous work items, like adding Intel Hydra-specific code to the test codes, to complete the port. Here is an example test/src/test.attach_1 that I manually modified:

RM_TYPE=RC_intel_hydra
NUMNODES=1

if test "x$RM_TYPE" = "xRC_bglrm" -o "x$RM_TYPE" = "xRC_bgprm"; then
  rm -f nohup.out
fi

NUMTASKS=`expr $NUMNODES \* 16`

WAITAMOUNT=$NUMNODES 
if test $NUMNODES -lt 20 ; then 
  WAITAMOUNT=20
fi 

SIGNUM=10
MPI_JOB_LAUNCHER_PATH=/sw/sdev/intel/parallel_studio_xe_2016_update2/impi/5.1.3.181/intel64/bin/mpiexec.hydra
export LMON_LAUNCHMON_ENGINE_PATH=/store/dahn/workspace/stage/bin/launchmon
if test "x/store/dahn/workspace/launchmon-1c5c420/build/workspace/stage" != "x0"; then
    export LMON_PREFIX=/store/dahn/workspace/stage
else
    export LMON_RM_CONFIG_DIR=0
    export LMON_COLOC_UTIL_DIR=0
fi

if test "x$RM_TYPE" = "xRC_slurm" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -n$NUMTASKS -N$NUMNODES -ppdebug `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_bglrm" -o "x$RM_TYPE" = "xRC_bgprm"; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  nohup $MPI_JOB_LAUNCHER_PATH -verbose 1 -np $NUMTASKS -exe `pwd`/hang_on_SIGUSR1 -cwd `pwd` &
elif test "x$RM_TYPE" = "xRC_bgqrm"; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH --verbose 4 --np $NUMTASKS --exe `pwd`/hang_on_SIGUSR1 --cwd `pwd` --env-all &
elif test "x$RM_TYPE" = "xRC_bgq_slurm"; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -N$NUMNODES -n $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_alps" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -n $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_orte" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -mca debugger mpirx -np $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_intel_hydra" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -np $NUMTASKS `pwd`/hang_on_SIGUSR1 &
else
  echo "This RM is not supported yet" 
fi

PID=`echo $!` 

sleep $WAITAMOUNT #wait until the job gets stalled 

./fe_attach_smoketest $PID `pwd`/be_kicker $SIGNUM 

sleep $WAITAMOUNT

Finally, you will also need to add some config m4 scripts to be able to configure and build the test codes for Intel Hydra. Please look at the m4 files here and here.

Hope this helps!

jsthrn commented 8 years ago

Thanks for the very comprehensive instructions! I will try to give this a go, but it might take me a while to get something working.

On the hostname issue, at least for SGI systems the "correct" name (or at least one that will be valid) is always the bit before the first dot (e.g. r1i0n0). Would it be possible to trim the hostname returned by hydra and then use that? I just worry that it will be difficult for individual users to change files like /etc/hostname to include the full alias. Alternatively, since the .ib0. part of the hostname comes from the PBS nodefile, maybe we can parse that before running the application?

dongahn commented 8 years ago

There is always a danger if you do the match test based only on the first name component: two different hosts could be matched as identical.

It feels to me that we probably don't want to introduce that as the default match test. But it seems OK if you add it as an additional partial test and only run it if the fully qualified tests all fail?

It would also be nice if we could make such a test a configure-time option through platform_compat.
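Something like the following sketch, where the partial test is guarded by a hypothetical configure-time flag:

    #include <string>

    // Component before the first '.': "r01n01.ib0.smc..." -> "r01n01"
    static std::string short_name(const std::string &h) {
      return h.substr(0, h.find('.'));
    }

    // Run the fully qualified test first; fall back to the partial test
    // only when enabled (e.g., via a platform_compat configure option).
    bool host_matches(const std::string &proctable_host,
                      const std::string &alias, bool allow_partial) {
      if (proctable_host == alias)
        return true;
      return allow_partial
          && short_name(proctable_host) == short_name(alias);
    }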

jsthrn commented 8 years ago

So, it looks like the hostname issue can be "fixed" by modifying the nodefile created by PBS. While not ideal, this can be done by the user, whereas /etc/hosts is auto-generated for each node on the system by SGI Management Center. This approach also avoids changing LaunchMON to use (potentially dangerous) partial hostname matches.

If I run like this:

cat $PBS_NODEFILE | sed 's/.ib0//g' > nodefile
export PBS_NODEFILE=${PWD}/nodefile
mpirun -n 4 ./simple

Then the log file entries change to:

couldn't find an entry with an alias r01n03... trying the next alias
couldn't find an entry with an alias 10.148.0.4... trying the next alias
found an entry with an alias r01n03.smc-default.sgi.com.

jsthrn commented 8 years ago

To launch the daemons on the correct nodes, I think that -ppn 1 can be used instead of -machine. This specifies that one process should be launched on each host in the hostfile, which I think is what is required.
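With -ppn 1, the daemon launch string from dongahn's example above would expand to something like:

    mpiexec.hydra -v -f \
    /nas/store/dahn/workspace/launchmon-72933d7/build/test/src/hostnamefn.8839 \
    -n 2 -ppn 1 /store/dahn/workspace/launchmon-72933d7/build/test/src/be_kicker 10 \
    --lmonsharedsec=705078152 --lmonsecchk=22873882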

By altering the rm_intel_hydra.conf file to use this option, I can see the daemons launching on the correct nodes. However, the daemon launched on the remote node does not seem to run properly. The output looks like this:

[proxy:0:0@r01n03] Start PMI_proxy 0
[proxy:0:0@r01n03] STDIN will be redirected to 1 fd(s): 25
[handshake.c:186] - Starting handshake from client
[handshake.c:1125] - Looking up server and client addresses for socket 7
[handshake.c:1156] - Sending sig 845d96c1 on network
[handshake.c:1163] - Receiving sig from network
[handshake.c:308] - Creating outgoing packet for handshake
[handshake.c:319] - Encoded packet: server_port = 34126, client_port = 64470, uid = 48837, gid = 100, session_id = 10, signature = 9b1cc028
[handshake.c:324] - Encrypting outgoing packet
[handshake.c:461] - Server encrypting packet with munge
[handshake.c:548] - Munge encoded packet successfully
[handshake.c:331] - Encrypted packet to buffer of size 212
[handshake.c:1182] - Sending packet size on network
[handshake.c:1190] - Sending packet on network
[handshake.c:1205] - Receiving packet size from network
[handshake.c:1211] - Received packet size 212
[handshake.c:1224] - Received packet from network
[handshake.c:358] - Creating an expected packet
[handshake.c:371] - Decrypting and checking packet
[handshake.c:825] - Decrypting and checking packet with munge
[handshake.c:1071] - Packets compared equal.
[handshake.c:379] - Successfully completed initial handshake
[handshake.c:1094] - Sharing handshake result 0 with peer
[handshake.c:1102] - Reading peer result
[handshake.c:1108] - Peer reported result of 0
[handshake.c:277] - Completed server handshake.  Result = 0
[proxy:0:1@r01n04] Start PMI_proxy 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25233 RUNNING AT r01n04.smc-default.sgi.com
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
<May 26 08:26:02> <LMON FE API> (ERROR): Received an invalid LMONP msg: Front-end back-end protocol mismatch? or back-end disconnected?
<May 26 08:26:02> <LMON FE API> (ERROR):   A proper msg of {Class(lmonp_febe_security_chk),Type(32767),LMON_payload_size()} is expected.lmonp_fetobe
<May 26 08:26:02> <LMON FE API> (ERROR):   A msg of {Class((null)),Type((null)),LMON_payload_size(6361488)} has been received.
<May 26 08:26:02> <STAT_FrontEnd.C: 586> STAT returned error type STAT_LMON_ERROR: Failed to attach to job launcher and spawn daemons
<May 26 08:26:02> <STAT_FrontEnd.C: 442> STAT returned error type STAT_LMON_ERROR: Failed to attach and spawn daemons
<May 26 08:26:02> <STAT.C: 152> STAT returned error type STAT_LMON_ERROR: Failed to launch MRNet tree()
<May 26 08:26:02> <STAT_FrontEnd.C: 3294> STAT returned error type STAT_FILE_ERROR: Output directory not created.  Performance results not written.
<May 26 08:26:02> <STAT_FrontEnd.C: 3417> STAT returned error type STAT_FILE_ERROR: Failed to dump performance results

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25233 RUNNING AT r01n04.smc-default.sgi.com
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Looking at the output file from the remote node, it looks like the problem is with munge:

[handshake.c:180] - Starting handshake from server
[handshake.c:1125] - Looking up server and client addresses for socket 6
[handshake.c:1156] - Sending sig 845d96c1 on network
[handshake.c:1163] - Receiving sig from network
[handshake.c:308] - Creating outgoing packet for handshake
[handshake.c:319] - Encoded packet: server_port = 34126, client_port = 48772, uid = 48837, gid = 100, session_id = 10, signature = 67ad047e
[handshake.c:324] - Encrypting outgoing packet
[handshake.c:461] - Server encrypting packet with munge
ERROR: [handshake.c:541] - Munge failed to encrypt packet with error: Failed to connect to "/store/jsouthern/packages/munge/0.5.12/var/run/munge/munge.socket.2": Connection refused
[handshake.c:327] - Error in server encrypting outgoing packet[handshake.c:1094] - Sharing handshake result -2 with peer
[handshake.c:1102] - Reading peer result
[handshake.c:1108] - Peer reported result of 212
[handshake.c:277] - Completed server handshake.  Result = -1

It seems that I can't start munge on more than one node, as I get errors like:

jsouthern@r01n04:~/STAT $ /store/jsouthern/packages/munge/0.5.12/etc/init.d/munge start
redirecting to systemctl start .service
Starting MUNGE: munged                                                           failed
munged: Error: Found inconsistent state for lock "/store/jsouthern/packages/munge/0.5.12/var/run/munge/munge.socket.2.lock"

Is this something that you have seen before @dongahn? Is there a way to start munged across all nodes at the same time?

dongahn commented 8 years ago

So, it looks like the hostname issue can be "fixed" by modifying the nodefile created by PBS. While not ideal, this can be done by the user, whereas /etc/hosts is auto-generated for each node on the system by SGI Management Center. This approach also avoids changing LaunchMON to use (potentially dangerous) partial hostname matches.

Is there any way to make this transparent to users? Users having to remember this seems like a usability problem.
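One way might be a small wrapper around the launcher that rewrites the nodefile before launching (a sketch only, reusing your sed command from above):

    #!/bin/sh
    # Hypothetical wrapper: strip the .ib0 component from the PBS nodefile
    # so MPIR_Proctable hostnames match the backend-collected aliases.
    sed 's/\.ib0//g' "$PBS_NODEFILE" > "$PWD/nodefile"
    export PBS_NODEFILE="$PWD/nodefile"
    exec mpirun "$@"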

dongahn commented 8 years ago

Is this something that you have seen before @dongahn? Is there a way to start munged across all nodes at the same time?

I actually removed the secure handshake from tools/handshake for my quick validation on your system, so I haven't seen this. You can see the #if 0 macros in the source file under that directory if you have access to my local copy on your system.

Actually, the --enable-sec-none config option should disable the secure handshake for quick testing. Somehow I wasn't able to get it to work on your system, but I tried only once and didn't spend time looking at what was wrong. It was implemented by @mplegendre; if you see issues with that option, please pass them along.

For quick testing/progress, though, I recommend you manually disable the secure handshake like I did in my local copy.

jsthrn commented 8 years ago

Yeah, having users make manual alterations to PBS_NODEFILE does seem a bit fragile. Long term, I think the solution will be to get the hostname including ib0 into /etc/hosts. But I can see that being a slow process in terms of rolling out the software to do it, especially for existing customers, who probably don't update very often. So maybe I do need to go back and look at falling back to a partial match.

I will have a look at --enable-sec-none to disable the secure handshake and get back to you with any progress.

dongahn commented 8 years ago

@jsthrn: Thanks James!

jsthrn commented 8 years ago

It looks like the modified code runs to completion when configured with --enable-sec-none. And I get plots that look like this:

00_simple.pdf

So, I think that is successful... :-)

lee218llnl commented 8 years ago

Very nice, that STAT output looks correct. Good job!

dongahn commented 8 years ago

Ditto!

dongahn commented 8 years ago

BTW, when you say the modified code, did you mean my local copy with some sections in the handshake source commented out? In theory, --enable-sec-none should not require code mods. Did you try it without the mods?

jsthrn commented 8 years ago

The modified code is my local copy, so no sections of the handshake are commented out. The only code modification I have made is to add the -ppn option to etc/rm_intel_hydra.conf (I also run with the modified nodefile, as discussed above).

jsouthern@cy013:~/launchmon $ git status
On branch intel_hydra_prelim
Your branch is up-to-date with 'origin/intel_hydra_prelim'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   etc/rm_intel_hydra.conf

no changes added to commit (use "git add" and/or "git commit -a")
jsouthern@cy013:~/launchmon $ git --no-pager diff
diff --git a/etc/rm_intel_hydra.conf b/etc/rm_intel_hydra.conf
index 8fe5248..653f509 100644
--- a/etc/rm_intel_hydra.conf
+++ b/etc/rm_intel_hydra.conf
@@ -51,4 +51,4 @@ RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
 RM_launch_helper=mpiexec.hydra
 RM_signal_for_kill=SIGINT|SIGINT
 RM_fail_detection=true
-RM_launch_str=-v -f %l -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c
+RM_launch_str=-v -f %l -n %n -ppn 1 %d %o --lmonsharedsec=%s --lmonsecchk=%c

jsthrn commented 8 years ago

@dongahn, I have some commits on the intel_hydra_prelim branch that implement adding SGI hostnames and enabling the use of these via a configure flag.

This completes the port (I think), although not the miscellaneous tests. I'm not sure how to go about submitting a pull request. I'd like to do it by pushing my commits on the branch and then selecting the "Pull Request" option above with the relevant branches. However, I don't seem to have permission to push to the repository. Is it possible to enable that for me, please?

jsthrn commented 8 years ago

@dongahn, I have been looking at modifying the tests for use with Intel MPI today. It seems like the tests of attaching to a running process work - although I am not 100% sure what the expected output is in some cases - but there is still an error when launching an application via Launchmon (so, e.g. test.launch_1 fails).

The launch tests fail with errors like:

[mpiexec@r01n01] HYDU_parse_hostfile (../../utils/args/args.c:535): unable to open host file: nodelist

So, it looks like mpiexec.hydra is looking for a nodelist (command line argument -f) which is not present.

All my previous work has been on attaching to a running process. Is there something obvious in etc/rm_intel_hydra.conf that I can change so that a launch via LaunchMON does not use a nodelist, while attaching still does?

jsouthern@cy013:~/launchmon $ cat etc/rm_intel_hydra.conf
## $Header: $
##
## rm_intel_hydra.conf
##
##--------------------------------------------------------------------------------
## Copyright (c) 2008, Lawrence Livermore National Security, LLC. Produced at
## the Lawrence Livermore National Laboratory. Written by Dong H. Ahn <ahn1@llnl.gov>.
## LLNL-CODE-409469. All rights reserved.
##
## This file is part of LaunchMON. For details, see
## https://computing.llnl.gov/?set=resources&page=os_projects
##
## Please also read LICENSE -- Our Notice and GNU Lesser General Public License.
##
##
## This program is free software; you can redistribute it and/or modify it under the
## terms of the GNU General Public License (as published by the Free Software
## Foundation) version 2.1 dated February 1999.
##
## This program is distributed in the hope that it will be useful, but WITHOUT ANY
## WARRANTY; without even the IMPLIED WARRANTY OF MERCHANTABILITY or
## FITNESS FOR A PARTICULAR PURPOSE. See the terms and conditions of the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU Lesser General Public License along
## with this program; if not, write to the Free Software Foundation, Inc., 59 Temple
## Place, Suite 330, Boston, MA 02111-1307 USA
##--------------------------------------------------------------------------------
##
##  Update Log:
##        May 05 2016 DHA: Created file.
##
##
## RM: the name of Resource Manager
## RM_launcher: the name of the launcher command
## RM_launcher_id: the rule to get the launcher id
## (e.g., RM_launcher|sym|srun says the launcher is identify by testing
##        RM_launcher's symbol by the name of srun)
## RM_jobid: the rule to get the target jobid
## (e.g., RM_jobid=RM_launcher|sym|totalview_jobid|string says
##        jobid can be obtained from the launcher's symbol, totalview_jobid,
##        interpreting that as the string type.
## RM_launcher_helper= method or command to launch daemons
## RM_launch_str= options and arguements used for RM_launch_mth.
##

RM=intel_hydra
RM_MPIR=STD
RM_launcher=mpiexec.hydra
RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
RM_launch_helper=mpiexec.hydra
RM_signal_for_kill=SIGINT|SIGINT
RM_fail_detection=true
RM_launch_str=-f %l -n %n -ppn 1 %d %o --lmonsharedsec=%s --lmonsecchk=%c

dongahn commented 8 years ago

@dongahn, I have some commits on the intel_hydra_prelim branch that implement adding SGI hostnames and enabling the use of these via a configure flag.

This completes the port (I think), although not the miscellaneous tests. I'm not sure how to go about submitting a pull request. I'd like to do it by pushing my commits on the branch and then selecting the "Pull Request" option above with the relevant branches. However, I don't seem to have permission to push to the repository. Is it possible to enable that for me, please?

@jsthrn: Sorry for the late response. I sent you a collaborator request. Upon accepting it, you should have push privileges, I think.

dongahn commented 8 years ago

@dongahn, I have been looking at modifying the tests for use with Intel MPI today. It seems like the tests of attaching to a running process work - although I am not 100% sure what the expected output is in some cases - but there is still an error when launching an application via Launchmon (so, e.g. test.launch_1 fails).

The launch tests fail with errors like:

[mpiexec@r01n01] HYDU_parse_hostfile (../../utils/args/args.c:535): unable to open host file: nodelist

So, it looks like mpiexec.hydra is looking for a nodelist (command line argument -f) which is not present.

All my previous work has been on attaching to a running process. Is there something obvious in etc/rm_intel_hydra.conf that I can change so that a launch via LaunchMON does not use a nodelist, while attaching still does?

The RM configuration file looks reasonable to me, although you will probably want to test whether sending two consecutive SIGINTs is the right sequence to kill the target job cleanly under Hydra. Different RMs can have different ways to "cleanly" kill the job, and you have to adjust your configuration for Hydra.

In addition, test.launch_6_engine_failure should allow you to manually test the various failure semantics. The semantics are documented here.

Now, when I tested launch mode for feasibility on your system, I was able to get test.launch_1 to work, so I don't think there is anything fundamentally wrong. At the point where this test is ready to launch the tool daemons, the hostname file should already have been generated, and -f %l should be expanded into a valid string.

If the complaint about -f comes from the launch string of the target application itself (IOW, the MPI application), that's a different story.

The front-end test code (test/src/fe_launch_smoketest.cxx) I used for testing actually used -f nodelist to test whether mpiexec.hydra knows how to launch a job using a manually written nodelist.

Your port shouldn't use that flag. Instead, whatever set of flags you would use to launch an MPI application under an interactive batch allocation are the ones you should put into the front-end test code. Hope this helps...

jsthrn commented 8 years ago

Thanks. I changed test/src/fe_launch_smoketest.cxx to launch with mpiexec.hydra -n <numprocs>. I think that this is the correct set of flags under an interactive batch allocation (it works for me). The only slight issue might be with cases like test.launch_2_half, where all the MPI processes run on the first (of two) nodes. I am not sure if they are supposed to be split equally between the two.

Pressing <Ctrl-C> twice does seem to be the correct sequence to kill the target cleanly.

I have submitted a pull request containing my changes. I am not sure exactly what the correct behaviour for all of the tests is, but I think that most pass. Issues that I am aware of include: