UCL-RITS / rcps-buildscripts

Scripts to automate package builds on RC Platforms
MIT License

STAR-CD: Investigate problems with multi-node jobs #50

Closed: balston closed this issue 7 years ago

balston commented 8 years ago

STAR-CD using Intel MPI works within a single node but doesn't work across nodes on either Legion or Grace.

balston commented 8 years ago

These are the errors you get using the STAR-CD-provided Intel MPI 4.1:

[0] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[2] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[5] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[1] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[3] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[4] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[6] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[7] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[8] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[9] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[10] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[11] MPI startup(): tmi fabric is not available and fallback fabric is not enabled

It should be using the shm:dapl fabric.
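A minimal sketch of what forcing that fabric choice would look like (assumption, not from this thread: the standard Intel MPI I_MPI_FABRICS and I_MPI_FALLBACK environment variables are honoured by the bundled 4.1 runtime):

# Ask for shared memory within a node and DAPL between nodes, and allow
# fallback to another fabric instead of aborting if DAPL is unavailable.
export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=1
star -mpi=intel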

balston commented 8 years ago

We tried modifying the module file to use our Intel MPI from the default modules, but this has had no effect. The log file still suggests it's using 4.1:

PNP: Initialized [2016-06-28-17:00:10] Automatic Sequential Automatic Parallel analyzer.
PNP:   Allocated "node-x02d-017,12 node-x02d-005,12" resource (24 processes).
PNP:   Assigned  "node-x02d-017,12 node-x02d-005,12" resource to STAR solver (24 processes).
PNP:   Activated "/usr/bin/ssh -o StrictHostKeyChecking=no" command for starting tasks on remote nodes.
PNP:   Activated "/usr/bin/scp -o StrictHostKeyChecking=no" command for copying files to/from remote nodes.
PNP: Loading     STAR double precision solver dynamic shared object plug-ins.
PNP: Loading     Intel MPI 4.1 dynamic shared object plug-ins.
PNP: Starting    TRACKER task on "node-x02d-017" and "node-x02d-005" for monitoring host and process failures.
PNP: Spawning    STAR process on multiple nodes.
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
PNP: Shutdown    [2016-06-28-17:00:16] Execution terminated after 5.22 seconds (OVERALL ELAPSED TIME).
heatherkellyucl commented 8 years ago

Brief summary of current position as I understand it:

  1. Using the default MPI (Platform MPI), the job will stop at a random time with "PNP: Shutdown [2016-05-16-12:37:26] Execution terminated after 390.46 seconds (OVERALL ELAPSED TIME)" and only "MPI Application rank 0 killed before MPI_Finalize() with signal 9" in the error file. This happens within a single node as well as across nodes.
  2. Using -mpi=intel will work within one node, but when using multiple nodes the same random stopping problem occurs. This is using the included Intel MPI 4.1.1 - we needed to edit STAR-CD/4.22.058/INTELMPI/4.1.1.036/linux64_2.6-x86_glibc_2.3.4/etc64/tmi.conf to set the psm version to 1.2 rather than 1.1, or you get the error that "tmi fabric is not available and fallback fabric is not enabled".
  3. Setting the fabric to shm:tcp was also randomly stopping. (The correct usual fabric is shm:tmi).

To get it to use our Intel MPI 5, you need to set both INTELMPITOP and INTELMPI to $I_MPI_ROOT, as we don't have an extra architecture directory in between. This didn't help, though.
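As a sketch, with our default Intel MPI module still loaded (so I_MPI_ROOT is set), the override amounts to:

# Point both STAR-CD variables at the same root, since our install has no
# extra architecture directory between them.
export INTELMPITOP=$I_MPI_ROOT
export INTELMPI=$I_MPI_ROOT
star -mpi=intel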

heatherkellyucl commented 8 years ago

More detail:

If you use star -mpi=intel on its own, leaving our default Intel MPI module loaded, the job stops immediately with "tmi fabric is not available and fallback fabric is not enabled" in the .e file.

If you also put a module unload mpi in your script, then with the default PlatformMPI I get:

  FINISH ITERATION NO.    49
  SOLVER: CPU time is      115.69    Elapsed time is      116.54
  TOTAL : CPU time is      126.53    Elapsed time is      138.46
  &&&& --------------------------------------------------------------  --------------------------------------------------------------

                                        **************************************************
                                        * THERE ARE      2 MASTER PROCESSOR REPORTED WARNINGS IN FILE static1.75RNG.info
                                        **************************************************

   *** CALCULATIONS TERMINATED - JOB STOPPED BY USER INTERVENTION
 [ 21 ]  ***Received signal 15 - EXITING
 [ 19 ]  ***Received signal 15 - EXITING
 [ 22 ]  ***Received signal 15 - EXITING
 [ 23 ]  ***Received signal 15 - EXITING
 [ 18 ]  ***Received signal 15 - EXITING
 [ 20 ]  ***Received signal 15 - EXITING
 [ 17 ]  ***Received signal 15 - EXITING
 [ 24 ]  ***Received signal 15 - EXITING
*** PROCESSOR     21 WILL STOP PROCESS NOW
*** PROCESSOR     20 WILL STOP PROCESS NOW
*** PROCESSOR     19 WILL STOP PROCESS NOW
*** PROCESSOR     18 WILL STOP PROCESS NOW
*** PROCESSOR     22 WILL STOP PROCESS NOW
*** PROCESSOR     24 WILL STOP PROCESS NOW
*** PROCESSOR     17 WILL STOP PROCESS NOW
*** PROCESSOR     23 WILL STOP PROCESS NOW
PNP: Received    [2016-07-06-10:49:12] SIGCHLD from HOST "node-u04a-002", TRACKD process-ID "119100".
PNP: Received    [2016-07-06-10:49:14] SIGKILL from HOST "node-u04a-015", REQUESTER process-ID "117807".
PNP: Shutdown    [2016-07-06-10:49:17] Execution stopped due to process failure (SIGCHLD) after 145.82 seconds (OVERALL ELAPSED TIME).

In the .e:

MPI Application rank 1 killed before MPI_Finalize() with signal 9
 [ 12 ]  ***Received signal 15 - EXITING
 [ 13 ]  ***Received signal 15 - EXITING
 [ 8 ]  ***Received signal 15 - EXITING
 [ 11 ]  ***Received signal 15 - EXITING
 [ 16 ]  ***Received signal 15 - EXITING
 [ 9 ]  ***Received signal 15 - EXITING
 [ 10 ]  ***Received signal 15 - EXITING
 [ 15 ]  ***Received signal 15 - EXITING
*** PROCESSOR      8 WILL STOP PROCESS NOW
*** PROCESSOR     14 WILL STOP PROCESS NOW

With star -mpi=intel and module unload mpi, you first get a whole load of address and IB errors; then it starts and hits the same error.

PNP: Initialized [2016-07-05-16:52:59] Automatic Sequential Automatic Parallel analyzer.
PNP:   Allocated "node-x02b-001,12 node-x02b-003,12" resource (24 processes).
PNP:   Assigned  "node-x02b-001,12 node-x02b-003,12" resource to STAR solver (24 processes).
PNP:   Activated "/usr/bin/ssh -o StrictHostKeyChecking=no" command for starting tasks on remote nodes.
PNP:   Activated "/usr/bin/scp -o StrictHostKeyChecking=no" command for copying files to/from remote nodes.
PNP: Loading     STAR double precision solver dynamic shared object plug-ins.
PNP: Loading     Intel MPI 4.1 dynamic shared object plug-ins.
PNP: Starting    TRACKER task on "node-x02b-001" and "node-x02b-003" for monitoring host and process failures.
PNP: Spawning    STAR process on multiple nodes.
node-x02b-003:CMA:3dde:2810ccc0: 68 us(68 us):  open_hca: getaddr_netdev ERROR:Cannot assign requested address. Is ib0 configured?
node-x02b-003:CMA:3dde:2810ccc0: 33 us(33 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
node-x02b-003:CMA:3dde:2810ccc0: 33 us(33 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib2 configured?
node-x02b-003:CMA:3dde:2810ccc0: 33 us(33 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib3 configured?
node-x02b-003:CMA:3ddf:b1a70cc0: 58 us(58 us):  open_hca: getaddr_netdev ERROR:Cannot assign requested address. Is ib0 configured?
...
node-x02b-001:CMA:712c:b0778cc0: 34 us(34 us):  open_hca: getaddr_netdev ERROR:No such device. Is bond0 configured?
node-x02b-001:CMA:712f:5ec55cc0: 34 us(34 us):  open_hca: getaddr_netdev ERROR:No such device. Is bond0 configured?
  RUNNING CASENAME static1.75RNG
License version: 06 February 2013
Required feature version set to 2014.06 or later
Checking license file: 1999@ntsrv1.meng.ucl.ac.uk
Checking license file: /shared/ucl/apps/STAR-CD/4.22.058/license/license.dat
Unable to list features for license file /shared/ucl/apps/STAR-CD/4.22.058/license/license.dat.
Asked for 1 licenses of starpar and got 0
1 copy of starsuite checked out from 1999@ntsrv1.meng.ucl.ac.uk
Feature starsuite expires in 113 days
3 copies of hpcdomains checked out from 1999@ntsrv1.meng.ucl.ac.uk
Feature hpcdomains expires in 113 days
20 copies of starsuite checked out from 1999@ntsrv1.meng.ucl.ac.uk
Feature starsuite expires in 113 days

STAR 4.22.058 [Brent_105]: linux64_2.6-ifort_13.1-glibc_2.5

...

  FINISH ITERATION NO.    14
  SOLVER: CPU time is       90.60    Elapsed time is       91.52
  TOTAL : CPU time is      106.52    Elapsed time is      128.54
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
PNP: Received    [2016-07-05-16:55:19] SIGCHLD from HOST "node-x02b-003", TRACKD process-ID "30150".
PNP: Received    [2016-07-05-16:55:20] SIGKILL from HOST "node-x02b-001", REQUESTER process-ID "28925".
PNP: Shutdown    [2016-07-05-16:55:26] Execution stopped due to process failure (SIGCHLD) after 146.56 seconds (OVERALL ELAPSED TIME).

In the .e:

sh: numactl: command not found
sh: numactl: command not found
...
 [ 13 ]  ***Received signal 2 -  [ 14 ]  ***Received signal 2 - EXITING
 [ 15 ]  ***Received signal 2 - EXITING
 [ 16 ]  ***Received signal 2 - EXITING
 [ 17 ]  ***Received signal 2 - EXITING
 [ 18 ]  ***Received signal 2 - EXITING
 [ 19 ]  ***Received signal 2 - EXITING
 [ 20 ]  ***Received signal 2 - EXITING
 [ 21 ]  ***Received signal 2 - EXITING
 [ 24 ]  ***Received signal 2 - EXITING
EXITING
 [ 22 ]  ***Received signal 2 - EXITING
*** PROCESSOR     13 WILL STOP PROCESS NOW
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 12
heatherkellyucl commented 8 years ago

I also looked back through the ticket history for Star-CD. We fixed the ping setuid bit issue in Dec 2015, and I ran static1.75RNG.ccmg on Legion across 2 X nodes (24 cores, job 426926). It did 1211 iterations before hitting the 1hr wallclock limit I gave it. So PlatformMPI appeared to be working at this point. I did still have a module unload mpi in my jobscript.

We were under the impression that parallel jobs were working at that point, until April 2016 when the Lustre issues involved a lot of reimaging and updating. The setuid bit error recurred because of the reimaging - that was fixed, and then the MPI killed error started to happen. This was still during the Lustre issues, and more updates happened afterwards.

heatherkellyucl commented 8 years ago

Sent the above to the contact at CD-adapco; currently getting an out-of-office response.

heatherkellyucl commented 8 years ago

On request, going to test running it in serial with star only to see if it works then. I am asking for 12hrs wallclock to begin with, and have module unload mpi in the script. (I'll also test with it loaded, but that shouldn't affect the serial test).

heatherkellyucl commented 8 years ago

You still need to specify #$ -pe mpi 1 for a serial job, or else it complains that it can't find a machinefile.
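For reference, a minimal serial jobscript sketch (the wallclock value and module version are illustrative assumptions; the -pe mpi 1 line and module unload mpi are what this thread uses):

#!/bin/bash -l
#$ -l h_rt=12:00:00
#$ -pe mpi 1                 # still required, or star can't find a machinefile
module unload mpi
module load star-cd/4.22.058
star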

heatherkellyucl commented 8 years ago

The serial job got to 1134 iterations in the 12hrs before it ran out of time.

heatherkellyucl commented 8 years ago

Going to continue with star -restart and see if it completes in another 12hrs.

balston commented 8 years ago

We have a completed run of the test simulation using Intel MPI within a single node (8 cores) from the 23rd May. This converged after 7281 iterations. Here is the beginning of the output file:

PNP: Initialized [2016-05-23-11:19:11] Automatic Sequential Automatic Parallel analyzer.
PNP:   Allocated "node-u05a-002,8" resource (8 processes).
PNP:   Assigned  "node-u05a-002,8" resource to STAR solver (8 processes).
PNP:   Activated "/usr/bin/ssh -o StrictHostKeyChecking=no" command for starting tasks on remote nodes.
PNP:   Activated "/usr/bin/scp -o StrictHostKeyChecking=no" command for copying files to/from remote nodes.
PNP: Loading     STAR double precision solver dynamic shared object plug-ins.
PNP: Loading     Intel MPI 4.1 dynamic shared object plug-ins.
PNP: Starting    TRACKER task on "node-u05a-002" for monitoring host and process failures.
PNP: Spawning    STAR process on single node.
  RUNNING CASENAME static1.75RNG
License version: 06 February 2013
Required feature version set to 2014.06 or later
Checking license file: 1999@ntsrv1.meng.ucl.ac.uk
Checking license file: /shared/ucl/apps/STAR-CD/4.22.058/license/license.dat
Unable to list features for license file /shared/ucl/apps/STAR-CD/4.22.058/license/license.dat.
Asked for 1 licenses of starpar and got 0
1 copy of starsuite checked out from 1999@ntsrv1.meng.ucl.ac.uk
Feature starsuite expires in 156 days
7 copies of starsuite checked out from 1999@ntsrv1.meng.ucl.ac.uk
Feature starsuite expires in 156 days

STAR 4.22.058 [Brent_105]: linux64_2.6-ifort_13.1-glibc_2.5

  *** Computational domain already partitioned with requested parameters ***
      Partition information read from file 

 *** Reading geometry ***
     Sorting cells
     Sorting faces
  *** Read geometry complete (written by pro-STAR  4.22.008) ***

                                      |-------------------------------------------|
                                      |          STAR-CD VERSION 4.22.058         |
                                      |         THERMOFLUIDS ANALYSIS CODE        |
                                      |     Operating System:           Linux     |
                                      | Stardate: 23-MAY-2016  Startime: 11:19:17 |
                                      |-------------------------------------------|

                        |-----------------------------------------------------------------------|
                        |                          S T A R   -   H P C                          |
                        |                      HIGH PERFORMANCE COMPUTING                       |
                        |                   MESSAGE PASSING PARALLEL VERSION                    |
                        |                                                                       |
                        |  COMMUNICATIONS:                                       Intel MPI 4.1  |
                        |  NUMBER OF PROCESSES:                                              8  |
                        |  DOMAIN DECOMPOSITION METHOD:                                  METIS  |
                        |  VERTEX NUMBERING COMPRESSION:                                    ON  |
                        |  INPUT/OUTPUT MODE:                              MASTER PROCESS ONLY  |
                        |-----------------------------------------------------------------------|

                              |-----------------------------------------------------------|
                              | STAR Copyright (C) 1988-2015, Computational Dynamics Ltd. |
                              | Proprietary  data  ---  Unauthorized  use,  distribution, |
                              | or  duplication  is  prohibited.   All  rights  reserved. |
                              |-----------------------------------------------------------|

              |-------------------------------------------------------------------------------------------|
              |  ---------------------------- PROBLEM SPECIFICATION SUMMARY ----------------------------  |
              |-------------------------------------------------------------------------------------------|
              |  CASE TITLE .................. =>                                                         |
              |  NUMBER OF CELLS ............. =>    853353                                               |
              |  MESH DIMENSIONS                      XMIN     XMAX     YMIN     YMAX     ZMIN     ZMAX   |
              |       (IN METRES) ............ =>  -1.5E-01  4.5E-01 -1.5E-01  1.5E-01 -3.0E-01  1.4E-01  |
              |  MESH QUALITY ................ =>                                                         |
              |    Expansion factor .......... =>     Aver =   1.24, Max =  27.40 CVs:   849514,   855493 |
              |    Non-orthogonality (deg).... =>     Aver =  12.40, Max = 136.20 CVs:   808049,   849567 |
              |    Cells with large concavity  =>          39                                             |
              |  RUN PRECISION ............... =>     Double                                              |
              |  STEADY ANALYSIS ..............=>     START FROM ITERATION =        0                     |
              |  SOLUTION PROCEDURE .......... =>     SIMPLE                                              |
              |  RESIDUAL TOLERANCE .......... =>     1.00E-06                                            |
              |  MAX. NO. OF ITERATIONS ...... =>    10000                                                |
              |  INTERMEDIATE RESTART DATA ... =>     WILL BE SAVED IN static1.75RNG.ccmp                 |
              |                                       (EVERY N=100 ITERATIONS)                            |
              |  BACKED UP RESTART DATA ...... =>     WILL BE SAVED IN static1.75RNG.ccmp_N               |
              |                                       (EVERY N=500 ITERATIONS)                            |
              |  LAST RESTART DATA ........... =>     WILL BE SAVED IN static1.75RNG.ccmp                 |
              |  RESTART DATA PRECISION....... =>     DOUBLE                                              |
              |  SURFACE DATA ................ =>     WILL NOT BE SAVED                                   |
              |  CONVERGENCE DATA ............ =>     WILL BE PRINTED IN static1.75RNG.info               |
              |  FIELD DATA .................. =>     WILL NOT BE PRINTED                                 |
              |  LIN. ALG. EQU. SOLVER ....... =>     Conjugate gradient with Incompl. Cholesky precond.  |
              |-------------------------------------------------------------------------------------------|
              |-> DOMAIN  1 (FLUID) : AIR   --------------------------------------------------------------|
              |-------------------------------------------------------------------------------------------|
              |  SOLVE ....................... =>   U,  V,  W,  P, TE, ED,  T,                            |
              |                                    (STATIC ENTHALPY, THERMAL FORM TRANSPORTED)            |
              |  FLUID FLOW .................. =>   TURBULENT COMPRESSIBLE                                |
              |  TURBULENCE MODEL ............ =>   K-EPS RNG MODEL                                       |
              |    CONSTANTS ................. =>     C_mu=0.09, C_1=1.42, C_2=1.68, C_3=1.42, C_4=-0.39  |
              |                                =>     cappa=0.400, Pr_k=0.72, Pr_eps=0.719, Pr=0.90       |
              |                                =>     beta=0.012, eta_0=4.380                             |
              |  REFERENCE PRESSURE .......... =>   PREF = 1.000E+05 Pa                                   |
              |  REFERENCE TEMPERATURE ....... =>   TREF = 2.730E+02 K                                    |
              |  DENSITY ..................... =>   IDEAL GAS: MOLW = 2.896E+01                           |
              |  MOLECULAR VISCOSITY ......... =>   CONSTANT -    MU = 1.810E-05 Pas                      |
              |  SPECIFIC HEAT ............... =>   CONSTANT -     C = 1.006E+03 J/kgK                    |
              |  CONDUCTIVITY ................ =>   CONSTANT -     K = 2.637E-02 W/mK                     |
              |  TURBULENT PRANDTL NUMBER..... =>   PRTUR = 9.000E-01                                     |
              |  RELA. FAC. FOR INIT. FLUXES . =>   REL.FAC. = 1.000E+00                                  |
              |  INITIAL FIELD VALUES ........ =>      u        v        w      Omega      p        T     |
              |                                =>   0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 2.93E+02 |
              |                                =>    Tur.In. Len.Sc.                                      |
              |                                =>   2.58E-02 1.00E-01                                     |
              |  BOUNDARY CONDITIONS ......... =>                                                         |
              |  Reg.    1  Constant   piezomet. pressure: P = 0.000E+00                                  |
              |                    U = 0.000E+00 V = 0.000E+00 W = 0.000E+00 Om = 0.000E+00 in C.Sys.   1 |
              |                  k and epsilon        Zero gradient                                       |
              |                  T = 3.000E+02                                                            |
              |  Reg.    2  Stagn. Inl.: Pstag = 5.000E+03 Tstag = 2.880E+02                              |
              |                  Absolute velocity normal to the boundary                                 |
              |                  TI = 3.900E-02 TLS = 1.500E-02                                           |
              |  Reg.    3  Wall:  U = 0.000E+00 V = 0.000E+00 W = 0.000E+00 Om = 0.000E+00 in C.Sys.   1 |
              |                  Elog = 9.000E+00                                                         |
              |                  Thermal condition:  Adiabatic                                            |
              |-------------------------------------------------------------------------------------------|
              |-> ADDITIONAL FEATURES  -------------------------------------------------------------------|
              |-------------------------------------------------------------------------------------------|
              |  RAMFILES OPTION ENABLED                                                                  |
              |-------------------------------------------------------------------------------------------|
              |-> SOLUTION CONTROL PARAMETERS                                                             |
              |-------------------------------------------------------------------------------------------|
              |  EQUATION   |      Mome        Mass        Ener        Turb         --          --        |
              |-------------------------------------------------------------------------------------------|
              |  RELA. FAC. |    4.000E-01   2.000E-01   7.000E-01   4.000E-01      --          --        |
              |  DIFF. SCH. |      MARS         CD         MARS        MARS         --          --        |
              |  DSCH. FAC. |    5.000E-01   1.000E-02   5.000E-01   5.000E-01      --          --        |
              |  SOLV. TOL. |    1.000E-04   5.000E-05   1.000E-04   1.000E-04      --          --        |
              |  SWEEP LIM. |      1000       10000        1000        1000         --          --        |
              |-------------------------------------------------------------------------------------------|

   Iter. I--------------- GLOBAL ABSOLUTE RESIDUAL ------------------I  I-------- FIELD VALUES AT MONITORING POINT       1 ----------I
    No    Mome     Mass     Ener     Turb      --       --       --       Vel      Pres     Temp    TurVis     --       --       --      
     1  1.00E+00 1.00E+00 5.31E-03 1.00E+00    --       --       --     3.84E-02 1.51E-02 2.93E+02 3.20E-04    --       --       --      
  FINISH ITERATION NO.     1
  SOLVER: CPU time is        6.38    Elapsed time is       12.54
  TOTAL : CPU time is        9.03    Elapsed time is       21.22
     2  5.56E-01 1.00E+00 1.24E-03 6.83E-01    --       --       --     5.35E-02 1.48E-02 2.93E+02 3.20E-04    --       --       --      

and here is the end:

  FINISH ITERATION NO.  7281
  SOLVER: CPU time is    32135.12    Elapsed time is    32597.36
  TOTAL : CPU time is    32890.87    Elapsed time is    33598.63
  &&&& --------------------------------------------------------------  --------------------------------------------------------------

                                        **************************************************
                                        * THERE ARE      2 MASTER PROCESSOR REPORTED WARNINGS IN FILE static1.75RNG.info
                                        **************************************************

   *** CALCULATIONS TERMINATED - CONVERGENCE CRITERION SATISFIED

  END OF EXECUTION - STAR 
  SOLVER: CPU time is    32135.12    Elapsed time is    32597.36
  TOTAL : CPU time is    32891.72    Elapsed time is    33601.18

PNP: Shutdown    [2016-05-23-20:39:18] Execution completed after 33606.84 seconds (OVERALL ELAPSED TIME).
heatherkellyucl commented 8 years ago

I have a 24-core default MPI job queued to see if strace -f -ostraceall${JOB_ID}.trace star works (ltrace needs to be attached to the running process as it can't follow star's startup scripts; strace may not need to be).

If I do need to attach them manually, it may take a while to get a pair of nodes with qrsh as the machine is busy.

heatherkellyucl commented 8 years ago

I have a pile of strace output for one run with PlatformMPI and one with IntelMPI. You do need to attach strace to a running process: if you try to run star under strace from the start, you get this and the program exits.

PNP: ***ERROR*** No response from HOST "node-x02b-011" or invalid hostname.
PNP:         ==> Please check that you are specifying your nodes correctly.

I attached strace to what pstree shows as the main PID (shows as python). The output begins by showing a wait, and I think all the rest of the output you get occurs when the shutdown begins.

I am also going to try switching on MPI debugging for both runs, to see if we can find out about differences in the setup they use.

For Intel MPI: star -mpi=intel -mppflags="-genv I_MPI_DEBUG 3"

For PlatformMPI this might be useful: https://www.ibm.com/support/knowledgecenter/SSF4ZA_9.1.4/pmpi_guide/debug_envars_mpi.html It is for a newer version as there aren't any docs for 8.3 on IBM's site. (Also check in the star directory to see if any of the PlatformMPI PDFs in there are useful).

heatherkellyucl commented 8 years ago

For the Intel MPI case, running star -mpi=intel -mppflags="-genv I_MPI_DEBUG 9" shows why it was getting the open_hca: getaddr_netdev errors. (I've only included the errors from MPI rank 0 below, as they are repeated for each rank.)

[0] MPI startup(): Intel(R) MPI Library, Version 4.1 Update 1  Build 20130522
[0] MPI startup(): Copyright (C) 2003-2013 Intel Corporation.  All rights reserved.
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib3
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-bond
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma
[0] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): cannot open dynamic library libdat.so
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma-1
[0] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma-2
[0] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma-3
[0] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-bond
[0] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): Found 1 IB devices
[0] MPI startup(): Open 0 IB device: qib0
[0] MPI startup(): Start 1 ports per adapter
[0] MPID_nem_ofacm_init(): Init
[0] MPI startup(): shm and ofa data transfer modes
[0] MPI startup(): Device_reset_idx=0
[0] MPI startup(): Allgather: 2: 0-1071 & 0-12
[0] MPI startup(): Allgather: 3: 1071-27498 & 0-12

- more MPI startup commands here -

[0] MPI startup(): Rank    Pid      Node name      Pin cpu
[0] MPI startup(): 0       7644     node-x02c-005  0
[0] MPI startup(): 1       7645     node-x02c-005  1
[0] MPI startup(): 2       7646     node-x02c-005  2
[0] MPI startup(): 3       7647     node-x02c-005  3
[0] MPI startup(): 4       7648     node-x02c-005  4
[0] MPI startup(): 5       7649     node-x02c-005  5
[0] MPI startup(): 6       7650     node-x02c-005  6
[0] MPI startup(): 7       7651     node-x02c-005  7
[0] MPI startup(): 8       7652     node-x02c-005  8
[0] MPI startup(): 9       7653     node-x02c-005  9
[0] MPI startup(): 10      7654     node-x02c-005  10
[0] MPI startup(): 11      7655     node-x02c-005  11
[0] MPI startup(): 12      7993     node-x02c-006  0
[0] MPI startup(): 13      7994     node-x02c-006  1
[0] MPI startup(): 14      7995     node-x02c-006  2
[0] MPI startup(): 15      7996     node-x02c-006  3
[0] MPI startup(): 16      7997     node-x02c-006  4
[0] MPI startup(): 17      7998     node-x02c-006  5
[0] MPI startup(): 18      7999     node-x02c-006  6
[0] MPI startup(): 19      8000     node-x02c-006  7
[0] MPI startup(): 20      8001     node-x02c-006  8
[0] MPI startup(): 21      8002     node-x02c-006  9
[0] MPI startup(): 22      8003     node-x02c-006  10
[0] MPI startup(): 23      8004     node-x02c-006  11
[0] MPI startup(): Recognition=2 Platform(code=4 ippn=4 dev=8) Fabric(intra=1 inter=2 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=9
[0] MPI startup(): I_MPI_INFO_BRAND=Intel(R) Xeon(R) 
[0] MPI startup(): I_MPI_INFO_CACHE1=0,1,2,8,9,10,16,17,18,24,25,26
[0] MPI startup(): I_MPI_INFO_CACHE2=0,1,2,8,9,10,16,17,18,24,25,26
[0] MPI startup(): I_MPI_INFO_CACHE3=0,0,0,0,0,0,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_CACHES=3
[0] MPI startup(): I_MPI_INFO_CACHE_SHARE=2,2,32
[0] MPI startup(): I_MPI_INFO_CACHE_SIZE=32768,262144,12582912
[0] MPI startup(): I_MPI_INFO_CORE=0,1,2,8,9,10,0,1,2,8,9,10
[0] MPI startup(): I_MPI_INFO_C_NAME=Westmere-EP
[0] MPI startup(): I_MPI_INFO_DESC=1342182930
[0] MPI startup(): I_MPI_INFO_FLGB=0
[0] MPI startup(): I_MPI_INFO_FLGC=43967487
[0] MPI startup(): I_MPI_INFO_FLGD=-1075053569
[0] MPI startup(): I_MPI_INFO_LCPU=12
[0] MPI startup(): I_MPI_INFO_MODE=263
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=qib0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1
[0] MPI startup(): I_MPI_INFO_PACK=0,0,0,0,0,0,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_SERIAL=X5650  
[0] MPI startup(): I_MPI_INFO_SIGN=132802
[0] MPI startup(): I_MPI_INFO_STATE=0
[0] MPI startup(): I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0,0,0,0,0
[0] MPI startup(): I_MPI_INFO_VEND=1
[0] MPI startup(): I_MPI_PIN_INFO=0
[0] MPI startup(): I_MPI_PIN_MAPPING=12:0 0,1 1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11 11

Then there is the normal STAR-CD output (RUNNING CASENAME static1.75RNG, etc.) until it dies.

heatherkellyucl commented 8 years ago

That says the Intel MPI run is using shm:ofa as the fabric, not shm:tmi. That means it is using OFED verbs and not TMI for communications between nodes. I could try export I_MPI_FABRICS=shm:tmi and see what errors that gives.

heatherkellyucl commented 8 years ago

For the PlatformMPI run, I ran as

export MPIRUN_OPTIONS="-v -d"
star

The output didn't add anything useful:

debug 1, pretend 0, verbose 1
job 0, check 0, tv=0, mpirun_instr ???
remsh = ssh
appfile = starmpp.cfg
Main socket port 52346
Parsing application description...
Identifying hosts...
Spawning processes...
IBM Platform MPI licensed for CD-adapco.
Process layout for world 0 is as follows:
mpirun:  proc 23677
  daemon proc 23680 on host 10.143.16.12
    rank 0:  proc 23705
    rank 1:  proc 23706
    rank 2:  proc 23707
    rank 3:  proc 23708
    rank 4:  proc 23709
    rank 5:  proc 23710
    rank 6:  proc 23711
    rank 7:  proc 23712
    rank 8:  proc 23713
    rank 9:  proc 23714
    rank 10:  proc 23715
    rank 11:  proc 23716
  daemon proc 15069 on host 10.143.16.17
    rank 12:  proc 15091
    rank 13:  proc 15092
    rank 14:  proc 15093
    rank 15:  proc 15094
    rank 16:  proc 15095
    rank 17:  proc 15096
    rank 18:  proc 15097
    rank 19:  proc 15098
    rank 20:  proc 15099
    rank 21:  proc 15100
    rank 22:  proc 15101
    rank 23:  proc 15102

I'd tried with the -i option for mpirun instrumentation, but it failed with an mpid internal error.

mpid: PATH=/shared/ucl/apps/STAR-CD/4.22.058/sbin:/shared/ucl/apps/STAR-CD/4.22.058/sbin:/shared/ucl/apps/STAR-CD/4.22.058/bin:/shared/ucl/apps/cluster-scripts:/shared/ucl/sysops/bin:/shared/ucl/apps/rcops_scripts:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/bin/intel64:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/bin/intel64_mic:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/debugger/gdb/intel64_mic/bin:/shared/ucl/apps/mrxvt/0.5.4/bin:/shared/ucl/apps/tmux/2.2/gnu-4.9.2/bin:/shared/ucl/apps/emacs/24.5/gnu-4.9.2/bin:/shared/ucl/apps/giflib/5.1.1/gnu-4.9.2/bin:/shared/ucl/apps/dos2unix/7.3/gnu-4.9.2/bin:/shared/ucl/apps/NEdit/5.6-Aug15/bin:/shared/ucl/apps/nano/2.4.2/gnu-4.9.2//bin:/shared/ucl/apps/GERun:/shared/ucl/apps/screen/4.2.1/bin:/shared/ucl/apps/subversion/1.8.13/bin:/shared/ucl/apps/apr-util/1.5.4/bin:/shared/ucl/apps/apr/1.5.2/bin:/shared/ucl/apps/git/2.3.5/gnu-4.9.2/bin:/shared/ucl/apps/flex/2.5.39/gnu-4.9.2/bin:/shared/ucl/apps/cmake/3.2.1/gnu-4.9.2/bin:/shared/ucl/apps/gcc/4.9.2/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/tmpdir/active/1478833.1.Kermit:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/cceahke/bin
mpid: PWD=/scratch/scratch/cceahke/output/starcd/test-star-multi
mpid: Internal Error: execvp failed: Cannot execute LD_LIBRARY_PATH=/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi:/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64:/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.4:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/ipp/lib/intel64:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mpirt/lib/intel64:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mkl/lib/intel64:/shared/ucl/apps/giflib/5.1.1/gnu-4.9.2/lib:/shared/ucl/apps/subversion/1.8.13/lib:/shared/ucl/apps/apr-util/1.5.4/lib:/shared/ucl/apps/apr/1.5.2/lib:/shared/ucl/apps/git/2.3.5/gnu-4.9.2/lib64:/shared/ucl/apps/flex/2.5.39/gnu-4.9.2/lib:/shared/ucl/apps/gcc/4.9.2/lib:/shared/ucl/apps/gcc/4.9.2/lib64: No such file or directory
MPI Application rank 0 exited before MPI_Init() with status 1
mpirun: Broken pipe
mpirun: propagating signal 13

The SPMD command it lists when trying to do that looks wrong:

debug 1, pretend 0, verbose 1
job 0, check 0, tv=0, mpirun_instr -e
remsh = ssh
SPMD cmd: LD_LIBRARY_PATH=/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi:/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64:/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.4:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/ipp/lib/intel64:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/compiler/lib/intel64:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mpirt/lib/intel64:/shared/ucl/apps/intel/2015/composer_xe_2015.2.164/mkl/lib/intel64:/shared/ucl/apps/giflib/5.1.1/gnu-4.9.2/lib:/shared/ucl/apps/subversion/1.8.13/lib:/shared/ucl/apps/apr-util/1.5.4/lib:/shared/ucl/apps/apr/1.5.2/lib:/shared/ucl/apps/git/2.3.5/gnu-4.9.2/lib64:/shared/ucl/apps/flex/2.5.39/gnu-4.9.2/lib:/shared/ucl/apps/gcc/4.9.2/lib:/shared/ucl/apps/gcc/4.9.2/lib64 -e STARBOOT=/scratch/scratch/cceahke/output/starcd/test-star-multi -f starmpp.cfg
Main socket port 41297
Temporary appfile: /tmpdir/job/1478833.undefined/mpiafT72m3X
Building LocalHost file/block scheduled...nodeCnt == 0
Parsing application description...
Identifying hosts...
Spawning processes...
PNP: Shutdown    [2016-07-20-15:19:51] Execution terminated after 5.18 seconds (OVERALL ELAPSED TIME).

Now trying with a couple more environment variables set in the hope of getting more information.

heatherkellyucl commented 8 years ago

For PlatformMPI I ran again with

export MPIRUN_OPTIONS="-v -d"
export MPI_ERROR_LEVEL=2
export MPI_COLL_FCA_VERBOSE=9
star

This added a backtrace to the end, but didn't add any further setup info.

   *** CALCULATIONS TERMINATED - JOB STOPPED BY USER INTERVENTION
 [ 22 ]  ***Received signal 15 - EXITING
 [ 21 ]  ***Received signal 15 - EXITING
 [ 23 ]  ***Received signal 15 - EXITING
 [ 17 ]  ***Received signal 15 - EXITING
 [ 18 ]  ***Received signal 15 - EXITING
 [ 19 ]  ***Received signal 15 - EXITING
 [ 20 ]  ***Received signal 15 - EXITING
 [ 24 ]  ***Received signal 15 - EXITING
*** PROCESSOR     23 WILL STOP PROCESS NOW
*** PROCESSOR     19 WILL STOP PROCESS NOW
*** PROCESSOR     21 WILL STOP PROCESS NOW
*** PROCESSOR     18 WILL STOP PROCESS NOW
*** PROCESSOR     20 WILL STOP PROCESS NOW
*** PROCESSOR     24 WILL STOP PROCESS NOW
*** PROCESSOR     17 WILL STOP PROCESS NOW
*** PROCESSOR     22 WILL STOP PROCESS NOW

star:111108 terminated with signal 6 at PC=2ba2a8cb65f7 SP=7fff6ba6da78.  Backtrace:

star:111112 terminated with signal 6 at PC=2addb86bc5f7 SP=7fff08a32f38.  Backtrace:

star:111106 terminated with signal 6 at PC=2b3c99ba75f7 SP=7fffeade0cf8.  Backtrace:

star:111109 terminated with signal 6 at PC=2b6f1081f5f7 SP=7fff8dd197f8.  Backtrace:

star:111111 terminated with signal 6 at PC=2afa3bbf65f7 SP=7fff41bbc7b8.  Backtrace:

star:111105 terminated with signal 6 at PC=2b0d156c45f7 SP=7fffd21624b8.  Backtrace:

star:111107 terminated with signal 6 at PC=2b618e5315f7 SP=7fff4ef34178.  Backtrace:

star:111110 terminated with signal 6 at PC=2b2f4e50b5f7 SP=7fffc1939fb8.  Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x2b0d156c45f7]
/lib64/libc.so.6(gsignal+0x37)[0x2addb86bc5f7]
/lib64/libc.so.6(gsignal+0x37)[0x2b618e5315f7]
/lib64/libc.so.6(gsignal+0x37)[0x2ba2a8cb65f7]
/lib64/libc.so.6(gsignal+0x37)[0x2b6f1081f5f7]
/lib64/libc.so.6(gsignal+0x37)[0x2b3c99ba75f7]
/lib64/libc.so.6(gsignal+0x37)[0x2b2f4e50b5f7]
/lib64/libc.so.6(abort+0x148)[0x2b0d156c5ce8]
/lib64/libc.so.6(abort+0x148)[0x2b618e532ce8]
/lib64/libc.so.6(abort+0x148)[0x2ba2a8cb7ce8]
/lib64/libc.so.6(abort+0x148)[0x2addb86bdce8]
/lib64/libc.so.6(abort+0x148)[0x2b6f10820ce8]
/lib64/libc.so.6(abort+0x148)[0x2b3c99ba8ce8]
/lib64/libc.so.6(abort+0x148)[0x2b2f4e50cce8]
/lib64/libc.so.6(gsignal+0x37)[0x2afa3bbf65f7]
/lib64/libc.so.6(abort+0x148)[0x2afa3bbf7ce8]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2b618f7bb9f2]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2b0d1694e9f2]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2b2f4f7959f2]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2ba2a9f409f2]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2b3c9ae319f2]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2addb99469f2]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2b6f11aa99f2]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2b618bccb2cd]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2b0d12e5e2cd]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2b2f4bca52cd]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2ba2a64502cd]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2b3c973412cd]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2addb5e562cd]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2b6f0dfb92cd]
/shared/ucl/apps/STAR-CD/4.22.058/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4/lib/linux_amd64/libmpi.so(+0xdf9f2)[0x2afa3ce809f2]
/shared/ucl/apps/STAR-CD/4.22.058/STAR/4.22.058/linux64_2.6-ifort_13.1-glibc_2.5/lib/ibm_mpi/libstarmpp.so(mpp_abort_+0x1d)[0x2afa393902cd]
PNP: Received    [2016-07-20-16:10:26] SIGCHLD from HOST "node-u06a-019", TRACKD process-ID "98489".
PNP: Received    [2016-07-20-16:10:27] SIGKILL from HOST "node-u06a-018", REQUESTER process-ID "97292".
PNP: Shutdown    [2016-07-20-16:10:35] Execution stopped due to process failure (SIGCHLD) after 151.97 seconds (OVERALL ELAPSED TIME).
heatherkellyucl commented 8 years ago

With Intel MPI:

export I_MPI_FABRICS="shm:tmi"
star -mpi=intel -mppflags="-genv I_MPI_DEBUG 10"

In the error output:

/etc/tmi.conf: No such file or directory
[16] MPI startup(): tmi fabric is not available and fallback fabric is not enabled

etc

It is looking for a central tmi.conf and not the one that is in $STARDIR/INTELMPI/4.1.1.036/linux64_2.6-x86_glibc_2.3.4/etc64/tmi.conf

Try export TMI_CONFIG=$STARDIR/INTELMPI/4.1.1.036/linux64_2.6-x86_glibc_2.3.4/etc64/tmi.conf

heatherkellyucl commented 8 years ago

Intel MPI:

export I_MPI_FABRICS="shm:tmi"
export TMI_CONFIG=$STARDIR/INTELMPI/4.1.1.036/linux64_2.6-x86_glibc_2.3.4/etc64/tmi.conf
export TMI_DEBUG=1
star -mpi=intel -mppflags="-genv I_MPI_DEBUG 100"

TMI config is still wrong.

[0] MPI startup(): Intel(R) MPI Library, Version 4.1 Update 1  Build 20130522
[0] MPI startup(): Copyright (C) 2003-2013 Intel Corporation.  All rights reserved.
[0] MPID_nem_tmi_init(): pg=0x780d40, pg_rank=0
[0] init_tmi_library(): TMI lib version is 1.1 
init_provider_list: using configuration file: /shared/ucl/apps/STAR-CD/4.22.058/INTELMPI/4.1.1.036/linux64_2.6-x86_glibc_2.3.4/etc64/tmi.conf
init_provider_list: valid configuration line: psm 1.2 libtmip_psm.so " "
init_provider_list: valid configuration line: mx 1.0 libtmip_mx.so " "
[0] MPI startup(): cannot load default tmi provider
PNP: Shutdown    [2016-07-21-10:55:16] Execution terminated after 5.24 seconds (OVERALL ELAPSED TIME).
heatherkellyucl commented 8 years ago

I now have it back in the same state where it exits randomly after a short time, but this time it's definitely using the tmi fabric for between-node communications with Intel MPI. The tmi.conf was originally correct in specifying libtmip_psm.so version 1.1 (the confusion was that the entry may have been specifying a PSM version instead). What was missing was setting TMI_CONFIG so the file could be found.

[0] MPI startup(): Intel(R) MPI Library, Version 4.1 Update 1  Build 20130522
[0] MPI startup(): Copyright (C) 2003-2013 Intel Corporation.  All rights reserved.
[0] MPID_nem_tmi_init(): pg=0xb1fd40, pg_rank=0
[0] init_tmi_library(): TMI lib version is 1.1 
init_provider_list: using configuration file: /shared/ucl/apps/STAR-CD/4.22.058/INTELMPI/4.1.1.036/linux64_2.6-x86_glibc_2.3.4/etc64/tmi.conf
init_provider_list: valid configuration line: psm 1.1 libtmip_psm.so " "
init_provider_list: valid configuration line: mx 1.0 libtmip_mx.so " "
tmi_psm_init: tmi_psm_timeout=120
init_provider_lib: using provider: psm, version 1.1
[0] init_tmi_provider(): TMI provider initialized: psm (1.1)
[0] init_tmi(): job_id = 8f0a0f16000125e6
[0] init_tmi(): tmi_ep_open options: TMI_PSM_JID=8f0a0f16000125e6, TMI_TNIC_NPEERS=24
[0] init_tmi(): tmi_ep_open returns 0
[0] init_tmi(): tmi_ep_get_addr returns 0
[0] init_tmi(): tmi_ep_get_addr: 24-00-00-00-00-00-00-00-03-05-24-00-00-00-00-00, size=16
[0] init_tmi(): signature_mask: ffef
[0] init_tmi(): local_endpoint_signature: 3002429
[0] MPI startup(): shm and tmi data transfer modes
[0] MPID_nem_tmi_vc_init(): vc=0xb216c0
[0] MPID_nem_tmi_vc_init(): eager_max_msg_size=-1
[0] MPID_nem_tmi_vc_connect(): calling tmi_connect from (24-00-00-00-00-00-00-00-03-05-24-00-00-00-00-00) to (26-00-00-00-00-00-00-00-03-09-26-00-00-00-00-00). ptr remote_endpoint_context=0xb23f58 MPID_nem_tmi_local_endpoint=0xb27b30
[0] MPID_nem_tmi_vc_connect(): ep check: name=psm, ep_connect=0x2b38df8b8134
[0] MPID_nem_tmi_vc_connect(): tmi_connect returns 0
[0] MPID_nem_tmi_vc_connect(): TMI errstr:Success
[0] MPID_nem_tmi_vc_connect(): vc=0xb216c0, state=1, connection_state=2
[0] MPID_nem_tmi_vc_init(): vc=0xb21840
[0] MPID_nem_tmi_vc_init(): eager_max_msg_size=-1

[0] MPI startup(): Recognition mode: 2, selected platform: 16 own platform: 16
[0] MPI startup(): Device_reset_idx=11
[0] MPI startup(): Allgather: 3: 0-0 & 0-16
[0] MPI startup(): Allgather: 1: 1-2549 & 0-16
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-16
[0] MPI startup(): Allgather: 3: 0-0 & 17-32
[0] MPI startup(): Allgather: 1: 1-90 & 17-32

[0] MPI startup(): Rank    Pid      Node name      Pin cpu
[0] MPI startup(): 0       75238    node-u06a-015  0
[0] MPI startup(): 1       75239    node-u06a-015  1
[0] MPI startup(): 2       75240    node-u06a-015  2
[0] MPI startup(): 3       75241    node-u06a-015  3
[0] MPI startup(): 4       75242    node-u06a-015  4
[0] MPI startup(): 5       75243    node-u06a-015  5
[0] MPI startup(): 6       75244    node-u06a-015  6
[0] MPI startup(): 7       75245    node-u06a-015  7
[0] MPI startup(): 8       75246    node-u06a-015  8
[0] MPI startup(): 9       75247    node-u06a-015  9
[0] MPI startup(): 10      75248    node-u06a-015  10
[0] MPI startup(): 11      75249    node-u06a-015  11
[0] MPI startup(): 12      75250    node-u06a-015  12
[0] MPI startup(): 13      75251    node-u06a-015  13
[0] MPI startup(): 14      75253    node-u06a-015  14
[0] MPI startup(): 15      75254    node-u06a-015  15
[0] MPI startup(): 16      116888   node-u06a-016  8
[0] MPI startup(): 17      116889   node-u06a-016  9
[0] MPI startup(): 18      116890   node-u06a-016  10
[0] MPI startup(): 19      116891   node-u06a-016  11
[0] MPI startup(): 20      116892   node-u06a-016  12
[0] MPI startup(): 21      116893   node-u06a-016  13
[0] MPI startup(): 22      116894   node-u06a-016  14
[0] MPI startup(): 23      116895   node-u06a-016  15
[0] MPI startup(): Recognition=2 Platform(code=16 ippn=4 dev=13) Fabric(intra=1 inter=3 flags=0x0)
[0] MPI startup(): Topology split mode = 1

| rank | node | space=2
|  0  |  0  |
|  1  |  0  |
|  2  |  0  |
|  3  |  0  |
|  4  |  0  |
|  5  |  0  |
|  6  |  0  |
|  7  |  0  |
|  8  |  0  |
|  9  |  0  |
|  10  |  0  |
|  11  |  0  |
|  12  |  0  |
|  13  |  0  |
|  14  |  0  |
|  15  |  0  |
|  16  |  1  |
|  17  |  1  |
|  18  |  1  |
|  19  |  1  |
|  20  |  1  |
|  21  |  1  |
|  22  |  1  |
|  23  |  1  |
[0] MPI startup(): I_MPI_DEBUG=100
[0] MPI startup(): I_MPI_FABRICS=shm:tmi
[0] MPI startup(): I_MPI_INFO_BRAND=Intel(R) Xeon(R) 
[0] MPI startup(): I_MPI_INFO_CACHE1=0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23
[0] MPI startup(): I_MPI_INFO_CACHE2=0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23
[0] MPI startup(): I_MPI_INFO_CACHE3=0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_CACHES=3
[0] MPI startup(): I_MPI_INFO_CACHE_SHARE=2,2,32
[0] MPI startup(): I_MPI_INFO_CACHE_SIZE=32768,262144,20971520
[0] MPI startup(): I_MPI_INFO_CORE=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
[0] MPI startup(): I_MPI_INFO_C_NAME=Unknown
[0] MPI startup(): I_MPI_INFO_DESC=1342177285
[0] MPI startup(): I_MPI_INFO_FLGB=641
[0] MPI startup(): I_MPI_INFO_FLGC=2143216639
[0] MPI startup(): I_MPI_INFO_FLGD=-1075053569
[0] MPI startup(): I_MPI_INFO_LCPU=16
[0] MPI startup(): I_MPI_INFO_MODE=263
[0] MPI startup(): I_MPI_INFO_PACK=0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_SERIAL=E5-2650 v2 
[0] MPI startup(): I_MPI_INFO_SIGN=198372
[0] MPI startup(): I_MPI_INFO_STATE=0
[0] MPI startup(): I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
[0] MPI startup(): I_MPI_INFO_VEND=1
[0] MPI startup(): I_MPI_PIN_INFO=0
[0] MPI startup(): I_MPI_PIN_MAPPING=16:0 0,1 1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11 11,12 12,13 13,14 14,15 15

...

  FINISH ITERATION NO.    51
  SOLVER: CPU time is      117.42    Elapsed time is      118.90
  TOTAL : CPU time is      127.14    Elapsed time is      135.66
  &&&& --------------------------------------------------------------  --------------------------------------------------------------

                                        **************************************************
                                        * THERE ARE      2 MASTER PROCESSOR REPORTED WARNINGS IN FILE static1.75RNG.info
                                        **************************************************

   *** CALCULATIONS TERMINATED - JOB STOPPED BY USER INTERVENTION
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
PNP: Received    [2016-07-21-12:29:12] SIGCHLD from HOST "node-u06a-016", TRACKD process-ID "76046".
PNP: Received    [2016-07-21-12:29:13] SIGKILL from HOST "node-u06a-015", REQUESTER process-ID "75196".
PNP: Shutdown    [2016-07-21-12:29:19] Execution stopped due to process failure (SIGCHLD) after 146.71 seconds (OVERALL ELAPSED TIME).
heatherkellyucl commented 8 years ago

4.26.011 made no difference. CD-adapco have asked for our test case, the output of the MPI-debug-logged runs, and details of how we ran them to be uploaded to their FTP server; they will run with the same flags and see if the difference in output tells us anything.

I am putting these together for a default MPI and an Intel MPI run using Star-CD 4.26.011.

I did rediscover why we don't source setstar - it tries to add a public key to the user's ~/.ssh/authorized_keys, which is obviously not writeable from the compute nodes.

source /shared/ucl/apps/STAR-CD/4.26.011/etc/setstar
SGE 8.1.8 resource manager detected
*** Using STARNET plugin "$STARNET/lib/starnet.gridengine" ***

STAR-NET 3.00.024 interface initialized for dynamic node allocation.

PNP_JOBMANAGER  = SGE 8.1.8
PNP_JOBRMINIT   = . /opt/sge/default/common/settings.sh
PNP_JOBRESOURCE = qlc-X
PNP_JOBNODES    = node-x02c-018,12
PNP_JOBID       = 1597203
PNP_JOBPGID     = 3562
PNP_JOBDIR      = /home/cceahke/Scratch/output/starcd/test-star-4.26

Max Time (CPU)  = unlimited seconds
Max Stack size  = unlimited KB
Max Data size   = unlimited KB
Max Memory size = unlimited KB
Max File size   = unlimited blocks

 STARDIR:            /shared/ucl/apps/STAR-CD/4.26.011
 STARINI:            Default
 STARFLAGS:          -nodefile /tmpdir/job/1597203.undefined/machines -scratch=/tmpdir/job/1597203.undefined
 CDLMD_LICENSE_FILE: 1999@ntsrv1.meng.ucl.ac.uk
 LM_LICENSE_FILE:    1999@ntsrv1.meng.ucl.ac.uk

 STARPLUGIN_DARSCFD: $STARDIR/DARSCFD/2.05.009
 STARPLUGIN_DARSTABLE: $STARDIR/DARSTABLE/4.26.001
 STARPLUGIN_DARSTIF: $STARDIR/DARSTIF/2.08.015
 STARPLUGIN_ICE:     $STARDIR/STARICE/4.26.014
 STARPLUGIN_SOOTNOX: $STARDIR/SOOTNOX/1.04.007
 STARPLUGIN_WAVE:    $STARDIR/WAVE/2.14.006

 IBMMPI:             $STARDIR/IBMMPI/8.3.0.2isv/linux64_2.6-x86-glibc_2.3.4
 ICE:                $STARDIR/ICE/4.26.014/linux64_2.6-x86-glibc_2.5.0-gcc_4.4.3-ifort_11.0
 INTELMPI:           $STARDIR/INTELMPI/4.1.1.036/linux64_2.6-x86_glibc_2.3.4
 OPENMPI:            $STARDIR/OPENMPI/1.6.2/linux64_2.6-gcc_3.4.6-glibc_2.3.4
 PCMPI:              $STARDIR/PCMPI/8.1.1.0/linux64_2.6-x86-glibc_2.3.4
 PROSTAR:            $STARDIR/PROSTAR/4.26.001/linux64_2.6-x86-glibc_2.3.4-gcc_4.4.3-ifort_11.0
 STAR:               $STARDIR/STAR/4.26.011/linux64_2.6-ifort_16.0-glibc_2.5
 STARCCMP:           $STARDIR/STARCCMP/STAR-CCM+11.02.010/star
 STARCDHEEDS:        $STARDIR/HEEDS/Ver2016.04
 STARCDMAN:          $STARDIR/STARCDMAN/4.26.007/generic
 STARDATA:           $STARDIR/STARDATA/3.01.011/generic
 STARNET:            $STARDIR/STARNET/3.00.024/generic

Adding public key to "$HOME/.ssh/authorized_keys"...
/shared/ucl/apps/STAR-CD/4.26.011/etc/starenv: line 28: /home/cceahke/.ssh/authorized_keys: Permission denied
heatherkellyucl commented 8 years ago

Uploaded a tar to their FTP containing the test case, the output from an IBM MPI run and an Intel MPI run (both on 24 cores and with MPI debugging on), the jobscripts for those runs, and a copy of our module file so they can see what environment variables we set.

heatherkellyucl commented 8 years ago

The case was escalated: we were asked to run $STARDIR/bin/detect, and I ran it on an X node and a U node - it reports on the hardware and software.

We were also supplied with a simpler test case called simpleSprayBox - I ran it on 24 cores across two U nodes and it also died, after 431 iterations and 143.32s of execution.

heatherkellyucl commented 8 years ago

Now running simpleSprayBox on 24 cores with the -notracker option to see if that helps.

STAR-CD normally runs two background processes, which we call trackers, to detect STAR process or node failures. In the past, we had cases where ping and/or ssh did not work properly, causing the tracker to kill the jobs. The "-notracker" option disables these two monitoring processes.

heatherkellyucl commented 8 years ago

star -notracker worked! The simpleSprayBox test ran across 2 X nodes and completed in 1486.24s. I checked that it was definitely running 12 processes on both nodes.
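For the record, a sketch of the multi-node workaround jobscript (the wallclock and core count are illustrative assumptions; module unload mpi, the star-cd module and -notracker are as used in this thread):

#!/bin/bash -l
#$ -l h_rt=2:00:00
#$ -pe mpi 24                # e.g. 24 cores across two 12-core X nodes
module unload mpi
module load star-cd/4.26.011
star -notracker              # disable the tracker processes that were killing multi-node runs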

heatherkellyucl commented 8 years ago

The original user's static1.75RNG example has also been running, approaching 2hrs and counting.

Need to ltrace the tracker process to find out why it kills the runs.

heatherkellyucl commented 8 years ago

When there is tracking, on the head node there are two processes with trackd in the name:

  39945 cceahke    20   0  134M  8172  2528 S  0.0  0.0  0:00.04 ├─ /shared/ucl/apps/STAR-CD/4.26.011/PYTHON/3.3.5/linux64_2.6-x86-gcc_3.4.6-glibc_2.3.4/bin/python3.3 /shared/ucl/apps/STAR-CD/4.26.011/STAR/4.26.011/linux64_2.6-ifort_16.0-glibc_2.5/bin/syslib.py /shared/ucl/apps/STAR-CD/4.26.011/STAR/4.26.011/linux64_2.6-ifort_16.0-glibc_2.5/bin/star -trackd /scratch/scratch/cceahke/output/starcd/test-star-4.26 node-u05a-013
  39946 cceahke    20   0  144M 10616  2656 S  0.0  0.0  0:00.13 │  └─ /shared/ucl/apps/STAR-CD/4.26.011/PYTHON/3.3.5/linux64_2.6-x86-gcc_3.4.6-glibc_2.3.4/bin/python3.3 /shared/ucl/apps/STAR-CD/4.26.011/STAR/4.26.011/linux64_2.6-ifort_16.0-glibc_2.5/bin/star -trackd /scratch/scratch/cceahke/output/starcd/test-star-4.26 node-u05a-013
  41122 cceahke    20   0 87180  3788  2836 S  0.0  0.0  0:00.01 │     └─ /usr/bin/ssh -o StrictHostKeyChecking=no node-u05a-008 /shared/ucl/apps/STAR-CD/4.26.011/etc/starenv -run Default / star -ps
  41141 cceahke    20   0 77872  6844  2668 S  0.0  0.0  0:00.00 │        └─ qrsh -V -inherit node-u05a-008 /opt/geassist/bin/sshorig
  41150 cceahke    20   0 50020  2920  2204 S  0.0  0.0  0:00.00 │           └─ /usr/bin/nc node-u05a-008 55997

If I trace the initial one, there is only output when the kill happens, with no indication of what caused it.

ltrace -S --demangle -f -o star-24-trackerhead_${JOB_ID}.ltrace -s 128 -p <pid>:

55364 SYS_read(5, "PNP: Sending     SIGKILL to terminate STAR immediately.\n", 131072)                 = 56
55364 malloc(16)                                                                                       = 0x26559d0
55364 free(0x26559d0)                                                                                  = <void>
55364 __errno_location()                                                                               = 0x2afc7b826c00
55364 read(5 <unfinished ...>
55364 SYS_read(5, "", 126976)                                                                          = 0
55364 --- SIGCHLD (Child exited) ---
55364 <... read resumed> , "", 126976)                                                                 = 0
55364 realloc(0x27ef8f0, 89)                                                                           = 0x27ef8f0
55364 sem_post(0x2738d90, 0, 33, 0x27ef950)                                                            = 0
55364 free(0x27ef8f0)                                                                                  = <void>
55364 malloc(16)                                                                                       = 0x2741ad0
55364 free(0x2741ad0)                                                                                  = <void>
55364 sem_trywait(0x2738d90, 0, 0, 0)                                                                  = 0

...

strace -f -ostar-24-trackerhead_${JOB_ID}.strace -p <pid> (entire log):

45966 read(5, "PNP: Sending     SIGKILL to term"..., 131072) = 56
45966 read(5, "", 126976)               = 0
45966 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=45967, si_uid=74700, si_status=1, si_utime=9, si_stime=5} ---
45966 read(5, "", 131072)               = 0
45966 wait4(45967, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 45967
45966 close(5)                          = 0
45966 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x2adbcfe4e100}, {0x4b71f0, [], SA_RESTORER, 0x2adbcfe4e100}, 8) = 0
45966 exit_group(0)                     = ?
45966 +++ exited with 0 +++

Then I did a couple of runs that tried to trace the child process, but both exited with a different error: PNP: Received [2016-08-17-16:44:16] SIGPIPE from HOST "node-u05a-005", SLAVE host (Too many "ping" failures from HOST "node-u05a-004"). This is not what we were getting before. (And unlike the previous error, this one kills the qrsh session, so I can't rerun to check whether it fails the same way with no trace; I need to wait for another qrsh to be scheduled.)

heatherkellyucl commented 8 years ago

The ping error is being caused by strace and ltrace -S this time: they can't trace ping when not run as root in the current setup.

(I got a pair of nodes, ran star with no tracing, got the normal error, then traced with ltrace -S and got the ping error. Incidentally, when star dies it leaves the trackd processes behind, showing as sleeping, so I kill those manually before starting another run when working interactively.)

heatherkellyucl commented 8 years ago

I tried ltracing without following child processes, which is not very illuminating.

It has some illegal seeks, and some attempts to kill processes that no longer exist.

118293 sem_init(0x1befd50, 0, 1, 0)                                                             = 0
118293 lseek64(5, 0, 1, 0 <unfinished ...>
118293 SYS_lseek(5, 0, 1)                                                                       = -29
118293 <... lseek64 resumed> )                                                                  = -1
118293 __errno_location()                                                                       = 0x2b89f1ef4c00
118293 strerror(29)                                                                             = "Illegal seek"
118293 strlen("Illegal seek")                                                                   = 12
118293 strlen("Illegal seek")                                                                   = 12
118293 mbstowcs(0, 0x2b89f2b82736, 0, 0xffffd4760d47d906)                                       = 12
118293 SYS_read(5, "/bin/bash: line 0: kill: (39423) - No such process\n", 4096)                = 51
118293 <... read resumed> , "/bin/bash: line 0: kill: (39423) - No such process\n", 4096)       = 51
118293 memcpy(0x2b89fab02100, "/bin/bash: line 0: kill: (39423) - No such process\n", 51)       = 0x2b89fab02100
118293 realloc(0, 32 <no return ...>
118293 --- SIGCHLD (Child exited) ---
118293 <... realloc resumed> )                                                                  = 0x1c00270
heatherkellyucl commented 8 years ago

There is a copy of simpleSprayBox in /shared/ucl/apps/STAR-CD/examples/ if anyone wants to have a go at tracing it as root while I am away.

module load star-cd/4.26.011
star
ikirker commented 7 years ago

Recommended use of the -notracker workaround; no further progress on solving the problem with the tracker. Closing.