OpenSpeedShop / openspeedshop

Open|SpeedShop is a community effort by the Krell Institute, currently funded directly by DOE’s NNSA and Office of Science. It builds on a broad set of community infrastructures, most notably Dyninst and MRNet from UW, libmonitor from Rice, and PAPI from UTK. Open|SpeedShop is an open-source, multi-platform Linux performance tool that supports performance analysis of applications running on single-node and large-scale systems based on Intel, AMD, ARM, Intel Phi, PPC, and GPU processors, as well as on Blue Gene and Cray platforms.
https://www.openspeedshop.org

Better detection of number of ranks specification usage on OSS convenience scripts #1

Open jgalarowicz opened 7 years ago

jgalarowicz commented 7 years ago

The convenience scripts' detection of the number-of-ranks specification isn't robust enough.
srun -n 4 works but srun -n4 doesn't, e.g. osspcsamp "srun -n4 ./nbody". srun with no rank specifier does not work either: osspcsamp "srun ./nbody"

From Martin S: It seems to be very specific, though, since also “-n4” didn’t work (it needed the space). It would be good if we could make that a bit more general (space/no space, no -n argument, etc.).
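One way to accept the forms Martin lists (space, no space, no -n at all) is sketched below in Python. This is only an illustration, not the code in the O|SS convenience scripts: the helper name `ntasks_from_args` is hypothetical, and it assumes it is handed only the launcher's own options, already separated from the application's arguments. It relies on srun accepting `-n 4`, `-n4`, `--ntasks=4` and `--ntasks 4`.

```python
import re

# Matches the forms where option and value share one token: "-n4", "--ntasks=4".
_ATTACHED = re.compile(r"^(?:-n|--ntasks=)(\d+)$")


def ntasks_from_args(launcher_args):
    """Return the task count found in a list of launcher options, or None.

    Hypothetical helper: launcher_args must hold only the launcher's options,
    not the application's arguments.
    """
    for i, tok in enumerate(launcher_args):
        match = _ATTACHED.match(tok)
        if match:
            return int(match.group(1))
        # Forms where the value is a separate token: "-n 4", "--ntasks 4".
        if tok in ("-n", "--ntasks") and i + 1 < len(launcher_args):
            value = launcher_args[i + 1]
            if value.isdigit():
                return int(value)
    return None  # no rank specifier given; the caller needs a fallback
```

Returning `None` rather than a default of 1 keeps the "no -n argument" case visible to the caller, which can then decide where else to look for the task count.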

However, when I run O|SS (the CBTF version), something breaks - for one, the scripts seem to grab the wrong “-n” from the command line and launch too many backends:

g23/schulz/prgs/smg2000/test> osspcsamp "srun ./smg2000 -n 50 50 50"
[openss]: pcsamp experiment using the default sampling rate: "100".
Creating topology file for slurm frontend node cab5 for SLURM_JOB_ID 2264568
Generated topology file: ./cbtfAutoTopology
Running pcsamp collector.
Program: srun ./smg2000 -n 50 50 50
Number of mrnet backends: 50
Topology file used: ./cbtfAutoTopology
executing mpi program: srun cbtfrun --mpi --mrnet -c pcsamp ./smg2000 -n 50 50 50
^Csrun: interrupt (one more within 1 sec to abort)
srun: tasks 0-3: running
174940133.251075: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:55994:3
174940133.251013: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:47034:1
174940133.251037: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:47034:1
174940133.251041: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:47563:2
^Csrun: sending Ctrl-C to job 2264568.3
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cab5]: STEP 2264568.3 KILLED AT 2016-12-30T15:28:54 WITH SIGNAL 9

(The -n 50 is an argument to the application, not to srun, which gets its node count from a prior allocation.)
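To avoid picking up an application's own -n, the script could stop scanning at the executable name. The Python sketch below illustrates that idea under stated assumptions: `split_launch` and `VALUE_OPTS` are hypothetical names, and `VALUE_OPTS` is a deliberately incomplete set of srun options that take their value as a separate token. A real script would need the launcher's full option table.

```python
import shlex

# Assumption: an incomplete, illustrative set of srun options that consume
# the following token as their value when written with a space.
VALUE_OPTS = {"-n", "-N", "-c", "-p", "-t", "-A", "-J", "-w", "-x",
              "--ntasks", "--nodes"}


def split_launch(command):
    """Split a launch command into (launcher part, application argv).

    Hypothetical helper: the first token that is neither an option nor the
    value of a known option is taken to be the application executable.
    """
    tokens = shlex.split(command)
    i = 1  # tokens[0] is the launcher itself, e.g. "srun"
    while i < len(tokens):
        tok = tokens[i]
        if not tok.startswith("-"):
            break  # first non-option token: the application executable
        # Skip the separate value token for options written as "-n 4".
        i += 2 if tok in VALUE_OPTS else 1
    return tokens[:i], tokens[i:]
```

With the command from the report, `split_launch("srun ./smg2000 -n 50 50 50")` leaves `-n 50 50 50` in the application part, so a rank-count search limited to the launcher part would not see it.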

When I remove the -n and run plain, things still get stuck:

g23/schulz/prgs/smg2000/test> osspcsamp "srun smg2000"
[openss]: pcsamp experiment using the default sampling rate: "100".
Creating topology file for slurm frontend node cab5 for SLURM_JOB_ID 2264568
Generated topology file: ./cbtfAutoTopology
Running pcsamp collector.
Program: srun smg2000
Number of mrnet backends: 1
Topology file used: ./cbtfAutoTopology
executing mpi program: srun cbtfrun --mpi --mrnet -c pcsamp smg2000
CBTF_MRNet_LW_connect: Failed to parse connections file /g/g23/schulz/prgs/smg2000/test/attachBE_connections
CBTF_MRNet_LW_connect: Failed for myRank 10001, mrank 10001, con_rank 1
CBTF_MRNet_LW_connect: Failed to parse connections file /g/g23/schulz/prgs/smg2000/test/attachBE_connections
CBTF_MRNet_LW_connect: Failed for myRank 10002, mrank 10002, con_rank 2
CBTF_MRNet_LW_connect: Failed to parse connections file /g/g23/schulz/prgs/smg2000/test/attachBE_connections
CBTF_MRNet_LW_connect: Failed for myRank 10003, mrank 10003, con_rank 3
Running with these driver parameters:
(nx, ny, nz) = (10, 10, 10)
(Px, Py, Pz) = (4, 1, 1)
(bx, by, bz) = (1, 1, 1)
(cx, cy, cz) = (1.000000, 1.000000, 1.000000)
(n_pre, n_post) = (1, 1)
dim = 3
solver ID = 0
^Csrun: interrupt (one more within 1 sec to abort)
srun: tasks 0-3: running
^Csrun: sending Ctrl-C to job 2264568.6
174940208.775166: Message.c[305] Message_send - MRN_send failed
174940208.776290: PeerNode.c[176] PeerNode_sendDirectly - Message_send() failed
174940208.776293: Network.c[839] Network_send_PacketToParent - upstream.send() failed
174940208.776295: Network.c[842] Network_send_PacketToParent - assume parent failure, try one more time
174940208.776332: Network.c[1030] Network_recover_FromParentFailure - RECOVERY: NEW PARENT: cab5.llnl.gov:46397:1
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cab5]: STEP 2264568.6 KILLED AT 2016-12-30T15:30:08 WITH SIGNAL 9
^C
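In this run the script reports "Number of mrnet backends: 1" while srun goes on to start tasks 0-3, so the backend count and the actual task count disagree when no -n is given. One possible fallback, sketched here in Python, is to consult the Slurm environment of the existing allocation. The helper name `ntasks_from_env` is hypothetical, and whether `SLURM_NTASKS` (or the older `SLURM_NPROCS`) is set depends on how the allocation was requested, so this is an assumption rather than a guaranteed source of the count.

```python
import os


def ntasks_from_env(env=None):
    """Return the task count advertised by the Slurm environment, or None.

    Hypothetical helper. Assumes SLURM_NTASKS or SLURM_NPROCS may be present
    inside an allocation; neither is guaranteed to be set.
    """
    env = os.environ if env is None else env
    for name in ("SLURM_NTASKS", "SLURM_NPROCS"):
        value = env.get(name, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return None  # nothing usable found; the caller must pick a default
```

Passing the environment as a parameter keeps the lookup easy to check outside a Slurm job, for example with `ntasks_from_env({"SLURM_NTASKS": "4"})`.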

The mentioned files look OK:

g23/schulz/prgs/smg2000/test> cat attachBE_connections
cab5.llnl.gov 42880 0 0
g23/schulz/prgs/smg2000/test> u
Linux cab5 2.6.32-642.6.2.1chaos.ch5.5.x86_64 #1 SMP Mon Oct 24 10:49:01 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
15:30:19 up 25 days, 21:33, 0 users, load average: 0.60, 5.00, 9.29
g23/schulz/prgs/smg2000/test> cat cbtfAutoTopology
cab5:0 => cab5:1;