grondo opened this issue 8 years ago
It looks like the current default MPI when I log in to opal still requires PMGR. I may open a TOSS-specific Jira issue on this unless it is already a known issue?
(flux-101740) grondo@opal19:~/git/flux-core.git/t/mpi$ flux wreckrun -n4 -N2 ./hello
srun: mvapich: 2016-06-22T14:51:22: ABORT from MPI rank 0 [on opal19] dest rank 9 [on (null)]
PMGR_COLLECTIVE ERROR: rank 0 on opal20: Reading from mpirun at 192.168.64.19:51697 (read(buf=609650,size=12) Success errno=0) @ file pmgr_collective_client_mpirun.c:60
PMGR_COLLECTIVE ERROR: rank 0 on opal20: Nonblocking connect failed immediately connecting to 0.0.200.215:44096 (connect() Invalid argument errno=22) @ file pmgr_collective_client_common.c:86
srun: mvapich: 2016-06-22T14:51:22: ABORT from MPI rank 1 [on opal20] dest rank 9 [on (null)]
Hangup
Part of the problem is that the default mpicc on opal seems to build with an rpath :-(
grondo@opal19:~/git/flux-core.git/t/mpi$ readelf --dynamic hello | grep PATH
0x000000000000000f (RPATH) Library rpath: [/usr/tce/packages/pmgr/pmgr-1.0/lib:/usr/tce/packages/mvapich2/mvapich2-2.2-intel-16.0.3/lib]
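One possible workaround, sketched below under the assumption that patchelf (or chrpath) is available on the system, would be to strip or override the baked-in rpath on the test binary rather than rebuilding against a different MPI:

```sh
# Inspect the rpath baked in by the default mpicc wrapper
readelf --dynamic hello | grep -i path

# Option 1: drop the rpath entirely so LD_LIBRARY_PATH takes effect
# (patchelf is assumed to be installed; `chrpath -d hello` is similar)
patchelf --remove-rpath hello

# Option 2: point the rpath at an MPI build without the pmgr dependency
# (the path below is illustrative, not an actual install on opal)
patchelf --set-rpath /path/to/alternate/mvapich2/lib hello
```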
Just FYI -- At least I don't plan to scale the overcommit factor that much as part of my testing. The max factor for #14 is 32.
Running into some problems sanity testing wreckrun with even just 2048 tasks on jade. It seems like some kvs watch callbacks aren't making it back to the Lua script; I'll have to debug this further:
grondo@jade1:~/git/flux-core.git$ flux wreckrun -v -n512 /bin/true
wreckrun: 0.004s: Sending LWJ request for 512 tasks (cmdline "/bin/true")
wreckrun: 0.010s: Registered jobid 13
wreckrun: Allocating 512 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-511]: 1
wreckrun: 0.097s: Sending run event
wreckrun: 0.325s: State = reserved
wreckrun: 3.560s: State = starting
wreckrun: 3.560s: State = running
wreckrun: 3.566s: State = complete
wreckrun: tasks [0-511]: exited with exit code 0
wreckrun: All tasks completed successfully.
Note that above we are notified of each state transition: reserved->starting->running->complete.
$ flux wreckrun -v -n1024 /bin/true
wreckrun: 0.003s: Sending LWJ request for 1024 tasks (cmdline "/bin/true")
wreckrun: 0.008s: Registered jobid 15
wreckrun: Allocating 1024 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-1023]: 1
wreckrun: 0.208s: Sending run event
wreckrun: 6.955s: State = complete
wreckrun: tasks [0-1023]: exited with exit code 0
wreckrun: All tasks completed successfully.
In this run, the script appears to miss all of the events up until complete. The state is communicated back to flux-wreckrun via a kvs watch on lwj.<id>.state. There could be an issue in the Lua bindings or something similar here; I also wonder whether, if the script is very busy processing kzio files, a bug could cause it to skip the kvs watch callbacks.
This is further detailed in flux-framework/flux-core#772
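For debugging, the state key can also be read directly from the kvs while a job runs, independent of the Lua bindings. A minimal sketch, assuming the lwj.<id>.state layout described above and using jobid 15 from the run above:

```sh
# Poll the job's state key directly, to confirm wrexecd is writing the
# transitions even when the wreckrun script misses its watch callbacks
while sleep 0.1; do
    flux kvs get lwj.15.state
done
```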
Ok, thanks to @SteVwonder we found that the missing state transitions are simply due to the fact that flux-wreckrun optimizes for issuing the wrexec.run request as soon as possible: the kvs_watch is actually issued after the run request, so it is no surprise that some states are missed.
However, there appear to be other problems with flux-wreckrun that cause hangs or very slow progress with many tasks. Jobs still run to completion, though, so data for this part of the milestone can still be gathered via flux wreckrun -d ..., and results can be collected from flux wreck timing.
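In other words, the data-gathering workflow for this part of the milestone is roughly the following sketch (task counts are placeholders; -d is the flux-wreckrun option mentioned above):

```sh
# Launch test jobs with -d, then, once they have completed,
# read the per-phase timings back out of the kvs
flux wreckrun -d -n 512 /bin/true
flux wreckrun -d -n 1024 /bin/true
# ... wait for the jobs to complete ...
flux wreck timing
```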
I was able to grab 512 nodes on jade this morning and get some preliminary results of hostname runs, scaling up the number of tasks per node:
(flux--1-5rZ) grondo@jade1:~/git/flux-core.git$ flux wreck timing
ID NTASKS STARTING RUNNING COMPLETE TOTAL
1 512 0.251s 0.390s 0.127s 0.518s
2 1024 0.226s 0.498s 0.361s 0.860s
3 2048 0.229s 0.498s 1.162s 1.660s
4 4096 0.222s 0.574s 3.590s 4.164s
5 6144 0.189s 0.694s 6.747s 7.441s
6 8192 0.205s 0.843s 11.760s 12.603s
7 12288 0.237s 1.062s 46.202s 47.265s
8 16384 5.067s 1.256s 10.940m 10.961m
9 16384 0.312s 1.455s 3.901s 5.356s
10 32768 0.469s 2.597s 7.798s 10.395s
All of these were run with the default stdio commit settings, except lwj 9 and 10, which used -o stdio-delay-commit.
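For comparison, the invocations differ only in the stdio option (a sketch; the task count is a placeholder):

```sh
# Default stdio commit behavior, as used for lwj 1-8 above
flux wreckrun -v -n 16384 hostname

# With per-task stdio commits delayed, as used for lwj 9 and 10
flux wreckrun -o stdio-delay-commit -v -n 16384 hostname
```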
Got time on 2048 nodes. Running 64K hostname tasks, though, I get the following error:
$ flux wreckrun -o stdio-delay-commit -v -I -n $((2048*32)) hostname
wreckrun: 0.019s: Sending LWJ request for 65536 tasks (cmdline "hostname")
wreckrun: 0.030s: Registered jobid 4
wreckrun: 0.031s: State = reserved
wreckrun: Allocating 65536 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 32
wreckrun: 0.536s: Sending run event
wreckrun: 9.973s: State = starting
2016-08-18T23:16:10.776056Z kvs.err[0]: content_store: Device or resource busy
2016-08-18T23:16:10.987379Z kvs.err[0]: content_store: Device or resource busy
2016-08-18T23:16:10.987415Z kvs.err[0]: content_store: Device or resource busy
2016-08-18T23:16:10.987443Z kvs.err[0]: content_store: Device or resource busy
wreckrun: 20.422s: State = running
2016-08-18T23:16:14.585425Z kvs.err[2]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585587Z kvs.err[5]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585660Z kvs.err[6]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585868Z kvs.err[12]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585881Z kvs.err[14]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585951Z kvs.err[26]: content_load_completion: No such file or directory
This happened both with and without persist-filesystem set to /nfs/tmp2/grondo.
Btw, here I'm using a new (kludgy) -I, --ignore-stdio option to flux-wreckrun, which skips the kzio watches (and thus ignores stdout/stderr). This seems to resolve the "hangs" we were seeing above (likely the script was just busy processing all of the initial kz callbacks).
Before my session exits, here are results up to 32K tasks on 2048 nodes:
$ for i in 2 4 8 16; do flux wreckrun -o stdio-delay-commit -v -I -n $((2048*${i})) hostname; done
wreckrun: 0.005s: Sending LWJ request for 4096 tasks (cmdline "hostname")
wreckrun: 0.011s: Registered jobid 1
wreckrun: 0.012s: State = reserved
wreckrun: Allocating 4096 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 2
wreckrun: 0.520s: Sending run event
wreckrun: 6.588s: State = starting
wreckrun: 6.588s: State = running
wreckrun: 6.588s: State = complete
wreckrun: tasks [0-4095]: exited with exit code 0
wreckrun: All tasks completed successfully.
wreckrun: 0.002s: Sending LWJ request for 8192 tasks (cmdline "hostname")
wreckrun: 0.006s: Registered jobid 2
wreckrun: 0.007s: State = reserved
wreckrun: Allocating 8192 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 4
wreckrun: 0.515s: Sending run event
wreckrun: 8.464s: State = starting
wreckrun: 8.464s: State = running
wreckrun: 8.464s: State = complete
wreckrun: tasks [0-8191]: exited with exit code 0
wreckrun: All tasks completed successfully.
wreckrun: 0.002s: Sending LWJ request for 16384 tasks (cmdline "hostname")
wreckrun: 0.006s: Registered jobid 3
wreckrun: 0.006s: State = reserved
wreckrun: Allocating 16384 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 8
wreckrun: 0.482s: Sending run event
wreckrun: 9.587s: State = starting
wreckrun: 9.587s: State = running
wreckrun: 13.110s: State = complete
wreckrun: tasks [0-16383]: exited with exit code 0
wreckrun: All tasks completed successfully.
wreckrun: 0.005s: Sending LWJ request for 32768 tasks (cmdline "hostname")
wreckrun: 0.010s: Registered jobid 4
wreckrun: 0.011s: State = reserved
wreckrun: Allocating 32768 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 16
wreckrun: 0.558s: Sending run event
wreckrun: 7.958s: State = starting
wreckrun: 11.682s: State = running
wreckrun: 18.464s: State = complete
wreckrun: tasks [0-32767]: exited with exit code 0
wreckrun: All tasks completed successfully.
$ flux wreck timing
ID NTASKS STARTING RUNNING COMPLETE TOTAL
1 4096 3.209s 0.847s 0.898s 1.745s
2 8192 1.092s 3.271s 1.710s 4.981s
4 32768 4.256s 2.145s 6.858s 9.003s
3 16384 1.114s 3.942s 4.139s 8.081s
Note that flux-wreckrun prints the timestamp at the time it processes these events, while the timing from flux wreck timing is recorded by rank 0 wrexecd and inserted into the kvs. The latter is closer to the actual time the event/state occurred, while the flux-wreckrun timestamps are a realistic measure of how soon the state change/kvs watch can be processed by an actor.
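To cross-check the client-side timestamps against what wrexecd actually recorded, the job's kvs directory can be inspected directly. A sketch (jobid 4 is illustrative; the exact keys stored by wrexecd should be taken from the directory listing rather than assumed):

```sh
# List everything rank 0 wrexecd stored for this job, including the
# data summarized by `flux wreck timing`
flux kvs dir lwj.4
# Read back an individual key of interest, e.g. the final state
flux kvs get lwj.4.state
```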
I did verify that a 16K-task MPI job works:
0: 0: completed MPI_Init in 40.577s. There are 16384 tasks
0: 0: completed first barrier in 0.095s
0: 0: completed MPI_Finalize in 0.666s
I was able to get some more runs this morning, including a ~43K task mpi hello job:
0: completed MPI_Init in 262.517s. There are 44320 tasks
0: completed first barrier in 0.554s
0: completed MPI_Finalize in 1.868s
This was launched across 2216 nodes. The other jobs were test runs of /bin/true:
$ flux wreck timing
ID NTASKS STARTING RUNNING COMPLETE TOTAL
1 2216 4.897s 1.493s 0.516s 2.009s
2 35456 1.054s 5.291s 6.849s 12.139s
3 44320 1.005s 5.472s 8.652s 14.124s
4 44320 1.214s 6.273s 4.417m 4.522m # mpi_hello
5 53184 1.194s 6.167s 11.501s 17.668s
6 62048 1.193s 7.723s 13.745s 21.468s
I got up to 28 tasks per node before hitting the issue in the comment above.
_Goals_
Test scalability and usability of Flux program launch on full system. Determine any bugs or scaling and usability issues.
_Methodology_
Launch and collect timing data for a series of programs, both MPI and non-MPI, and compare with the baseline SLURM launch data collected in #13. Utilize and/or enhance the instrumentation already extant in flux-wreckrun and record the timing of phases (e.g., starting, running, complete). Run these tests through a similar scale as the baseline described in #13, with enough samples for statistical validity. Vary the number of tasks per broker rank as well as the total number of tasks for each program. Publish results in this issue.
Time permitting, include scale testing of a program with increasing amounts of stdio and record the impact on runtime (Tcompleted - Trunning).
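A sketch of the planned sweep, following the same pattern as the runs above (the node count, per-node factors, and the stdio-heavy command are placeholders to be adjusted for the actual allocation):

```sh
NNODES=2304
# Vary tasks per broker rank as well as total tasks
for per_node in 1 2 4 8 16 32; do
    flux wreckrun -o stdio-delay-commit -v -I -n $((NNODES*per_node)) hostname
done

# Time permitting: a stdio-heavy variant to measure the impact on
# runtime (Tcompleted - Trunning)
for per_node in 1 2 4 8; do
    flux wreckrun -v -n $((NNODES*per_node)) sh -c 'seq 1 1000'
done

flux wreck timing
```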
_Exit criteria_
flux wreckrun -n $((2304*cores_per_node)) mpi_hello
_Issues to watch out for_
Set persist-directory to ensure the content store doesn't fill up local tmp or tmpfs during these runs.
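A minimal sketch of how that might be set when starting the instance (assuming the broker's --setattr option, and that /nfs/tmp2/$USER is a suitable scratch filesystem, as in the persist-filesystem runs above):

```sh
# Point content-store persistence at a large shared filesystem so it
# does not fill node-local tmp/tmpfs during large runs; pass this to
# the broker(s) however the instance is launched
flux broker --setattr=persist-directory=/nfs/tmp2/$USER
```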