grondo opened this issue 8 years ago
It looks like the current default MPI when I log in to opal still requires PMGR. I may open a TOSS-specific Jira issue on this unless it is already a known issue?
(flux-101740) grondo@opal19:~/git/flux-core.git/t/mpi$ flux wreckrun -n4 -N2 ./hello
srun: mvapich: 2016-06-22T14:51:22: ABORT from MPI rank 0 [on opal19] dest rank 9 [on (null)]
PMGR_COLLECTIVE ERROR: rank 0 on opal20: Reading from mpirun at 192.168.64.19:51697 (read(buf=609650,size=12) Success errno=0) @ file pmgr_collective_client_mpirun.c:60
PMGR_COLLECTIVE ERROR: rank 0 on opal20: Nonblocking connect failed immediately connecting to 0.0.200.215:44096 (connect() Invalid argument errno=22) @ file pmgr_collective_client_common.c:86
srun: mvapich: 2016-06-22T14:51:22: ABORT from MPI rank 1 [on opal20] dest rank 9 [on (null)]
Hangup
Part of the problem is that the default mpicc on opal seems to build with an rpath :-(
grondo@opal19:~/git/flux-core.git/t/mpi$ readelf --dynamic hello | grep PATH
0x000000000000000f (RPATH) Library rpath: [/usr/tce/packages/pmgr/pmgr-1.0/lib:/usr/tce/packages/mvapich2/mvapich2-2.2-intel-16.0.3/lib]
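One possible workaround, sketched below under the assumption that patchelf (or chrpath) is available on the system, would be to strip or override the baked-in rpath on the test binary rather than rebuilding against a different MPI:

```sh
# Inspect the rpath baked in by the default mpicc wrapper
readelf --dynamic hello | grep -i path

# Option 1: drop the rpath entirely so LD_LIBRARY_PATH takes effect
# (patchelf is assumed to be installed; `chrpath -d hello` is similar)
patchelf --remove-rpath hello

# Option 2: point the rpath at an MPI build without the pmgr dependency
# (the path below is illustrative, not an actual install on opal)
patchelf --set-rpath /path/to/alternate/mvapich2/lib hello
```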
Just FYI -- At least I don't plan to scale the overcommit factor that much as part of my testing. The max factor for #14 is 32.
Running into some problems sanity testing wreckrun with even just 2048 tasks on jade. It seems like some kvs watch callbacks aren't making it back to the Lua script; I'll have to debug this further:
grondo@jade1:~/git/flux-core.git$ flux wreckrun -v -n512 /bin/true
wreckrun: 0.004s: Sending LWJ request for 512 tasks (cmdline "/bin/true")
wreckrun: 0.010s: Registered jobid 13
wreckrun: Allocating 512 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-511]: 1
wreckrun: 0.097s: Sending run event
wreckrun: 0.325s: State = reserved
wreckrun: 3.560s: State = starting
wreckrun: 3.560s: State = running
wreckrun: 3.566s: State = complete
wreckrun: tasks [0-511]: exited with exit code 0
wreckrun: All tasks completed successfully.
Note that above we are notified of each state transition: reserved->starting->running->complete.
$ flux wreckrun -v -n1024 /bin/true
wreckrun: 0.003s: Sending LWJ request for 1024 tasks (cmdline "/bin/true")
wreckrun: 0.008s: Registered jobid 15
wreckrun: Allocating 1024 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-1023]: 1
wreckrun: 0.208s: Sending run event
wreckrun: 6.955s: State = complete
wreckrun: tasks [0-1023]: exited with exit code 0
wreckrun: All tasks completed successfully.
In this run, the script appears to miss all of the events up until complete. The state is communicated back to flux-wreckrun via a kvs watch on lwj.<id>.state. There could be an issue in the Lua bindings or something similar here; I also wonder whether, if the script is very busy processing kzio files, a bug could cause it to skip the kvs watch callbacks.
This is further detailed in flux-framework/flux-core#772
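For debugging, the state key can also be read directly from the kvs while a job runs, independent of the Lua bindings. A minimal sketch, assuming the lwj.<id>.state layout described above and using jobid 15 from the run above:

```sh
# Poll the job's state key directly, to confirm wrexecd is writing the
# transitions even when the wreckrun script misses its watch callbacks
while sleep 0.1; do
    flux kvs get lwj.15.state
done
```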
Ok, thanks to @SteVwonder we found that the missing state transitions are simply due to the fact that flux-wreckrun optimizes for issuing the wrexec.run request as soon as possible: the kvs_watch is actually issued after the run request, so it is no surprise that some states are missed.
However, there appear to be other problems with flux-wreckrun that cause hangs or very slow progress with many tasks. Jobs still run to completion, though, so data for this part of the milestone can still be gathered via flux wreckrun -d ..., and results can be collected from flux wreck timing.
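In other words, the data-gathering workflow for this part of the milestone is roughly the following sketch (task counts are placeholders; -d is the flux-wreckrun option mentioned above):

```sh
# Launch test jobs with -d, then, once they have completed,
# read the per-phase timings back out of the kvs
flux wreckrun -d -n 512 /bin/true
flux wreckrun -d -n 1024 /bin/true
# ... wait for the jobs to complete ...
flux wreck timing
```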
I was able to grab 512 nodes on jade this morning and get some preliminary results of hostname runs, scaling up the number of tasks per node:
(flux--1-5rZ) grondo@jade1:~/git/flux-core.git$ flux wreck timing
ID NTASKS STARTING RUNNING COMPLETE TOTAL
1 512 0.251s 0.390s 0.127s 0.518s
2 1024 0.226s 0.498s 0.361s 0.860s
3 2048 0.229s 0.498s 1.162s 1.660s
4 4096 0.222s 0.574s 3.590s 4.164s
5 6144 0.189s 0.694s 6.747s 7.441s
6 8192 0.205s 0.843s 11.760s 12.603s
7 12288 0.237s 1.062s 46.202s 47.265s
8 16384 5.067s 1.256s 10.940m 10.961m
9 16384 0.312s 1.455s 3.901s 5.356s
10 32768 0.469s 2.597s 7.798s 10.395s
All of these were run with the default stdio commit settings, except lwj 9 and 10, which used -o stdio-delay-commit.
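For comparison, the invocations differ only in the stdio option (a sketch; the task count is a placeholder):

```sh
# Default stdio commit behavior, as used for lwj 1-8 above
flux wreckrun -v -n 16384 hostname

# With per-task stdio commits delayed, as used for lwj 9 and 10
flux wreckrun -o stdio-delay-commit -v -n 16384 hostname
```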
Got time on 2048 nodes. Running 64K hostname tasks, though, I get the following error:
$ flux wreckrun -o stdio-delay-commit -v -I -n $((2048*32)) hostname
wreckrun: 0.019s: Sending LWJ request for 65536 tasks (cmdline "hostname")
wreckrun: 0.030s: Registered jobid 4
wreckrun: 0.031s: State = reserved
wreckrun: Allocating 65536 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 32
wreckrun: 0.536s: Sending run event
wreckrun: 9.973s: State = starting
2016-08-18T23:16:10.776056Z kvs.err[0]: content_store: Device or resource busy
2016-08-18T23:16:10.987379Z kvs.err[0]: content_store: Device or resource busy
2016-08-18T23:16:10.987415Z kvs.err[0]: content_store: Device or resource busy
2016-08-18T23:16:10.987443Z kvs.err[0]: content_store: Device or resource busy
wreckrun: 20.422s: State = running
2016-08-18T23:16:14.585425Z kvs.err[2]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585587Z kvs.err[5]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585660Z kvs.err[6]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585868Z kvs.err[12]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585881Z kvs.err[14]: content_load_completion: No such file or directory
2016-08-18T23:16:14.585951Z kvs.err[26]: content_load_completion: No such file or directory
This happened both with and without persist-filesystem set to /nfs/tmp2/grondo.
Btw, here I'm using a new (kludgy) -I, --ignore-stdio option to flux-wreckrun, which skips the kzio watches (and thus ignores stdout/stderr). This seems to resolve the "hangs" we were seeing above (likely the script was just busy processing all of the initial kz callbacks).
Before my session exits, here are results up to 32K tasks on 2048 nodes:
$ for i in 2 4 8 16; do flux wreckrun -o stdio-delay-commit -v -I -n $((2048*${i})) hostname; done
wreckrun: 0.005s: Sending LWJ request for 4096 tasks (cmdline "hostname")
wreckrun: 0.011s: Registered jobid 1
wreckrun: 0.012s: State = reserved
wreckrun: Allocating 4096 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 2
wreckrun: 0.520s: Sending run event
wreckrun: 6.588s: State = starting
wreckrun: 6.588s: State = running
wreckrun: 6.588s: State = complete
wreckrun: tasks [0-4095]: exited with exit code 0
wreckrun: All tasks completed successfully.
wreckrun: 0.002s: Sending LWJ request for 8192 tasks (cmdline "hostname")
wreckrun: 0.006s: Registered jobid 2
wreckrun: 0.007s: State = reserved
wreckrun: Allocating 8192 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 4
wreckrun: 0.515s: Sending run event
wreckrun: 8.464s: State = starting
wreckrun: 8.464s: State = running
wreckrun: 8.464s: State = complete
wreckrun: tasks [0-8191]: exited with exit code 0
wreckrun: All tasks completed successfully.
wreckrun: 0.002s: Sending LWJ request for 16384 tasks (cmdline "hostname")
wreckrun: 0.006s: Registered jobid 3
wreckrun: 0.006s: State = reserved
wreckrun: Allocating 16384 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 8
wreckrun: 0.482s: Sending run event
wreckrun: 9.587s: State = starting
wreckrun: 9.587s: State = running
wreckrun: 13.110s: State = complete
wreckrun: tasks [0-16383]: exited with exit code 0
wreckrun: All tasks completed successfully.
wreckrun: 0.005s: Sending LWJ request for 32768 tasks (cmdline "hostname")
wreckrun: 0.010s: Registered jobid 4
wreckrun: 0.011s: State = reserved
wreckrun: Allocating 32768 tasks across 2048 available nodes..
wreckrun: tasks per node: node[0-2047]: 16
wreckrun: 0.558s: Sending run event
wreckrun: 7.958s: State = starting
wreckrun: 11.682s: State = running
wreckrun: 18.464s: State = complete
wreckrun: tasks [0-32767]: exited with exit code 0
wreckrun: All tasks completed successfully.
$ flux wreck timing
ID NTASKS STARTING RUNNING COMPLETE TOTAL
1 4096 3.209s 0.847s 0.898s 1.745s
2 8192 1.092s 3.271s 1.710s 4.981s
4 32768 4.256s 2.145s 6.858s 9.003s
3 16384 1.114s 3.942s 4.139s 8.081s
Note that flux-wreckrun prints the timestamp at the time it processes these events, while the timing from flux wreck timing is recorded by rank 0 wrexecd and inserted into the kvs. The latter is closer to the actual time the event/state occurred, while the flux-wreckrun timestamps are a realistic measure of how soon the state change/kvs watch can be processed by an actor.
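To cross-check the client-side timestamps against what wrexecd actually recorded, the job's kvs directory can be inspected directly. A sketch (jobid 4 is illustrative; the exact keys stored by wrexecd should be taken from the directory listing rather than assumed):

```sh
# List everything rank 0 wrexecd stored for this job, including the
# data summarized by `flux wreck timing`
flux kvs dir lwj.4
# Read back an individual key of interest, e.g. the final state
flux kvs get lwj.4.state
```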
I did verify that a 16K-task MPI job works:
0: 0: completed MPI_Init in 40.577s. There are 16384 tasks
0: 0: completed first barrier in 0.095s
0: 0: completed MPI_Finalize in 0.666s
I was able to get some more runs this morning, including a ~43K task mpi hello job:
0: completed MPI_Init in 262.517s. There are 44320 tasks
0: completed first barrier in 0.554s
0: completed MPI_Finalize in 1.868s
This was launched across 2216 nodes. The other jobs were test runs of /bin/true:
$ flux wreck timing
ID NTASKS STARTING RUNNING COMPLETE TOTAL
1 2216 4.897s 1.493s 0.516s 2.009s
2 35456 1.054s 5.291s 6.849s 12.139s
3 44320 1.005s 5.472s 8.652s 14.124s
4 44320 1.214s 6.273s 4.417m 4.522m # mpi_hello
5 53184 1.194s 6.167s 11.501s 17.668s
6 62048 1.193s 7.723s 13.745s 21.468s
I got up to 28 tasks per node before hitting the issue in the comment above.
_Goals_
Test scalability and usability of Flux program launch on full system. Determine any bugs or scaling and usability issues.
_Methodology_
Launch and collect timing data for a series of programs, both MPI and non-MPI, and compare with the baseline SLURM launch data collected in #13. Utilize and/or enhance the instrumentation already extant in flux-wreckrun and record the timing of phases (e.g., starting, running, complete). Run these tests through a similar scale as the baseline described in #13, with enough samples for statistical validity. Vary the number of tasks per broker rank as well as the total number of tasks for each program. Publish results in this issue.
Time permitting, include scale testing of a program with increasing amounts of stdio and record the impact on runtime (Tcompleted - Trunning).
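A sketch of the planned sweep, following the same pattern as the runs above (the node count, per-node factors, and the stdio-heavy command are placeholders to be adjusted for the actual allocation):

```sh
NNODES=2304
# Vary tasks per broker rank as well as total tasks
for per_node in 1 2 4 8 16 32; do
    flux wreckrun -o stdio-delay-commit -v -I -n $((NNODES*per_node)) hostname
done

# Time permitting: a stdio-heavy variant to measure the impact on
# runtime (Tcompleted - Trunning)
for per_node in 1 2 4 8; do
    flux wreckrun -v -n $((NNODES*per_node)) sh -c 'seq 1 1000'
done

flux wreck timing
```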
_Exit criteria_
flux wreckrun -n $((2304*cores_per_node)) mpi_hello
_Issues to watch out for_
Set persist-directory to ensure the content store doesn't fill up local tmp or tmpfs during these runs.
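A minimal sketch of how that might be set when starting the instance (assuming the broker's --setattr option, and that /nfs/tmp2/$USER is a suitable scratch filesystem, as in the persist-filesystem runs above):

```sh
# Point content-store persistence at a large shared filesystem so it
# does not fill node-local tmp/tmpfs during large runs; pass this to
# the broker(s) however the instance is launched
flux broker --setattr=persist-directory=/nfs/tmp2/$USER
```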