garlick opened this issue 8 years ago
Quick opal test:
$ time srun -N8 ./flux broker flux getattr size
8
real 0m1.866s
user 0m0.016s
sys 0m0.030s
$ time srun -N16 ./flux broker flux getattr size
16
real 0m2.568s
user 0m0.026s
sys 0m0.043s
$ time srun -N32 ./flux broker flux getattr size
32
real 0m2.859s
user 0m0.039s
sys 0m0.079s
$ time srun -N64 ./flux broker flux getattr size
64
real 0m11.841s
user 0m0.058s
sys 0m0.098s
$ time srun -N128 ./flux broker flux getattr size
128
real 0m7.147s
user 0m0.094s
sys 0m0.194s
$ time srun -N128 -n256 ./flux broker flux getattr size
256
real 0m9.238s
user 0m0.114s
sys 0m0.187s
$ time srun -N128 -n512 ./flux broker flux getattr size
512
real 0m16.258s
user 0m0.127s
sys 0m0.220s
$ time srun -N128 -n1024 ./flux broker flux getattr size
1024
real 0m39.199s
user 0m0.175s
sys 0m0.326s
(got core dump at size 2048 - see flux-framework/flux-core#683)
Baseline with `mpi_hello`, compiled with Intel MPI, which uses PMI:
$ time srun -N 8 ./hello
0: completed MPI_Init in 0.163s. There are 8 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m2.443s
user 0m0.033s
sys 0m0.047s
$ time srun -N 16 ./hello
0: completed MPI_Init in 0.671s. There are 16 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m3.077s
user 0m0.047s
sys 0m0.081s
$ time srun -N 32 ./hello
0: completed MPI_Init in 0.973s. There are 32 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m3.219s
user 0m0.107s
sys 0m0.130s
$ time srun -N 64 ./hello
0: completed MPI_Init in 1.376s. There are 64 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 2.004s
real 0m3.685s
user 0m0.182s
sys 0m0.193s
$ time srun -N 128 ./hello
0: completed MPI_Init in 2.320s. There are 128 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m4.962s
user 0m0.265s
sys 0m0.367s
$ time srun -N 128 -n 256 ./hello
0: completed MPI_Init in 1.404s. There are 256 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 2.005s
real 0m4.205s
user 0m0.351s
sys 0m0.440s
$ time srun -N 128 -n 512 ./hello
0: completed MPI_Init in 2.408s. There are 512 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 2.007s
real 0m5.197s
user 0m0.396s
sys 0m0.757s
$ time srun -N 128 -n 1024 ./hello
0: completed MPI_Init in 4.480s. There are 1024 tasks
0: completed first barrier in 0.001s
0: completed MPI_Finalize in 2.043s
real 0m7.785s
user 0m0.470s
sys 0m0.984s
$ time srun -N 128 -n 2048 ./hello
0: completed MPI_Init in 7.954s. There are 2048 tasks
0: completed first barrier in 0.001s
0: completed MPI_Finalize in 4.245s
real 0m13.462s
user 0m0.732s
sys 0m1.564s
$ time srun -N 128 -n 4096 ./hello
0: completed MPI_Init in 17.283s. There are 4096 tasks
0: completed first barrier in 0.002s
0: completed MPI_Finalize in 8.338s
real 0m27.137s
user 0m1.431s
sys 0m3.099s
Now numbers with rc1/rc3 disabled:
$ export FLUX_RC1_PATH=""
$ export FLUX_RC3_PATH=""
$ time srun -N8 ./flux broker flux getattr size
8
real 0m1.103s
user 0m0.019s
sys 0m0.034s
$ time srun -N16 ./flux broker flux getattr size
16
real 0m1.796s
user 0m0.026s
sys 0m0.052s
$ time srun -N32 ./flux broker flux getattr size
32
real 0m1.651s
user 0m0.054s
sys 0m0.062s
$ time srun -N64 ./flux broker flux getattr size
64
real 0m1.836s
user 0m0.048s
sys 0m0.093s
$ time srun -N128 ./flux broker flux getattr size
128
real 0m3.825s
user 0m0.108s
sys 0m0.159s
$ time srun -N128 -n256 ./flux broker flux getattr size
256
real 0m3.438s
user 0m0.123s
sys 0m0.182s
$ time srun -N128 -n512 ./flux broker flux getattr size
512
real 0m3.885s
user 0m0.141s
sys 0m0.236s
$ time srun -N128 -n1024 ./flux broker flux getattr size
1024
real 0m7.707s
user 0m0.221s
sys 0m0.481s
$ time srun -N128 -n2048 ./flux broker flux getattr size
2048
real 0m9.323s
user 0m0.329s
sys 0m0.859s
(core dump at size=4096 - see flux-framework/flux-core#683)
Conclusions:
I wonder if `perf timechart` run on the broker 0 node would offer any insight as a first cut? I was able to run `perf timechart record su - grondo -- /path/to/flux start --size=8 flux COMMAND` followed by `perf timechart` to get a chart, but I couldn't glean much from it right away. On a TOSS 2 system I couldn't get the `-g` option (to resolve symbols) to work.
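For reference, that workflow is just two steps; a minimal sketch (the flux path, instance size, and command are placeholders for whatever is being profiled):

```
# On the rank 0 broker node: record scheduler/CPU activity while a small
# flux instance starts, runs a command, and shuts down.
perf timechart record /path/to/flux start --size=8 flux getattr size

# Render the recorded perf.data into an SVG timechart (output.svg by default).
perf timechart
```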
Here are a few more data points on opal, with rc1/rc3 edited to drop resource-hwloc. That doesn't improve things as much as one might hope!
bash-4.2$ time srun -N8 ./flux start flux getattr size
8
real 0m1.917s
user 0m0.017s
sys 0m0.029s
bash-4.2$ time srun -N16 ./flux start flux getattr size
16
real 0m2.143s
user 0m0.025s
sys 0m0.043s
bash-4.2$ time srun -N32 ./flux start flux getattr size
32
real 0m2.281s
user 0m0.045s
sys 0m0.050s
bash-4.2$ time srun -N64 ./flux start flux getattr size
64
real 0m2.636s
user 0m0.054s
sys 0m0.091s
bash-4.2$ time srun -N128 ./flux start flux getattr size
128
real 0m4.758s
user 0m0.117s
sys 0m0.153s
bash-4.2$ time srun -N128 -n256 ./flux start flux getattr size
256
real 0m6.550s
user 0m0.115s
sys 0m0.151s
bash-4.2$ time srun -N128 -n512 ./flux start flux getattr size
512
real 0m12.231s
user 0m0.134s
sys 0m0.203s
bash-4.2$ time srun -N128 -n1024 ./flux start flux getattr size
[1464387480.415573] broker.crit[409]: child 819 idle for 3 heartbeats
[1464387480.415660] broker.crit[411]: child 824 idle for 3 heartbeats
[1464387480.415641] broker.crit[401]: child 803 idle for 3 heartbeats
[1464387480.415668] broker.crit[439]: child 879 idle for 3 heartbeats
[1464387480.415739] broker.crit[376]: child 753 idle for 3 heartbeats
[1464387480.415772] broker.crit[438]: child 878 idle for 3 heartbeats
[1464387480.415849] broker.crit[438]: child 877 idle for 3 heartbeats
[1464387480.415952] broker.crit[424]: child 849 idle for 3 heartbeats
1024
real 0m34.817s
user 0m0.179s
sys 0m0.298s
TBON fanout, as noted before, seems to have very little impact here:
bash-4.2$ time srun -N128 -n1024 ./flux start -o,--k-ary=8 flux getattr size
1024
real 0m34.734s
user 0m0.176s
sys 0m0.309s
bash-4.2$ time srun -N128 -n1024 ./flux start -o,--k-ary=32 flux getattr size
1024
real 0m34.993s
user 0m0.170s
sys 0m0.324s
Here's a summary of results (above) so far, plus another baseline on current master (rightmost column).
This latest baseline was within a 64-node salloc (with a warm page cache) instead of 128 nodes like before, so twice as many ranks per node were run. Times are "real" time in seconds. The "job" was, e.g.:
$ time srun -N64 -n1024 ./flux start /bin/true
Scale | Baseline | MPI hello | rc disabled | flux-framework/flux-core@2bde1f86750c239a6884c47b2c7ff4db0543d057 |
---|---|---|---|---|
128 | 7.147 | 4.962 | 3.825 | 4.851 |
256 | 9.238 | 4.205 | 3.438 | 6.433 |
512 | 16.258 | 5.197 | 3.885 | 10.168 |
1024 | 39.199 | 7.785 | 7.707 | 22.643 |
2048 | - | 13.462 | 9.323 | - |
4096 | - | 27.137 | - | - |
Got some time on opal today and here are the results with the newest Slurm package (flux-core from @garlick's `keyval2` branch):
Scale | garlick/flux-core@3f781fcae80673a13c2123205fc354c2d8d2e2ec | no resource-hwloc |
---|---|---|
128 | 0m5.569s | 0m3.680s |
512 | 0m11.108s | 0m4.732s |
1024 | 0m22.348s | 0m8.105s |
2048 | 0m46.115s | 0m9.844s |
4096 | 2m23.918s | 0m14.142s |
It took several runs to get 4096 brokers on 128 nodes without hitting flux-framework/flux-core#683.
To help determine what was holding up the larger runs, I added `set -x` to `etc/rc{1,3}`; the trace from a 4096-broker job made it apparent that `resource-hwloc` initialization is the issue:
grondo@opal34:~/git/flux-core.git$ time srun -N128 -n4096 src/cmd/flux start /bin/true
2016-07-01T22:32:47.255958Z broker.err[0]: rc1: + flux module load -r all barrier
2016-07-01T22:32:48.065706Z broker.err[0]: rc1: + flux module load -r 0 content-sqlite
2016-07-01T22:32:48.261366Z broker.err[0]: rc1: + flux module load -r 0 kvs
2016-07-01T22:32:48.307912Z broker.err[0]: rc1: + flux module load -r all -x 0 kvs
2016-07-01T22:32:49.090565Z broker.err[0]: rc1: + pids=' 191602'
2016-07-01T22:32:49.090643Z broker.err[0]: rc1: + flux module load -r all job
2016-07-01T22:32:49.100627Z broker.err[0]: rc1: + flux module load -r all resource-hwloc
2016-07-01T22:32:51.434498Z broker.err[0]: rc1: + flux module load -r all wrexec
2016-07-01T22:33:01.483694Z broker.err[0]: rc1: + flux module load -r 0 cron sync=hb
2016-07-01T22:33:02.213812Z broker.err[0]: rc1: + wait 191602
2016-07-01T22:36:27.215606Z broker.err[0]: rc1: ++ cd /g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.218031Z broker.err[0]: rc1: ++ pwd -P
2016-07-01T22:36:27.218047Z broker.err[0]: rc1: + core_dir=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.218438Z broker.err[0]: rc1: + all_dirs=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.218453Z broker.err[0]: rc1: + IFS=:
2016-07-01T22:36:27.218459Z broker.err[0]: rc1: + shopt -s nullglob
2016-07-01T22:36:27.218465Z broker.err[0]: rc1: + for rcdir in '$all_dirs'
2016-07-01T22:36:27.218471Z broker.err[0]: rc1: + shopt -u nullglob
2016-07-01T22:36:27.341774Z broker.err[0]: rc3: ++ cd /g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.341800Z broker.err[0]: rc3: ++ pwd -P
2016-07-01T22:36:27.341808Z broker.err[0]: rc3: + core_dir=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.341987Z broker.err[0]: rc3: + all_dirs=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.341997Z broker.err[0]: rc3: + IFS=:
2016-07-01T22:36:27.342001Z broker.err[0]: rc3: + shopt -s nullglob
2016-07-01T22:36:27.342004Z broker.err[0]: rc3: + for rcdir in '$all_dirs'
2016-07-01T22:36:27.342009Z broker.err[0]: rc3: + shopt -u nullglob
2016-07-01T22:36:27.342014Z broker.err[0]: rc3: + flux module remove -r 0 cron
2016-07-01T22:36:27.461050Z broker.err[0]: rc3: + flux module remove -r all wrexec
2016-07-01T22:36:27.972383Z broker.err[0]: rc3: + flux module remove -r all job
2016-07-01T22:36:28.683489Z broker.err[0]: rc3: + flux module remove -r all resource-hwloc
2016-07-01T22:36:29.279237Z broker.err[0]: rc3: + flux module remove -r all kvs
2016-07-01T22:36:29.917299Z broker.err[0]: rc3: + flux module remove -r all barrier
2016-07-01T22:36:30.515317Z broker.err[0]: rc3: + flux module remove -r 0 content-sqlite
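For reference, the tracing above only required adding `set -x` near the top of the rc scripts; a minimal sketch (assuming a typical bash rc1 layout, not the exact file contents):

```
#!/bin/bash
# etc/rc1 (sketch): echo each command as it runs so that slow steps stand
# out between timestamps in the rank 0 broker log.
set -x

flux module load -r all barrier
flux module load -r 0 content-sqlite
# ... remaining module loads as shown in the trace above ...
```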
Repeating @grondo's test above, but with newer flux-core on 128 nodes of jade.
This is in a 128-node salloc, after throwing out the first scale=128 job on a cold cache:
Scale | 0.4.1 | no resource-hwloc |
---|---|---|
128 | 0m3.339s | 0m3.148s |
512 | 0m4.274s | 0m3.563s |
1024 | 0m8.480s | 0m7.087s |
2048 | 0m14.707s | 0m9.459s |
4096 | 0m32.517s | 0m11.103s |
Here's the same test on 2048 nodes of jade.
Scale | 0.4.1 | no resource-hwloc |
---|---|---|
2048 | 0m18.474s | 0m13.924s |
4096 | 0m27.436s | 0m23.976s |
8192 | 0m44.178s | 0m25.363s |
16384 | 1m59.902s | 0m45.238s |
32768 | 10m15.722s | 1m17.873s |
65536 | (>30m, cancelled) | 2m18.592s |
131072 | (not attempted) | 5m5.436s |
262144 | (not attempted) | (segfault, see below) |
Notes: at 16K and beyond, there were lots of messages like this during wireup:
2016-08-12T19:10:20.682442Z broker.crit[10528]: child 21058 idle for 3 heartbeats
I (insanely) tried to run 256K tasks and got
$ time srun --overcommit -N2048 -n262144 src/cmd/flux start /bin/true
srun: error: jade2244: tasks 149460,151508,153556,157652,159700,161748,163796,165844,167892,169940,171988,174036,176084,178132,180180,182228,184276,188372,190420,192468,194516,196564,198612,200660,202708,204756,208852,212948,214996,217044,219092,223188,225236,227284,229332,231380,233428,235476,237524,239572,241620,243668,245716,247764,249812,251860,253908,255956,258004,260052,262100: Segmentation fault
srun: error: jade1308: tasks 148562,150610,152658,156754,158802,160850,162898,164946,166994,169042,171090,173138,175186,177234,179282,181330,183378,187474,189522,191570,193618,195666,197714,199762,201810,203858,207954,212050,214098,216146,218194,222290,224338,226386,228434,230482,232530,234578,236626,238674,240722,242770,244818,246866,248914,250962,253010,255058,257106,259154,261202: Segmentation fault
srun: error: jade269: tasks 147709,149757,151805,155901,157949,159997,162045,164093,166141,168189,170237,172285,174333,176381,178429,180477,182525,186621,188669,190717,192765,194813,196861,198909,200957,203005,207101,211197,213245,215293,217341,221437,223485,225533,227581,229629,231677,233725,235773,237821,239869,241917,243965,246013,248061,250109,252157,254205,256253,258301,260349: Segmentation fault
[snip - same from many nodes]
...
This occurred within a few seconds of startup. No cores to see what might be segfaulting.
Incidentally, this is with `zeromq-4.1.5-2.ch6.x86_64`, which has the fix for the getifaddrs assertion we were hitting in flux-framework/flux-core#683.
I did get a chance to run 256K with `--mpibind=off` and got an odd failure. My salloc simply exited without a message to indicate why. I assumed at the time that it must have been the half-hour time limit imposed on the partition, although in retrospect it seems like that should have resulted in a notification of some kind. Will retry later.
If 1187130 was the job, it did indeed reach timeout:
$ sacct -j 1187130 -o $SACCT_FORMAT
JobID      User     JobName  Account  Cluster  Partition  QOS     NNodes  State      Start                Elapsed   ExitCode
1187130    garlick  sh       lc       jade     pbatch     normal  2K      TIMEOUT    2016-08-12T14:27:57  00:30:00  0:1
1187130.0           bash     lc       jade                        1       CANCELLED  2016-08-12T14:27:57  00:30:32  0:9
1187130.1           flux     lc       jade                        2K      COMPLETED  2016-08-12T14:29:51  00:00:24  0:0
1187130.2           flux     lc       jade                        2K      COMPLETED  2016-08-12T14:30:17  00:00:15  0:0
1187130.3           flux     lc       jade                        2K      COMPLETED  2016-08-12T14:30:49  00:18:48  0:0
Thanks for that insight @lipari.
By the way, I just grabbed 1024 nodes on jade and was able to run 131072 ranks (128 tasks per node) with `--overcommit` and default mpibind. No crashes.
Here's some data from before jade became unavailable due to Lustre problems; resource-hwloc was disabled.
time srun [-O] -N1024 -n$scale ./flux start -o,-Slog-stderr-level=6 flux getattr size
Scale | total | pmi | wireup | rc1 |
---|---|---|---|---|
2048 | 0m12.176s | 0.4s | 1.2s | 1.7s |
4096 | 0m14.904s | 0.6s | 1.3s | 3.8s |
8192 | 0m21.969s | 1.5s | 2.6s | 6.8s |
16384 | 0m31.908s | 3.2s | 4.2s | 12.4s |
32768 | 11m3.476s | 11.8s | 14.3s | 25.5s |
65536 | 2m14.940s | 45.3s | 24.3s | 56.7s |
131072 | 4m34.045s | 75.1s | 52.2s | 113.3s |
This was using more or less flux-core v0.4.1-26-gb8249b4.
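The pmi, wireup, and rc1 columns were presumably taken from the rank 0 broker's `-Slog-stderr-level=6` output; a rough sketch of pulling them out of a saved run (assuming stderr is captured to a file and the log lines look like the excerpts later in this issue):

```
# Capture broker logs, then pick out the per-phase lines:
#   "pmi: bootstrap time ..."        -> pmi
#   "nodeset: [...] (complete) ..."  -> wireup
#   "Run level 1 Exited ..."         -> rc1
log=flux-start.log
time srun -N1024 -n$scale ./flux start -o,-Slog-stderr-level=6 \
    flux getattr size 2>"$log"
grep -E 'pmi: bootstrap time|\(complete\)|Run level 1 Exited' "$log"
```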
Incidentally when I try to run 262144 on 1024 nodes (256 procs per node), with overcommit, I get
time srun --overcommit -N1024 -n262144 ./flux start -o,-Slog-stderr-level=6 flux getattr size
srun: error: Unable to create job step: Task count specification invalid
I can't see anything in the slurm.conf that limits this.
grondo@opal186:~$ grep MAX_TASKS /usr/include/slurm/*
/usr/include/slurm/slurm.h:#define MAX_TASKS_PER_NODE 128
Hey now, no need to thumbs down my comment. ;-)
It does seem like that might be a default value for the slurm.conf parameter `MaxTasksPerNode`. Maybe we can get someone to attempt to bump that up?

MaxTasksPerNode
Maximum number of tasks SLURM will allow a job step to spawn on a single node. The default MaxTasksPerNode is 128.
Just came out of a meeting or I would have provided a more timely response. Yes, MAX_TASKS_PER_NODE is compiled in. I'll have to create a new version of Slurm. I'm inclined to take a conservative approach and go with 72 * 4 = 288.
You might comment in TOSS-3371. Trent just came by and said the interconnect would run out of contexts before 256 tasks, so he's worried about people hitting that (not us, alas, as we are not using the interconnect). Maybe a non-starter, in other words.
I was able to salloc 2048 nodes for that last data point (128 tasks per node):
Scale | total | pmi | wireup | rc1 |
---|---|---|---|---|
262144 | 11m36.892s | 277.2s | 176.0s | 216.3s |
New build just completed: Package: slurm-2.3.3-1.27chaos.ch6. MAX_TASKS_PER_NODE is now 288.
Thanks @lipari!
I think we've nailed the first exit criterion for this task. This is on 2048 nodes, but I ran 2304 tasks per the description above. With resource-hwloc enabled:
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:39:11.869694Z broker.info[0]: pmi: bootstrap time 0.6s
2016-08-16T21:39:13.422545Z broker.info[0]: nodeset: [0-2303] (complete) 1.5s
2016-08-16T21:39:13.422563Z broker.info[0]: Run level 1 starting
2016-08-16T21:39:25.309520Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:39:25.310843Z broker.info[0]: Run level 1 Exited (rc=0) 11.9s
2016-08-16T21:39:25.310860Z broker.info[0]: Run level 2 starting
2016-08-16T21:39:25.329017Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:39:25.329136Z broker.info[0]: Run level 3 starting
2016-08-16T21:39:25.340499Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:39:26.507580Z broker.info[0]: Run level 3 Exited (rc=0) 1.2s
real 0m26.119s
user 0m1.205s
sys 0m1.298s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:39:43.322640Z broker.info[0]: pmi: bootstrap time 0.5s
2016-08-16T21:39:43.840434Z broker.info[0]: nodeset: [0-2303] (complete) 0.5s
2016-08-16T21:39:43.840451Z broker.info[0]: Run level 1 starting
2016-08-16T21:39:47.457799Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:39:47.919774Z broker.info[0]: Run level 1 Exited (rc=0) 4.1s
2016-08-16T21:39:47.919796Z broker.info[0]: Run level 2 starting
2016-08-16T21:39:47.935860Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:39:47.935935Z broker.info[0]: Run level 3 starting
2016-08-16T21:39:47.945905Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:39:49.041318Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m16.860s
user 0m1.164s
sys 0m1.348s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:40:05.980638Z broker.info[0]: pmi: bootstrap time 0.2s
2016-08-16T21:40:07.289600Z broker.info[0]: nodeset: [0-2303] (complete) 1.3s
2016-08-16T21:40:07.289614Z broker.info[0]: Run level 1 starting
2016-08-16T21:40:10.879688Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:40:11.216868Z broker.info[0]: Run level 1 Exited (rc=0) 3.9s
2016-08-16T21:40:11.216888Z broker.info[0]: Run level 2 starting
2016-08-16T21:40:11.233356Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:40:11.233428Z broker.info[0]: Run level 3 starting
2016-08-16T21:40:11.243733Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:40:12.340431Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m17.440s
user 0m1.278s
sys 0m1.315s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:40:25.446893Z broker.info[0]: pmi: bootstrap time 0.3s
2016-08-16T21:40:26.218384Z broker.info[0]: nodeset: [0-2303] (complete) 0.8s
2016-08-16T21:40:26.218400Z broker.info[0]: Run level 1 starting
2016-08-16T21:40:31.118952Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:40:31.604560Z broker.info[0]: Run level 1 Exited (rc=0) 5.4s
2016-08-16T21:40:31.604581Z broker.info[0]: Run level 2 starting
2016-08-16T21:40:31.621036Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:40:31.621106Z broker.info[0]: Run level 3 starting
2016-08-16T21:40:31.631618Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:40:32.727289Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m18.490s
user 0m1.199s
sys 0m1.392s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:40:46.181981Z broker.info[0]: pmi: bootstrap time 0.3s
2016-08-16T21:40:46.736243Z broker.info[0]: nodeset: [0-2303] (complete) 0.5s
2016-08-16T21:40:46.736260Z broker.info[0]: Run level 1 starting
2016-08-16T21:40:50.262221Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:40:50.672184Z broker.info[0]: Run level 1 Exited (rc=0) 3.9s
2016-08-16T21:40:50.672206Z broker.info[0]: Run level 2 starting
2016-08-16T21:40:50.689402Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:40:50.689471Z broker.info[0]: Run level 3 starting
2016-08-16T21:40:50.699951Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:40:51.815295Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m16.750s
user 0m1.201s
sys 0m1.341s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:41:10.325775Z broker.info[0]: pmi: bootstrap time 0.2s
2016-08-16T21:41:11.585263Z broker.info[0]: nodeset: [0-2303] (complete) 1.2s
2016-08-16T21:41:11.585279Z broker.info[0]: Run level 1 starting
2016-08-16T21:41:15.128463Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:41:15.541739Z broker.info[0]: Run level 1 Exited (rc=0) 4.0s
2016-08-16T21:41:15.541760Z broker.info[0]: Run level 2 starting
2016-08-16T21:41:15.557939Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:41:15.558012Z broker.info[0]: Run level 3 starting
2016-08-16T21:41:15.568516Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:41:16.651927Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m17.385s
user 0m1.242s
sys 0m1.258s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:42:05.842650Z broker.info[0]: pmi: bootstrap time 0.5s
2016-08-16T21:42:06.743197Z broker.info[0]: nodeset: [0-2303] (complete) 0.9s
2016-08-16T21:42:06.743213Z broker.info[0]: Run level 1 starting
2016-08-16T21:42:11.770012Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:42:12.237816Z broker.info[0]: Run level 1 Exited (rc=0) 5.5s
2016-08-16T21:42:12.237836Z broker.info[0]: Run level 2 starting
2016-08-16T21:42:12.254588Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:42:12.254657Z broker.info[0]: Run level 3 starting
2016-08-16T21:42:12.265252Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:42:13.366483Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m18.977s
user 0m1.133s
sys 0m1.392s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:47:13.532172Z broker.info[0]: pmi: bootstrap time 0.7s
2016-08-16T21:47:15.282076Z broker.info[0]: nodeset: [0-2303] (complete) 1.7s
2016-08-16T21:47:15.282089Z broker.info[0]: Run level 1 starting
2016-08-16T21:47:20.390378Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:47:20.864776Z broker.info[0]: Run level 1 Exited (rc=0) 5.6s
2016-08-16T21:47:20.864799Z broker.info[0]: Run level 2 starting
2016-08-16T21:47:20.881854Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:47:20.881922Z broker.info[0]: Run level 3 starting
2016-08-16T21:47:20.892905Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:47:22.054511Z broker.info[0]: Run level 3 Exited (rc=0) 1.2s
real 0m20.302s
user 0m1.212s
sys 0m1.379s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:47:57.932496Z broker.info[0]: pmi: bootstrap time 0.4s
2016-08-16T21:47:58.445418Z broker.info[0]: nodeset: [0-2303] (complete) 0.5s
2016-08-16T21:47:58.445437Z broker.info[0]: Run level 1 starting
2016-08-16T21:48:01.979773Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:48:02.528096Z broker.info[0]: Run level 1 Exited (rc=0) 4.1s
2016-08-16T21:48:02.528118Z broker.info[0]: Run level 2 starting
2016-08-16T21:48:02.545963Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:48:02.546035Z broker.info[0]: Run level 3 starting
2016-08-16T21:48:02.557085Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:48:03.659512Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m16.614s
user 0m1.221s
sys 0m1.269s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:48:16.124555Z broker.info[0]: pmi: bootstrap time 0.8s
2016-08-16T21:48:17.781674Z broker.info[0]: nodeset: [0-2303] (complete) 1.6s
2016-08-16T21:48:17.781691Z broker.info[0]: Run level 1 starting
2016-08-16T21:48:22.678567Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:48:23.170692Z broker.info[0]: Run level 1 Exited (rc=0) 5.4s
2016-08-16T21:48:23.170712Z broker.info[0]: Run level 2 starting
2016-08-16T21:48:23.187714Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:48:23.187783Z broker.info[0]: Run level 3 starting
2016-08-16T21:48:23.198795Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:48:24.293465Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m19.871s
user 0m1.235s
sys 0m1.392s
Oops, one more run so we can throw out the numbers from the first cold-cache run above and still have 10 runs. None of these took more than 21s to complete.
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:50:06.053167Z broker.info[0]: pmi: bootstrap time 0.7s
2016-08-16T21:50:07.782006Z broker.info[0]: nodeset: [0-2303] (complete) 1.7s
2016-08-16T21:50:07.782023Z broker.info[0]: Run level 1 starting
2016-08-16T21:50:12.869093Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:50:13.347550Z broker.info[0]: Run level 1 Exited (rc=0) 5.6s
2016-08-16T21:50:13.347572Z broker.info[0]: Run level 2 starting
2016-08-16T21:50:13.364337Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:50:13.364410Z broker.info[0]: Run level 3 starting
2016-08-16T21:50:13.375358Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:50:14.814931Z broker.info[0]: Run level 3 Exited (rc=0) 1.4s
real 0m20.255s
user 0m1.226s
sys 0m1.337s
Here's a baseline:
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 2.094s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.517s
real 0m9.129s
user 0m1.228s
sys 0m1.333s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.107s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.173s
real 0m7.666s
user 0m1.224s
sys 0m1.337s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.134s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.495s
real 0m8.016s
user 0m1.220s
sys 0m1.303s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.119s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.175s
real 0m7.675s
user 0m1.179s
sys 0m1.405s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.161s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.331s
real 0m7.864s
user 0m1.288s
sys 0m1.325s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.142s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.173s
real 0m7.685s
user 0m1.255s
sys 0m1.263s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.198s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.173s
real 0m7.758s
user 0m1.240s
sys 0m1.373s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.152s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.174s
real 0m13.695s
user 0m1.257s
sys 0m1.315s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.091s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.176s
real 0m7.657s
user 0m1.207s
sys 0m1.383s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.173s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.176s
real 0m7.720s
user 0m1.273s
sys 0m1.368s
Here's a small scaling study launching flux within an allocation of 2048 nodes, with procs varying from 192 to 3071. Times (seconds) are averages over 5 runs. The "hello" column is a baseline run of our MPI hello world program.
Scale | total | pmi | wireup | rc1 | hello |
---|---|---|---|---|---|
192 | 3.95 | 0.1 | 0.0 | 0.90 | 1.76 |
384 | 4.73 | 0.2 | 0.2 | 0.94 | 2.33 |
768 | 6.88 | 0.3 | 0.4 | 1.6 | 3.51 |
1536 | 14.38 | 0.3 | 0.9 | 2.98 | 5.81 |
3071 | 19.45 | 0.4 | 1.2 | 5.72 | 8.42 |
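For what it's worth, a sweep like that is easy to script; a minimal sketch that averages wall-clock time over 5 runs per scale (scales, run count, and paths are illustrative, not the exact script used):

```
#!/bin/bash
# Inside the salloc: time "flux start /bin/true" 5 times at each scale
# and print the average wall-clock seconds.
for n in 192 384 768 1536 3071; do
    total=0
    for i in 1 2 3 4 5; do
        start=$(date +%s.%N)
        srun -n$n ./flux start /bin/true
        end=$(date +%s.%N)
        total=$(echo "$total + ($end - $start)" | bc -l)
    done
    echo "ntasks=$n avg=$(echo "$total / 5" | bc -l)s"
done
```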
I think this component of the milestone is complete. I'll leave this open through the rest of the CTS-1 testing period.
_Goals_
Determine Flux startup/teardown scalability. Find any bugs revealed at scale.
_Methodology_
Gather baseline SLURM job launch data, so SLURM overhead can be subtracted from Flux times. Use an MPI program that bootstraps with PMI, as Flux does. This SLURM data may also be useful for comparison with Flux job launch data (not covered here). Get enough runs to extract valid statistics. Run at various scales, e.g. 1, 2, 4, 8, 12 scalable units (one task per node).
Instrument and time the following phases of Flux startup/teardown:
Run through the same scales as the baseline, and run enough to extract valid statistics. Log issues for all anomalies, linked from #12.
Extra credit: continue baseline and Flux scaling to one task per core.
Publish baseline and Flux statistics in this issue.
_Resource requirements_
Ramp up, beginning on opal (1 SU, 192 nodes, 6912 cores), to the full CTS-1 system.
_Exit criteria_
srun -N2304 flux start /bin/true
_Issues to watch out for_
`flux module load -r all` is called serially in rc1 (and the converse, module removal, in rc3), and seems not to scale well (more info needed).