garlick opened this issue 8 years ago
Quick opal test:
$ time srun -N8 ./flux broker flux getattr size
8
real 0m1.866s
user 0m0.016s
sys 0m0.030s
$ time srun -N16 ./flux broker flux getattr size
16
real 0m2.568s
user 0m0.026s
sys 0m0.043s
$ time srun -N32 ./flux broker flux getattr size
32
real 0m2.859s
user 0m0.039s
sys 0m0.079s
$ time srun -N64 ./flux broker flux getattr size
64
real 0m11.841s
user 0m0.058s
sys 0m0.098s
$ time srun -N128 ./flux broker flux getattr size
128
real 0m7.147s
user 0m0.094s
sys 0m0.194s
$ time srun -N128 -n256 ./flux broker flux getattr size
256
real 0m9.238s
user 0m0.114s
sys 0m0.187s
$ time srun -N128 -n512 ./flux broker flux getattr size
512
real 0m16.258s
user 0m0.127s
sys 0m0.220s
$ time srun -N128 -n1024 ./flux broker flux getattr size
1024
real 0m39.199s
user 0m0.175s
sys 0m0.326s
(got core dump at size 2048 - see flux-framework/flux-core#683)
Baseline with `mpi_hello`, compiled with Intel MPI, which uses PMI:
$ time srun -N 8 ./hello
0: completed MPI_Init in 0.163s. There are 8 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m2.443s
user 0m0.033s
sys 0m0.047s
$ time srun -N 16 ./hello
0: completed MPI_Init in 0.671s. There are 16 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m3.077s
user 0m0.047s
sys 0m0.081s
$ time srun -N 32 ./hello
0: completed MPI_Init in 0.973s. There are 32 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m3.219s
user 0m0.107s
sys 0m0.130s
$ time srun -N 64 ./hello
0: completed MPI_Init in 1.376s. There are 64 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 2.004s
real 0m3.685s
user 0m0.182s
sys 0m0.193s
$ time srun -N 128 ./hello
0: completed MPI_Init in 2.320s. There are 128 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 1.004s
real 0m4.962s
user 0m0.265s
sys 0m0.367s
$ time srun -N 128 -n 256 ./hello
0: completed MPI_Init in 1.404s. There are 256 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 2.005s
real 0m4.205s
user 0m0.351s
sys 0m0.440s
$ time srun -N 128 -n 512 ./hello
0: completed MPI_Init in 2.408s. There are 512 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 2.007s
real 0m5.197s
user 0m0.396s
sys 0m0.757s
$ time srun -N 128 -n 1024 ./hello
0: completed MPI_Init in 4.480s. There are 1024 tasks
0: completed first barrier in 0.001s
0: completed MPI_Finalize in 2.043s
real 0m7.785s
user 0m0.470s
sys 0m0.984s
$ time srun -N 128 -n 2048 ./hello
0: completed MPI_Init in 7.954s. There are 2048 tasks
0: completed first barrier in 0.001s
0: completed MPI_Finalize in 4.245s
real 0m13.462s
user 0m0.732s
sys 0m1.564s
$ time srun -N 128 -n 4096 ./hello
0: completed MPI_Init in 17.283s. There are 4096 tasks
0: completed first barrier in 0.002s
0: completed MPI_Finalize in 8.338s
real 0m27.137s
user 0m1.431s
sys 0m3.099s
Now numbers with rc1/rc3 disabled:
$ export FLUX_RC1_PATH=""
$ export FLUX_RC3_PATH=""
$ time srun -N8 ./flux broker flux getattr size
8
real 0m1.103s
user 0m0.019s
sys 0m0.034s
$ time srun -N16 ./flux broker flux getattr size
16
real 0m1.796s
user 0m0.026s
sys 0m0.052s
$ time srun -N32 ./flux broker flux getattr size
32
real 0m1.651s
user 0m0.054s
sys 0m0.062s
$ time srun -N64 ./flux broker flux getattr size
64
real 0m1.836s
user 0m0.048s
sys 0m0.093s
$ time srun -N128 ./flux broker flux getattr size
128
real 0m3.825s
user 0m0.108s
sys 0m0.159s
$ time srun -N128 -n256 ./flux broker flux getattr size
256
real 0m3.438s
user 0m0.123s
sys 0m0.182s
$ time srun -N128 -n512 ./flux broker flux getattr size
512
real 0m3.885s
user 0m0.141s
sys 0m0.236s
$ time srun -N128 -n1024 ./flux broker flux getattr size
1024
real 0m7.707s
user 0m0.221s
sys 0m0.481s
$ time srun -N128 -n2048 ./flux broker flux getattr size
2048
real 0m9.323s
user 0m0.329s
sys 0m0.859s
(core dump at size=4096 - see flux-framework/flux-core#683)
Conclusions:
I wonder if `perf timechart` run on the broker 0 node would offer any insight as a first cut? I was able to run `perf timechart record su - grondo -- /path/to/flux start --size=8 flux COMMAND` followed by `perf timechart` to get a chart, but I couldn't glean much from it right away. On a TOSS 2 system I couldn't get the `-g` option (to resolve symbols) to work.
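For reference, that workflow is just two steps; a minimal sketch (the flux path, instance size, and command are placeholders for whatever is being profiled):

```
# On the rank 0 broker node: record scheduler/CPU activity while a small
# flux instance starts, runs a command, and shuts down.
perf timechart record /path/to/flux start --size=8 flux getattr size

# Render the recorded perf.data into an SVG timechart (output.svg by default).
perf timechart
```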
Here are a few more data points on opal, with rc1/rc3 edited to drop resource-hwloc. That doesn't improve things as much as one might hope!
bash-4.2$ time srun -N8 ./flux start flux getattr size
8
real 0m1.917s
user 0m0.017s
sys 0m0.029s
bash-4.2$ time srun -N16 ./flux start flux getattr size
16
real 0m2.143s
user 0m0.025s
sys 0m0.043s
bash-4.2$ time srun -N32 ./flux start flux getattr size
32
real 0m2.281s
user 0m0.045s
sys 0m0.050s
bash-4.2$ time srun -N64 ./flux start flux getattr size
64
real 0m2.636s
user 0m0.054s
sys 0m0.091s
bash-4.2$ time srun -N128 ./flux start flux getattr size
128
real 0m4.758s
user 0m0.117s
sys 0m0.153s
bash-4.2$ time srun -N128 -n256 ./flux start flux getattr size
256
real 0m6.550s
user 0m0.115s
sys 0m0.151s
bash-4.2$ time srun -N128 -n512 ./flux start flux getattr size
512
real 0m12.231s
user 0m0.134s
sys 0m0.203s
bash-4.2$ time srun -N128 -n1024 ./flux start flux getattr size
[1464387480.415573] broker.crit[409]: child 819 idle for 3 heartbeats
[1464387480.415660] broker.crit[411]: child 824 idle for 3 heartbeats
[1464387480.415641] broker.crit[401]: child 803 idle for 3 heartbeats
[1464387480.415668] broker.crit[439]: child 879 idle for 3 heartbeats
[1464387480.415739] broker.crit[376]: child 753 idle for 3 heartbeats
[1464387480.415772] broker.crit[438]: child 878 idle for 3 heartbeats
[1464387480.415849] broker.crit[438]: child 877 idle for 3 heartbeats
[1464387480.415952] broker.crit[424]: child 849 idle for 3 heartbeats
1024
real 0m34.817s
user 0m0.179s
sys 0m0.298s
TBON fanout, as noted before, seems to have very little impact here:
bash-4.2$ time srun -N128 -n1024 ./flux start -o,--k-ary=8 flux getattr size
1024
real 0m34.734s
user 0m0.176s
sys 0m0.309s
bash-4.2$ time srun -N128 -n1024 ./flux start -o,--k-ary=32 flux getattr size
1024
real 0m34.993s
user 0m0.170s
sys 0m0.324s
Here's a summary of results (above) so far, plus another baseline on current master (rightmost column).
This latest baseline was within a 64-node salloc (with a warm page cache) instead of 128 nodes like before, so twice as many ranks per node were run. Times are "real" time in seconds. The "job" was, e.g.:
$ time srun -N64 -n1024 ./flux start /bin/true
Scale | Baseline | MPI hello | rc disabled | flux-framework/flux-core@2bde1f86750c239a6884c47b2c7ff4db0543d057 |
---|---|---|---|---|
128 | 7.147 | 4.962 | 3.825 | 4.851 |
256 | 9.238 | 4.205 | 3.438 | 6.433 |
512 | 16.258 | 5.197 | 3.885 | 10.168 |
1024 | 39.199 | 7.785 | 7.707 | 22.643 |
2048 | - | 13.462 | 9.323 | - |
4096 | - | 27.137 | - | - |
Got some time on opal today and here are the results with the newest Slurm package (flux-core from @garlick's `keyval2` branch):
Scale | garlick/flux-core@3f781fcae80673a13c2123205fc354c2d8d2e2ec | no resource-hwloc |
---|---|---|
128 | 0m5.569s | 0m3.680s |
512 | 0m11.108s | 0m4.732s |
1024 | 0m22.348s | 0m8.105s |
2048 | 0m46.115s | 0m9.844s |
4096 | 2m23.918s | 0m14.142s |
It took several runs to get 4096 brokers on 128 nodes without hitting flux-framework/flux-core#683.
To help determine what was holding up the larger runs, I added `set -x` to `etc/rc{1,3}`; the trace from a 4096-broker job made it apparent that `resource-hwloc` initialization is the issue:
grondo@opal34:~/git/flux-core.git$ time srun -N128 -n4096 src/cmd/flux start /bin/true
2016-07-01T22:32:47.255958Z broker.err[0]: rc1: + flux module load -r all barrier
2016-07-01T22:32:48.065706Z broker.err[0]: rc1: + flux module load -r 0 content-sqlite
2016-07-01T22:32:48.261366Z broker.err[0]: rc1: + flux module load -r 0 kvs
2016-07-01T22:32:48.307912Z broker.err[0]: rc1: + flux module load -r all -x 0 kvs
2016-07-01T22:32:49.090565Z broker.err[0]: rc1: + pids=' 191602'
2016-07-01T22:32:49.090643Z broker.err[0]: rc1: + flux module load -r all job
2016-07-01T22:32:49.100627Z broker.err[0]: rc1: + flux module load -r all resource-hwloc
2016-07-01T22:32:51.434498Z broker.err[0]: rc1: + flux module load -r all wrexec
2016-07-01T22:33:01.483694Z broker.err[0]: rc1: + flux module load -r 0 cron sync=hb
2016-07-01T22:33:02.213812Z broker.err[0]: rc1: + wait 191602
2016-07-01T22:36:27.215606Z broker.err[0]: rc1: ++ cd /g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.218031Z broker.err[0]: rc1: ++ pwd -P
2016-07-01T22:36:27.218047Z broker.err[0]: rc1: + core_dir=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.218438Z broker.err[0]: rc1: + all_dirs=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.218453Z broker.err[0]: rc1: + IFS=:
2016-07-01T22:36:27.218459Z broker.err[0]: rc1: + shopt -s nullglob
2016-07-01T22:36:27.218465Z broker.err[0]: rc1: + for rcdir in '$all_dirs'
2016-07-01T22:36:27.218471Z broker.err[0]: rc1: + shopt -u nullglob
2016-07-01T22:36:27.341774Z broker.err[0]: rc3: ++ cd /g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.341800Z broker.err[0]: rc3: ++ pwd -P
2016-07-01T22:36:27.341808Z broker.err[0]: rc3: + core_dir=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.341987Z broker.err[0]: rc3: + all_dirs=/g/g0/grondo/git/flux-core.git/etc
2016-07-01T22:36:27.341997Z broker.err[0]: rc3: + IFS=:
2016-07-01T22:36:27.342001Z broker.err[0]: rc3: + shopt -s nullglob
2016-07-01T22:36:27.342004Z broker.err[0]: rc3: + for rcdir in '$all_dirs'
2016-07-01T22:36:27.342009Z broker.err[0]: rc3: + shopt -u nullglob
2016-07-01T22:36:27.342014Z broker.err[0]: rc3: + flux module remove -r 0 cron
2016-07-01T22:36:27.461050Z broker.err[0]: rc3: + flux module remove -r all wrexec
2016-07-01T22:36:27.972383Z broker.err[0]: rc3: + flux module remove -r all job
2016-07-01T22:36:28.683489Z broker.err[0]: rc3: + flux module remove -r all resource-hwloc
2016-07-01T22:36:29.279237Z broker.err[0]: rc3: + flux module remove -r all kvs
2016-07-01T22:36:29.917299Z broker.err[0]: rc3: + flux module remove -r all barrier
2016-07-01T22:36:30.515317Z broker.err[0]: rc3: + flux module remove -r 0 content-sqlite
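For reference, the tracing above only required adding `set -x` near the top of the rc scripts; a minimal sketch (assuming a typical bash rc1 layout, not the exact file contents):

```
#!/bin/bash
# etc/rc1 (sketch): echo each command as it runs so that slow steps stand
# out between timestamps in the rank 0 broker log.
set -x

flux module load -r all barrier
flux module load -r 0 content-sqlite
# ... remaining module loads as shown in the trace above ...
```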
Repeating @grondo's test above, but with newer flux-core on 128 nodes of jade.
This is in a 128-node salloc, after throwing out the first scale=128 job on a cold cache:
Scale | 0.4.1 | no resource-hwloc |
---|---|---|
128 | 0m3.339s | 0m3.148s |
512 | 0m4.274s | 0m3.563s |
1024 | 0m8.480s | 0m7.087s |
2048 | 0m14.707s | 0m9.459s |
4096 | 0m32.517s | 0m11.103s |
Here's the same test on 2048 nodes of jade.
Scale | 0.4.1 | no resource-hwloc |
---|---|---|
2048 | 0m18.474s | 0m13.924s |
4096 | 0m27.436s | 0m23.976s |
8192 | 0m44.178s | 0m25.363s |
16384 | 1m59.902s | 0m45.238s |
32768 | 10m15.722s | 1m17.873s |
65536 | (>30m, cancelled) | 2m18.592s |
131072 | (not attempted) | 5m5.436s |
262144 | (not attempted) | (segfault, see below) |
Notes: at 16K and beyond, there were lots of messages like this during wireup:
2016-08-12T19:10:20.682442Z broker.crit[10528]: child 21058 idle for 3 heartbeats
I (insanely) tried to run 256K tasks and got
$ time srun --overcommit -N2048 -n262144 src/cmd/flux start /bin/true
srun: error: jade2244: tasks 149460,151508,153556,157652,159700,161748,163796,165844,167892,169940,171988,174036,176084,178132,180180,182228,184276,188372,190420,192468,194516,196564,198612,200660,202708,204756,208852,212948,214996,217044,219092,223188,225236,227284,229332,231380,233428,235476,237524,239572,241620,243668,245716,247764,249812,251860,253908,255956,258004,260052,262100: Segmentation fault
srun: error: jade1308: tasks 148562,150610,152658,156754,158802,160850,162898,164946,166994,169042,171090,173138,175186,177234,179282,181330,183378,187474,189522,191570,193618,195666,197714,199762,201810,203858,207954,212050,214098,216146,218194,222290,224338,226386,228434,230482,232530,234578,236626,238674,240722,242770,244818,246866,248914,250962,253010,255058,257106,259154,261202: Segmentation fault
srun: error: jade269: tasks 147709,149757,151805,155901,157949,159997,162045,164093,166141,168189,170237,172285,174333,176381,178429,180477,182525,186621,188669,190717,192765,194813,196861,198909,200957,203005,207101,211197,213245,215293,217341,221437,223485,225533,227581,229629,231677,233725,235773,237821,239869,241917,243965,246013,248061,250109,252157,254205,256253,258301,260349: Segmentation fault
[snip - same from many nodes]
...
This occurred within a few seconds of startup. No cores to see what might be segfaulting.
Incidentally, this is with `zeromq-4.1.5-2.ch6.x86_64`, which has the fix for the getifaddrs assertion we were hitting in flux-framework/flux-core#683.
I did get a chance to run 256K with `--mpibind=off` and got an odd failure. My salloc simply exited without a message to indicate why. I assumed at the time that it must have been the half-hour time limit imposed on the partition, although in retrospect it seems like that should have resulted in a notification of some kind. Will retry later.
If 1187130 was the job, it did indeed reach timeout:
$ sacct -j 1187130 -o $SACCT_FORMAT
JobID      User     JobName  Account  Cluster  Partition  QOS     NNodes  State      Start                Elapsed   ExitCode
1187130    garlick  sh       lc       jade     pbatch     normal  2K      TIMEOUT    2016-08-12T14:27:57  00:30:00  0:1
1187130.0           bash     lc       jade                        1       CANCELLED  2016-08-12T14:27:57  00:30:32  0:9
1187130.1           flux     lc       jade                        2K      COMPLETED  2016-08-12T14:29:51  00:00:24  0:0
1187130.2           flux     lc       jade                        2K      COMPLETED  2016-08-12T14:30:17  00:00:15  0:0
1187130.3           flux     lc       jade                        2K      COMPLETED  2016-08-12T14:30:49  00:18:48  0:0
Thanks for that insight @lipari.
By the way, I just grabbed 1024 nodes on jade and was able to run 131072 ranks (128 tasks per node) with `--overcommit` and default mpibind. No crashes.
Here's some data from before jade became unavailable due to Lustre problems; resource-hwloc was disabled.
time srun [-O] -N1024 -n$scale ./flux start -o,-Slog-stderr-level=6 flux getattr size
Scale | total | pmi | wireup | rc1 |
---|---|---|---|---|
2048 | 0m12.176s | 0.4s | 1.2s | 1.7s |
4096 | 0m14.904s | 0.6s | 1.3s | 3.8s |
8192 | 0m21.969s | 1.5s | 2.6s | 6.8s |
16384 | 0m31.908s | 3.2s | 4.2s | 12.4s |
32768 | 11m3.476s | 11.8s | 14.3s | 25.5s |
65536 | 2m14.940s | 45.3s | 24.3s | 56.7s |
131072 | 4m34.045s | 75.1s | 52.2s | 113.3s |
This was using more or less flux-core v0.4.1-26-gb8249b4.
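The pmi, wireup, and rc1 columns were presumably taken from the rank 0 broker's `-Slog-stderr-level=6` output; a rough sketch of pulling them out of a saved run (assuming stderr is captured to a file and the log lines look like the excerpts later in this issue):

```
# Capture broker logs, then pick out the per-phase lines:
#   "pmi: bootstrap time ..."        -> pmi
#   "nodeset: [...] (complete) ..."  -> wireup
#   "Run level 1 Exited ..."         -> rc1
log=flux-start.log
time srun -N1024 -n$scale ./flux start -o,-Slog-stderr-level=6 \
    flux getattr size 2>"$log"
grep -E 'pmi: bootstrap time|\(complete\)|Run level 1 Exited' "$log"
```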
Incidentally when I try to run 262144 on 1024 nodes (256 procs per node), with overcommit, I get
time srun --overcommit -N1024 -n262144 ./flux start -o,-Slog-stderr-level=6 flux getattr size
srun: error: Unable to create job step: Task count specification invalid
I can't see anything in the slurm.conf that limits this.
grondo@opal186:~$ grep MAX_TASKS /usr/include/slurm/*
/usr/include/slurm/slurm.h:#define MAX_TASKS_PER_NODE 128
Hey now, no need to thumbs down my comment. ;-)
It does seem like that might be a default value for the slurm.conf parameter `MaxTasksPerNode`. Maybe we can get someone to attempt to bump that up?

MaxTasksPerNode
Maximum number of tasks SLURM will allow a job step to spawn on a single node. The default MaxTasksPerNode is 128.
Just came out of a meeting or I would have provided a more timely response. Yes, MAX_TASKS_PER_NODE is compiled in. I'll have to create a new version of Slurm. I'm inclined to take a conservative approach and go with 72 * 4 = 288.
You might comment in TOSS-3371. Trent just came by and said the interconnect would run out of contexts before 256 tasks, so he's worried about people hitting that (not us, alas, as we are not using the interconnect). Maybe a non-starter, in other words.
I was able to salloc 2048 nodes for that last data point (128 tasks per node):
Scale | total | pmi | wireup | rc1 |
---|---|---|---|---|
262144 | 11m36.892s | 277.2s | 176.0s | 216.3s |
New build just completed: Package: slurm-2.3.3-1.27chaos.ch6. MAX_TASKS_PER_NODE is now 288.
Thanks @lipari!
I think we've nailed the first exit criterion for this task. This is on 2048 nodes, but I ran 2304 tasks per the description above. With resource-hwloc enabled:
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:39:11.869694Z broker.info[0]: pmi: bootstrap time 0.6s
2016-08-16T21:39:13.422545Z broker.info[0]: nodeset: [0-2303] (complete) 1.5s
2016-08-16T21:39:13.422563Z broker.info[0]: Run level 1 starting
2016-08-16T21:39:25.309520Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:39:25.310843Z broker.info[0]: Run level 1 Exited (rc=0) 11.9s
2016-08-16T21:39:25.310860Z broker.info[0]: Run level 2 starting
2016-08-16T21:39:25.329017Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:39:25.329136Z broker.info[0]: Run level 3 starting
2016-08-16T21:39:25.340499Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:39:26.507580Z broker.info[0]: Run level 3 Exited (rc=0) 1.2s
real 0m26.119s
user 0m1.205s
sys 0m1.298s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:39:43.322640Z broker.info[0]: pmi: bootstrap time 0.5s
2016-08-16T21:39:43.840434Z broker.info[0]: nodeset: [0-2303] (complete) 0.5s
2016-08-16T21:39:43.840451Z broker.info[0]: Run level 1 starting
2016-08-16T21:39:47.457799Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:39:47.919774Z broker.info[0]: Run level 1 Exited (rc=0) 4.1s
2016-08-16T21:39:47.919796Z broker.info[0]: Run level 2 starting
2016-08-16T21:39:47.935860Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:39:47.935935Z broker.info[0]: Run level 3 starting
2016-08-16T21:39:47.945905Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:39:49.041318Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m16.860s
user 0m1.164s
sys 0m1.348s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:40:05.980638Z broker.info[0]: pmi: bootstrap time 0.2s
2016-08-16T21:40:07.289600Z broker.info[0]: nodeset: [0-2303] (complete) 1.3s
2016-08-16T21:40:07.289614Z broker.info[0]: Run level 1 starting
2016-08-16T21:40:10.879688Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:40:11.216868Z broker.info[0]: Run level 1 Exited (rc=0) 3.9s
2016-08-16T21:40:11.216888Z broker.info[0]: Run level 2 starting
2016-08-16T21:40:11.233356Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:40:11.233428Z broker.info[0]: Run level 3 starting
2016-08-16T21:40:11.243733Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:40:12.340431Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m17.440s
user 0m1.278s
sys 0m1.315s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:40:25.446893Z broker.info[0]: pmi: bootstrap time 0.3s
2016-08-16T21:40:26.218384Z broker.info[0]: nodeset: [0-2303] (complete) 0.8s
2016-08-16T21:40:26.218400Z broker.info[0]: Run level 1 starting
2016-08-16T21:40:31.118952Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:40:31.604560Z broker.info[0]: Run level 1 Exited (rc=0) 5.4s
2016-08-16T21:40:31.604581Z broker.info[0]: Run level 2 starting
2016-08-16T21:40:31.621036Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:40:31.621106Z broker.info[0]: Run level 3 starting
2016-08-16T21:40:31.631618Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:40:32.727289Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m18.490s
user 0m1.199s
sys 0m1.392s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:40:46.181981Z broker.info[0]: pmi: bootstrap time 0.3s
2016-08-16T21:40:46.736243Z broker.info[0]: nodeset: [0-2303] (complete) 0.5s
2016-08-16T21:40:46.736260Z broker.info[0]: Run level 1 starting
2016-08-16T21:40:50.262221Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:40:50.672184Z broker.info[0]: Run level 1 Exited (rc=0) 3.9s
2016-08-16T21:40:50.672206Z broker.info[0]: Run level 2 starting
2016-08-16T21:40:50.689402Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:40:50.689471Z broker.info[0]: Run level 3 starting
2016-08-16T21:40:50.699951Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:40:51.815295Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m16.750s
user 0m1.201s
sys 0m1.341s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:41:10.325775Z broker.info[0]: pmi: bootstrap time 0.2s
2016-08-16T21:41:11.585263Z broker.info[0]: nodeset: [0-2303] (complete) 1.2s
2016-08-16T21:41:11.585279Z broker.info[0]: Run level 1 starting
2016-08-16T21:41:15.128463Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:41:15.541739Z broker.info[0]: Run level 1 Exited (rc=0) 4.0s
2016-08-16T21:41:15.541760Z broker.info[0]: Run level 2 starting
2016-08-16T21:41:15.557939Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:41:15.558012Z broker.info[0]: Run level 3 starting
2016-08-16T21:41:15.568516Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:41:16.651927Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m17.385s
user 0m1.242s
sys 0m1.258s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:42:05.842650Z broker.info[0]: pmi: bootstrap time 0.5s
2016-08-16T21:42:06.743197Z broker.info[0]: nodeset: [0-2303] (complete) 0.9s
2016-08-16T21:42:06.743213Z broker.info[0]: Run level 1 starting
2016-08-16T21:42:11.770012Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:42:12.237816Z broker.info[0]: Run level 1 Exited (rc=0) 5.5s
2016-08-16T21:42:12.237836Z broker.info[0]: Run level 2 starting
2016-08-16T21:42:12.254588Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:42:12.254657Z broker.info[0]: Run level 3 starting
2016-08-16T21:42:12.265252Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:42:13.366483Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m18.977s
user 0m1.133s
sys 0m1.392s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:47:13.532172Z broker.info[0]: pmi: bootstrap time 0.7s
2016-08-16T21:47:15.282076Z broker.info[0]: nodeset: [0-2303] (complete) 1.7s
2016-08-16T21:47:15.282089Z broker.info[0]: Run level 1 starting
2016-08-16T21:47:20.390378Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:47:20.864776Z broker.info[0]: Run level 1 Exited (rc=0) 5.6s
2016-08-16T21:47:20.864799Z broker.info[0]: Run level 2 starting
2016-08-16T21:47:20.881854Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:47:20.881922Z broker.info[0]: Run level 3 starting
2016-08-16T21:47:20.892905Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:47:22.054511Z broker.info[0]: Run level 3 Exited (rc=0) 1.2s
real 0m20.302s
user 0m1.212s
sys 0m1.379s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:47:57.932496Z broker.info[0]: pmi: bootstrap time 0.4s
2016-08-16T21:47:58.445418Z broker.info[0]: nodeset: [0-2303] (complete) 0.5s
2016-08-16T21:47:58.445437Z broker.info[0]: Run level 1 starting
2016-08-16T21:48:01.979773Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:48:02.528096Z broker.info[0]: Run level 1 Exited (rc=0) 4.1s
2016-08-16T21:48:02.528118Z broker.info[0]: Run level 2 starting
2016-08-16T21:48:02.545963Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:48:02.546035Z broker.info[0]: Run level 3 starting
2016-08-16T21:48:02.557085Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:48:03.659512Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m16.614s
user 0m1.221s
sys 0m1.269s
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:48:16.124555Z broker.info[0]: pmi: bootstrap time 0.8s
2016-08-16T21:48:17.781674Z broker.info[0]: nodeset: [0-2303] (complete) 1.6s
2016-08-16T21:48:17.781691Z broker.info[0]: Run level 1 starting
2016-08-16T21:48:22.678567Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:48:23.170692Z broker.info[0]: Run level 1 Exited (rc=0) 5.4s
2016-08-16T21:48:23.170712Z broker.info[0]: Run level 2 starting
2016-08-16T21:48:23.187714Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:48:23.187783Z broker.info[0]: Run level 3 starting
2016-08-16T21:48:23.198795Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:48:24.293465Z broker.info[0]: Run level 3 Exited (rc=0) 1.1s
real 0m19.871s
user 0m1.235s
sys 0m1.392s
Oops, one more run so we can throw out the numbers from the first cold-cache run above and still have 10 runs. None of these took more than 21s to complete.
bash-4.2$ time srun -N2048 -n2304 src/cmd/flux start -o,-Slog-stderr-level=6 /bin/true
2016-08-16T21:50:06.053167Z broker.info[0]: pmi: bootstrap time 0.7s
2016-08-16T21:50:07.782006Z broker.info[0]: nodeset: [0-2303] (complete) 1.7s
2016-08-16T21:50:07.782023Z broker.info[0]: Run level 1 starting
2016-08-16T21:50:12.869093Z cron.info[0]: synchronizing cron tasks to event hb
2016-08-16T21:50:13.347550Z broker.info[0]: Run level 1 Exited (rc=0) 5.6s
2016-08-16T21:50:13.347572Z broker.info[0]: Run level 2 starting
2016-08-16T21:50:13.364337Z broker.info[0]: Run level 2 Exited (rc=0) 0.0s
2016-08-16T21:50:13.364410Z broker.info[0]: Run level 3 starting
2016-08-16T21:50:13.375358Z broker.info[0]: shutdown in 5.000s: run level 2 Exited
2016-08-16T21:50:14.814931Z broker.info[0]: Run level 3 Exited (rc=0) 1.4s
real 0m20.255s
user 0m1.226s
sys 0m1.337s
Here's a baseline:
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 2.094s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.517s
real 0m9.129s
user 0m1.228s
sys 0m1.333s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.107s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.173s
real 0m7.666s
user 0m1.224s
sys 0m1.337s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.134s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.495s
real 0m8.016s
user 0m1.220s
sys 0m1.303s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.119s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.175s
real 0m7.675s
user 0m1.179s
sys 0m1.405s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.161s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.331s
real 0m7.864s
user 0m1.288s
sys 0m1.325s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.142s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.173s
real 0m7.685s
user 0m1.255s
sys 0m1.263s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.198s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.173s
real 0m7.758s
user 0m1.240s
sys 0m1.373s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.152s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.174s
real 0m13.695s
user 0m1.257s
sys 0m1.315s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.091s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.176s
real 0m7.657s
user 0m1.207s
sys 0m1.383s
bash-4.2$ time srun -N2048 -n2304 ./hello.intel
0: completed MPI_Init in 1.173s. There are 2304 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.176s
real 0m7.720s
user 0m1.273s
sys 0m1.368s
Here's a small scaling study launching flux within an allocation of 2048 nodes, with procs varying from 192 to 3071. Times (seconds) are averages over 5 runs. The "hello" column is a baseline run of our MPI hello world program.
Scale | total | pmi | wireup | rc1 | hello |
---|---|---|---|---|---|
192 | 3.95 | 0.1 | 0.0 | 0.90 | 1.76 |
384 | 4.73 | 0.2 | 0.2 | 0.94 | 2.33 |
768 | 6.88 | 0.3 | 0.4 | 1.6 | 3.51 |
1536 | 14.38 | 0.3 | 0.9 | 2.98 | 5.81 |
3071 | 19.45 | 0.4 | 1.2 | 5.72 | 8.42 |
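For what it's worth, a sweep like that is easy to script; a minimal sketch that averages wall-clock time over 5 runs per scale (scales, run count, and paths are illustrative, not the exact script used):

```
#!/bin/bash
# Inside the salloc: time "flux start /bin/true" 5 times at each scale
# and print the average wall-clock seconds.
for n in 192 384 768 1536 3071; do
    total=0
    for i in 1 2 3 4 5; do
        start=$(date +%s.%N)
        srun -n$n ./flux start /bin/true
        end=$(date +%s.%N)
        total=$(echo "$total + ($end - $start)" | bc -l)
    done
    echo "ntasks=$n avg=$(echo "$total / 5" | bc -l)s"
done
```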
I think this component of the milestone is complete. I'll leave this open through the rest of the CTS-1 testing period.
_Goals_
Determine Flux startup/teardown scalability. Find any bugs revealed at scale.
_Methodology_
Gather baseline SLURM job launch data, so SLURM overhead can be subtracted from Flux times. Use an MPI program that bootstraps with PMI, as Flux does. This SLURM data may also be useful for comparison with Flux job launch data (not covered here). Get enough runs to extract valid statistics. Run at various scales, e.g. 1, 2, 4, 8, 12 scalable units (one task per node).
Instrument and time the following phases of Flux startup/teardown:
Run through the same scales as the baseline, and run enough to extract valid statistics. Log issues for all anomalies, linked from #12.
Extra credit: continue baseline and Flux scaling to one task per core.
Publish baseline and Flux statistics in this issue.
_Resource requirements_
Ramp up, beginning on opal (1 SU, 192 nodes, 6912 cores), to the full CTS-1 system.
_Exit criteria_
srun -N2304 flux start /bin/true
_Issues to watch out for_
`flux module load -r all` is called serially in rc1 (and the converse, module removal, in rc3), and seems not to scale well (more info needed).