MCSclimate / MCT

Model Coupling Toolkit

impi performance fix for MCTWorld:initm #56

Closed: jedwards4b closed this 4 years ago

jedwards4b commented 4 years ago

Was having problems with extremely long start-up times using impi on large task counts (10,000 to 100,000). While debugging, I found that adding this barrier and avoiding an unnecessary allocation on non-root tasks improves the startup performance greatly (down from 2 or more hours to 5 minutes or less). I don't have a good explanation for this at this time, but it should be harmless.
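
For context, here is a minimal, hypothetical sketch of the pattern being described (not the actual MCTWorld:initm diff): gather one integer per task to the root, allocate the receive buffer only where it is significant (the root), and follow the gather with a barrier.

    ! Hypothetical sketch (not the MCTWorld:initm source): root-only
    ! allocation of the gather buffer plus a barrier after the gather.
    program initm_pattern
      use mpi
      implicit none
      integer :: ierr, rank, nprocs, root, mydata
      integer, allocatable :: alldata(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      root = 0
      mydata = rank

      ! The receive buffer is only significant at the root, so non-root
      ! tasks get a size-1 dummy instead of an nprocs-sized array.
      if (rank == root) then
         allocate(alldata(nprocs))
      else
         allocate(alldata(1))
      end if

      call MPI_Gather(mydata, 1, MPI_INTEGER, alldata, 1, MPI_INTEGER, &
                      root, MPI_COMM_WORLD, ierr)

      ! Barrier keeps successive collective calls from piling up at the root.
      call MPI_Barrier(MPI_COMM_WORLD, ierr)

      deallocate(alldata)
      call MPI_Finalize(ierr)
    end program initm_pattern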

rljacob commented 4 years ago

@amametjanov when you get a chance, test that this has no bad side effects on KNL machines with large task-count jobs.

amametjanov commented 4 years ago

@jedwards4b just curious, what were the numbers for cime_pre_init1 in the slow runs? I'm seeing small init times on ALCF-Theta: e.g., from run/timing/model_timing_stats, 4.993 seconds for the slowest process with 25600 procs on 800 nodes at 32 tasks/node:

name                                            on  processes  threads        count      walltotal   wallmax (proc   thrd  )   wallmin (proc   thrd  )
"CPL:cime_pre_init1"                             -      25600    25600 2.560000e+04   1.237269e+05     4.993 (     0      0)     4.673 ( 23515      0)

Call sequence:

jedwards4b commented 4 years ago

@amametjanov This problem was found in an older version of the model which doesn't have a timer for cime_pre_init1.

amametjanov commented 4 years ago

Checked with impi/2018.up3 intel-mpi module on Cori-KNL with a basic SMS.f19_g16.X.cori-knl_intel case:

SMS_P10000.f19_g16.X, 157 nodes, 64x1, 10k MPI tasks
master
name                                on  processes  threads        count      walltotal   wallmax (proc   thrd  )   wallmin (proc   thrd  )
"CPL:cime_pre_init1"                 -      10000    10000 1.000000e+04   2.362925e+06   236.397 (  3030      0)   236.137 (   130      0)

branch
"CPL:cime_pre_init1"                 -      10000    10000 1.000000e+04   2.554841e+06   255.602 (  2151      0)   255.286 (  5754      0)

So a slight slowdown, from 236.397 s to 255.602 s. And then with the task count doubled:

SMS_P20000.f19_g16.X, 313 nodes, 64x1, 20k MPI tasks
master
name                                on  processes  threads        count      walltotal   wallmax (proc   thrd  )   wallmin (proc   thrd  )
"CPL:cime_pre_init1"                 -      20000    20000 2.000000e+04   1.573237e+07   786.894 ( 19830      0)   786.337 (   680      0)

branch
"CPL:cime_pre_init1"                 -      20000    20000 2.000000e+04   1.556323e+07   778.428 ( 19998      0)   777.861 (   184      0)

A slight speedup, from 786.894 s to 778.428 s. So the changes in the branch appear to be okay.

jedwards4b commented 4 years ago

@rljacob can you merge this PR? I have another coming and I would like to add both to cime master.

rljacob commented 4 years ago

@jedwards4b I'm finally merging this but did you still have another PR to do?

rljacob commented 4 years ago

Passed MCT system test with no change in answers

rljacob commented 4 years ago

No change in answers for an F-case, but init times are slower (for small node counts), as Az observed. 144 tasks on anvil for an ne4 case with this change:

    Init Time   :      15.900 seconds 
    Run Time    :      17.782 seconds        3.556 seconds/day 
    Final Time  :       0.047 seconds 

Without this change:

    Init Time   :      11.461 seconds 
    Run Time    :      17.800 seconds        3.560 seconds/day 
    Final Time  :       0.050 seconds 

@jedwards4b are you sure you still need this?

rljacob commented 4 years ago

Taking this off master until it can be tested more.

jedwards4b commented 4 years ago

This is a change in the initialization and should not change overall performance in the least. The run times reported above differ by 0.004 seconds/day, which is negligible, and the comparison is on a small node count. It would be more appropriate to run several tests with each configuration at higher node counts. This problem was found in a high-resolution run using some 50K compute processors.

worleyph commented 3 years ago

@jedwards4b, your current fix is plenty fast enough, but I wonder if you could reproduce the performance problem with a code that just loops over identical calls to MPI_Gather. If so, it might be worthwhile to forward this to the system administrators (to be forwarded to Intel). I would expect MPI_Gather to use a tree algorithm when invoked for a very large number of processes and small per-process data (1 integer), which should be safe. However, if a different algorithm is used and lots of processes are sending directly to the root, and if processes from one MPI_Gather start sending messages associated with the next MPI_Gather call (especially to the same root), you could see the root being overwhelmed (and slowed to a crawl). Your barriers prevent this mixing of the MPI_Gather calls, which may be why they solve the problem.
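
A hypothetical standalone reproducer along these lines, under the assumption that back-to-back gathers of a single integer per task to the same root are enough to trigger the slowdown:

    ! Hypothetical reproducer: time a loop of identical MPI_Gather calls
    ! of one integer per task to rank 0, with no synchronization between
    ! iterations, so successive gathers are free to overlap at the root.
    program gather_loop
      use mpi
      implicit none
      integer, parameter :: niters = 1000
      integer :: ierr, rank, nprocs, i, mydata
      integer, allocatable :: alldata(:)
      double precision :: t0, t1

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      allocate(alldata(nprocs))
      mydata = rank

      t0 = MPI_Wtime()
      do i = 1, niters
         call MPI_Gather(mydata, 1, MPI_INTEGER, alldata, 1, MPI_INTEGER, &
                         0, MPI_COMM_WORLD, ierr)
      end do
      t1 = MPI_Wtime()

      if (rank == 0) print *, 'gather loop time (s): ', t1 - t0

      deallocate(alldata)
      call MPI_Finalize(ierr)
    end program gather_loop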

In evaluating your fix, I also looked at adding a call to MPI_BCAST immediately following the MPI_GATHER call (instead of a barrier), broadcasting a zero-length message. This forces all of the other processes to wait until the root is done with the gather before going on to the next gather. This is cheaper than using the barrier (on Cori-KNL for a range of process counts), but I have no idea whether it would eliminate your original problem. And, again, the cost of your additional barriers is very low, so this was just curiosity on my part.
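
For reference, a sketch of that variant, written as a drop-in replacement for the loop in the reproducer above (again hypothetical): a zero-count broadcast from the root after each gather acts as a one-way fence instead of a full barrier.

    do i = 1, niters
       call MPI_Gather(mydata, 1, MPI_INTEGER, alldata, 1, MPI_INTEGER, &
                       0, MPI_COMM_WORLD, ierr)
       ! Zero-length broadcast from the root: non-root tasks cannot
       ! continue until the root has finished the preceding gather.
       call MPI_Bcast(mydata, 0, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
    end do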

(You could also play with the I_MPI_ADJUST_GATHER, to see whether other algorithm choices exhibit the same problem.)

jedwards4b commented 3 years ago

It was in conjunction with Intel MPI developers on TACC Frontera that this fix was found. We did experiment with several settings of I_MPI_ADJUST_GATHER.

worleyph commented 3 years ago

Thanks for the additional information.