Closed: jedwards4b closed this pull request 4 years ago.
@amametjanov when you get a chance, test that this has no bad side effects on KNL machines with large task-count jobs.
@jedwards4b just curious, what were the numbers for cime_pre_init1 in the slow runs? I'm seeing small init times on ALCF-Theta, e.g. from run/timing/model_timing_stats:
4.993 seconds for the slowest process at 25600 procs on 800 nodes (32 tasks/node)
name on processes threads count walltotal wallmax (proc thrd ) wallmin (proc thrd )
"CPL:cime_pre_init1" - 25600 25600 2.560000e+04 1.237269e+05 4.993 ( 0 0) 4.673 ( 23515 0)
Call sequence:
@amametjanov This problem was found in an older version of the model which doesn't have a timer for cime_pre_init1.
Checked with the impi/2018.up3 intel-mpi module on Cori-KNL with a basic SMS.f19_g16.X.cori-knl_intel case:
SMS_P10000.f19_g16.X, 157 nodes, 64x1, 10k MPI tasks
master
name on processes threads count walltotal wallmax (proc thrd ) wallmin (proc thrd )
"CPL:cime_pre_init1" - 10000 10000 1.000000e+04 2.362925e+06 236.397 ( 3030 0) 236.137 ( 130 0)
branch
"CPL:cime_pre_init1" - 10000 10000 1.000000e+04 2.554841e+06 255.602 ( 2151 0) 255.286 ( 5754 0)
So a slight slow-down from 236.397 to 255.602 seconds. And then with a doubling of tasks:
SMS_P20000.f19_g16.X, 313 nodes, 64x1, 20k MPI tasks
master
name on processes threads count walltotal wallmax (proc thrd ) wallmin (proc thrd )
"CPL:cime_pre_init1" - 20000 20000 2.000000e+04 1.573237e+07 786.894 ( 19830 0) 786.337 ( 680 0)
branch
"CPL:cime_pre_init1" - 20000 20000 2.000000e+04 1.556323e+07 778.428 ( 19998 0) 777.861 ( 184 0)
A slight speedup from 786.894 to 778.428. So the changes in the branch appear to be okay.
@rljacob can you merge this PR? I have another coming and I would like to add both to cime master.
@jedwards4b I'm finally merging this but did you still have another PR to do?
Passed MCT system test with no change in answers
No change in answers for an F-case, but init times are slower (at small node counts), as Az observed. 144 tasks on anvil for an ne4 case.
With this change:
Init Time : 15.900 seconds
Run Time : 17.782 seconds 3.556 seconds/day
Final Time : 0.047 seconds
Without this change:
Init Time : 11.461 seconds
Run Time : 17.800 seconds 3.560 seconds/day
Final Time : 0.050 seconds
@jedwards4b are you sure you still need this?
Taking this off master until it can be tested more.
This is a change in the initialization and should not change overall performance in the least. The results reported above differ by 0.004 seconds/day, which is negligible. It is also on a small node count. It would be more appropriate to run several tests using each configuration at higher node counts. This problem was found in a high-resolution run using some 50K compute processors.
@jedwards4b, your current fix is plenty fast enough, but I wonder whether you could reproduce the performance problem with a code that just loops over identical calls to MPI_Gather. If so, it might be worthwhile forwarding this to the system administrators (to be forwarded to Intel). I would expect MPI_Gather to use a tree algorithm when invoked for a very large number of processes and small per-process data (1 integer), which should be safe. However, if a different algorithm is used and lots of processes send directly to the root, and if processes from one MPI_Gather start sending messages associated with the next MPI_Gather call (especially to the same root), you could see the root being overwhelmed (and slowed to a crawl). Your barriers prevent this mixing of the MPI_Gather calls, which may be why they solve the problem.
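For reference, a minimal standalone sketch of the kind of reproducer suggested above, assuming one integer gathered per rank with no synchronization between gathers; the loop count and identifiers are illustrative and not taken from CIME:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int niter = 100;          /* back-to-back gathers, no sync in between */
    int sendval = rank;
    /* mirror the original code path: receive buffer allocated on every rank */
    int *recvbuf = malloc((size_t)nprocs * sizeof(int));

    double t0 = MPI_Wtime();
    for (int i = 0; i < niter; i++) {
        /* with no barrier, messages for iteration i+1 can start arriving
           at the root before it has finished iteration i */
        MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d gathers over %d ranks: %f s\n", niter, nprocs, t1 - t0);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Running this at the problem task counts (with and without I_MPI_ADJUST_GATHER overrides) would show whether the slowdown is reproducible outside the model.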
In evaluating your fix, I also looked at adding a call to MPI_BCAST immediately following the MPI_GATHER call (instead of a barrier), broadcasting a zero-length message. This forces all of the other processes to wait until the root is done with the gather before going on to the next gather. This is cheaper than using the barrier (on Cori-KNL for a range of process counts), but I have no idea whether it would eliminate your original problem. And, again, the cost of your additional barriers is very low, so this was just curiosity on my part.
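A rough sketch of the two post-gather synchronization variants being compared; the function and variable names (gather_then_sync, myval, rootbuf) are placeholders, not the actual CIME identifiers:

```c
#include <mpi.h>

/* Sketch only: compare a barrier after the gather (the fix in this PR)
   against a zero-length broadcast from the root. */
static void gather_then_sync(MPI_Comm comm, int myval, int *rootbuf, int use_bcast)
{
    MPI_Gather(&myval, 1, MPI_INT, rootbuf, 1, MPI_INT, 0, comm);

    if (use_bcast) {
        /* zero-length broadcast: non-root ranks cannot proceed until the
           root has completed the gather and posted the broadcast */
        char dummy = 0;
        MPI_Bcast(&dummy, 0, MPI_BYTE, 0, comm);
    } else {
        /* barrier, as used in this PR */
        MPI_Barrier(comm);
    }
}
```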
(You could also play with I_MPI_ADJUST_GATHER to see whether other algorithm choices exhibit the same problem.)
It was in conjunction with Intel MPI developers on TACC Frontera that this fix was found. We did experiment with several settings of I_MPI_ADJUST_GATHER.
Thanks for the additional information.
Was having problems with extremely long startup times using impi at large task counts (10,000 to 100,000). In the process of debugging, I found that adding this barrier and avoiding an unnecessary allocation on non-root tasks improves the startup performance greatly (from 2 or more hours down to 5 minutes or less). I don't have a good explanation for this at this time, but it should be harmless.
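For anyone reading along, a minimal sketch of the pattern described here (receive buffer allocated only on the root, barrier after the gather); the routine and variable names are illustrative, not the actual CIME code:

```c
#include <mpi.h>
#include <stdlib.h>

/* gather_task_info is a made-up name; the real CIME routine differs */
void gather_task_info(MPI_Comm comm, int myval)
{
    int rank, nprocs;
    int *allvals = NULL;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* allocate the receive buffer only where MPI_Gather actually reads it */
    if (rank == 0)
        allvals = malloc((size_t)nprocs * sizeof(int));

    MPI_Gather(&myval, 1, MPI_INT, allvals, 1, MPI_INT, 0, comm);

    /* the added barrier keeps successive gathers from piling up at the root */
    MPI_Barrier(comm);

    if (rank == 0) {
        /* ... use allvals ... */
        free(allvals);
    }
}
```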