@ndkeen and I have continued to look at this. The nonscalable performance is occurring in the first call to swapm in pio_rearrange_create_box during initialization (primarily for ATM in the F case, but also for LND, OCN, ...). There is a second swapm call in the same routine with the same message pattern, and even larger messages, and this is NOT a performance issue. (I placed barriers around both calls to verify this.)
This swapm operator has every process sending 1 integer to each of the PIO processes. To avoid being overwhelmed by, for example, 86400 unexpected messages, the handshaking protocol is hardwired on. (I also tried disabling handshaking; the cost of this operation at least doubled, so handshaking is a good thing in this case.)
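For reference, here is a minimal sketch (not the actual PIO swapm code) of a handshaked version of that pattern: each I/O task posts its receives first and then sends a zero-length "ready" token to every source, so no data message arrives unexpected. The names (handshaked_gather, nio, ioranks, etc.) are illustrative only.

```fortran
! Sketch only (not the PIO swapm implementation) of a handshaked exchange in
! which every rank sends one integer to each of the nio I/O tasks listed in
! ioranks(1:nio).  Each I/O task posts its receives first and then sends a
! zero-length "ready" token to every source, so nothing arrives unexpected.
subroutine handshaked_gather(comm, nio, ioranks, myval, rcvbuf)
  use mpi
  implicit none
  integer, intent(in)    :: comm, nio, ioranks(nio), myval
  integer, intent(inout) :: rcvbuf(*)   ! length = communicator size on I/O tasks
  integer :: rank, nprocs, i, ierr, tok_s, tok_r
  integer :: status(MPI_STATUS_SIZE)
  integer, allocatable :: rreq(:), sreq(:)
  logical :: am_io

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)
  am_io = any(ioranks == rank)
  tok_s = 0

  if (am_io) then
     allocate(rreq(nprocs), sreq(nprocs))
     do i = 0, nprocs-1           ! post the data receives first ...
        call MPI_Irecv(rcvbuf(i+1), 1, MPI_INTEGER, i, 2, comm, rreq(i+1), ierr)
     end do
     do i = 0, nprocs-1           ! ... then tell every source it may send
        call MPI_Isend(tok_s, 0, MPI_INTEGER, i, 1, comm, sreq(i+1), ierr)
     end do
  end if

  do i = 1, nio                   ! every rank: wait for "ready", then send
     call MPI_Recv(tok_r, 0, MPI_INTEGER, ioranks(i), 1, comm, status, ierr)
     call MPI_Send(myval, 1, MPI_INTEGER, ioranks(i), 2, comm, ierr)
  end do

  if (am_io) then
     call MPI_Waitall(nprocs, rreq, MPI_STATUSES_IGNORE, ierr)
     call MPI_Waitall(nprocs, sreq, MPI_STATUSES_IGNORE, ierr)
     deallocate(rreq, sreq)
  end if
end subroutine handshaked_gather
```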
My current hypothesis is that there is overhead in the MPI layer when a process receives a message from another process for the first time (and perhaps once per communicator). If so, we should be able to measure that with a simple benchmark. I'll write one that has a message pattern similar to what we are doing in pio_rearrange_create_box and see if it has similar behavior.
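Something along these lines is what I have in mind (a sketch, assuming a single receiver at rank 0 stands in for a PIO I/O task): the first pass pays any first-connection cost inside the MPI library, and the second pass reuses the established connections, so the difference should isolate that setup overhead.

```fortran
! Sketch of a microbenchmark for first-message overhead: every rank sends one
! integer to rank 0, twice.  Pass 1 includes any per-pair connection setup in
! the MPI layer; pass 2 reuses the connections.  Timings reported on rank 0.
program first_message_cost
  use mpi
  implicit none
  integer :: rank, nprocs, i, pass, ierr, val
  integer :: status(MPI_STATUS_SIZE)
  double precision :: t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  do pass = 1, 2
     call MPI_Barrier(MPI_COMM_WORLD, ierr)
     t0 = MPI_Wtime()
     if (rank == 0) then
        do i = 1, nprocs-1
           call MPI_Recv(val, 1, MPI_INTEGER, i, pass, MPI_COMM_WORLD, status, ierr)
        end do
     else
        call MPI_Send(rank, 1, MPI_INTEGER, 0, pass, MPI_COMM_WORLD, ierr)
     end if
     call MPI_Barrier(MPI_COMM_WORLD, ierr)
     t1 = MPI_Wtime()
     if (rank == 0) print '(a,i2,a,f10.4,a)', 'pass ', pass, ': ', t1-t0, ' s'
  end do

  call MPI_Finalize(ierr)
end program first_message_cost
```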
If this is the problem, then unless the MPI library developers can do something differently, we may need to keep the process count from getting too high and use more threading instead.
I tried replacing the explicit point-to-point communication algorithm used in the two swapm calls in pio_rearrange_create_box with MPI_Alltoallv. The cost of the routine containing these two calls (compute_counts) dropped from 112 seconds to 1 second for the atmosphere initialization in an ne120_ne120 F case with a 10800x1 decomposition. However, overall initialization cost increased by 50%. Timer coverage was insufficient to capture where the new performance bottleneck was (a couple of places at least, including someplace in clm_init1).
I then replaced the first swapm call with MPI_Alltoallv, but left the second as a swapm call. Here the first MPI_Alltoallv was (again) very fast, and the second swapm call was slow (22 seconds in ATM, and 394 seconds in LND).
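For the record, the collective form of the exchange looks roughly like the sketch below. This is not the actual compute_counts change, just the same "1 integer to each I/O task" pattern expressed with MPI_Alltoallv; is_io, sndval, and the other names are illustrative.

```fortran
! Sketch of the "every rank sends 1 integer to each I/O task" pattern as a
! single MPI_Alltoallv: sendcounts are 1 only toward I/O tasks, and
! recvcounts are 1 from every rank only on I/O tasks.
subroutine counts_alltoall(comm, is_io, sndval, rcvbuf)
  use mpi
  implicit none
  integer, intent(in)    :: comm, sndval
  logical, intent(in)    :: is_io(0:)     ! is_io(i) is true if rank i is an I/O task
  integer, intent(inout) :: rcvbuf(0:)    ! one slot per rank; filled on I/O tasks
  integer :: rank, nprocs, i, ierr
  integer, allocatable :: scnt(:), sdsp(:), rcnt(:), rdsp(:), sbuf(:)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)
  allocate(scnt(0:nprocs-1), sdsp(0:nprocs-1), rcnt(0:nprocs-1), rdsp(0:nprocs-1))
  allocate(sbuf(0:nprocs-1))

  do i = 0, nprocs-1
     scnt(i) = merge(1, 0, is_io(i))      ! send 1 integer to each I/O task
     rcnt(i) = merge(1, 0, is_io(rank))   ! I/O tasks receive 1 from every rank
     sdsp(i) = i                          ! contiguous layout, one slot per rank
     rdsp(i) = i
     sbuf(i) = sndval
  end do

  call MPI_Alltoallv(sbuf, scnt, sdsp, MPI_INTEGER, &
                     rcvbuf, rcnt, rdsp, MPI_INTEGER, comm, ierr)
end subroutine counts_alltoall
```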
So, the current hypothesis is that
a) MPI_Alltoallv is very fast (and we should make sure to investigate exploiting this on Cori-KNL in other locations in the code).
b) The first instance of point-to-point communication between two processes will be expensive. So, unless we can replace ALL point-to-point communication with collective calls, this overhead will show up somewhere.
If we can replace all global-like communication with collectives, this might help. Leaving the more local communication as point-to-point should be low cost? I don't know if this is possible, however, but it is easy enough to find the nonlocal communication patterns one at a time: keep running until we find an expensive timer, eliminate that communication, then try again.
Decreasing the number of MPI processes may be simpler, at least in the short term. This is also ONLY an initialization cost (or a first-call cost if we pushed this communication into the run loop).
@ndkeen , the case you gave me targets Cori-Haswell (though it runs on the KNL nodes). Do you have an example that targets Cori-KNL? I want to remove any chance that the MPI library might behave differently when building with a Cori-KNL target.
FYI - the slide deck you shared with me indicated that
> Frequent reboots are not encouraged, as they currently take a long time
so perhaps the long start-up times occur when the ACME quad,cache job is scheduled on nodes that ran a quad,flat job previously (is this how I should interpret the comment?). The deck also says that the default is quad,flat.
I also see the suggestion of using 'sbcast' to put the executable in /tmp before running. You are not doing this. I'm trying it out right now. Just curious what your opinion/experience is.
> so perhaps the long start-up times occur when the ACME quad,cache job is scheduled on nodes that ran a quad,flat job previously
Never mind - this does not match the evidence (all of the large times, when they occur, are in the preprocessing steps, with namelists slooooowly being created, before the 'model run started' output).
There are two modules: craype-haswell and craype-mic-knl, with haswell as the default. When loaded, they help ensure that the right flags are used to optimize for the target platform. For KNL, you can leave it at craype-haswell and manually add the flag, which is what I've done in my branches. The reason I went this way was that the OCN was not building with mic-knl (because it builds at least one small exe that needs to run on a login node). I also do a little cheat where all C builds are optimized for BOTH haswell and KNL (which only adds a little extra size to the exe). Most of our builds are Fortran, of course. Note that the login nodes are haswell nodes.
You can easily test using craype-mic-knl. Just change env_mach_specific to use that one instead. You can then remove (or leave in) the compiler flags that start with -x.
There are 2 major modes for the KNL nodes: quad,cache and quad,flat. Right now, most of them are cache, so it makes most sense to use that. They have left some quad,flat, but it doesn't look like a terribly interesting experiment for ACME, so I would only use that if you are trying to beat the Q.
I heard that sbcast was mostly useful for larger MPI jobs, to aid in getting the exe out to the nodes. So this would probably NOT affect the init time, as it happens before MPI_Init is called.
@ndkeen , thank you. I just now finished trying the Intel MPI library. Initialization was somewhat faster, but the detailed timers that I have been looking at did not change at all, so probably irrelevant or random variability? I'm trying building/running with the craype-mic-knl module now, just to check that there is not something weird going on with the version of the MPI library being loaded.
@ndkeen, latest experiment (using craype-mic-knl ) changed nothing. FYI.
Well that's good to know. I do have more timing data to add to this discussion, just need to compile it.
My only modification that improved performance was to increase the hard-wired MAXREQS used in the compute_counts routine, and this was not a big difference, at least for the case I was trying. I've been focusing on the 10800x1 case, since this runs quickly but still has the timers that demonstrate the problem. I'll try a larger case next. Note that I am not optimistic, as I am running out of things to try.
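For anyone following along, MAXREQS limits how many receives (and handshake tokens) are outstanding at once on the receiving side. The sketch below shows only that kind of throttling, with illustrative names (maxreq, nsrc) rather than the actual PIO parameter values or code.

```fortran
! Sketch of MAXREQS-style throttling on the receiver side: at most maxreq
! receives are posted at a time; the matching "ready" tokens are sent only
! for the current batch, and the batch is completed before posting the next.
subroutine throttled_recv(comm, nsrc, maxreq, rcvbuf)
  use mpi
  implicit none
  integer, intent(in)    :: comm, nsrc, maxreq
  integer, intent(inout) :: rcvbuf(nsrc)   ! one integer expected from ranks 0..nsrc-1
  integer :: first, last, n, i, ierr, tok
  integer, allocatable :: req(:)

  allocate(req(maxreq))
  tok = 0
  first = 1
  do while (first <= nsrc)
     last = min(first + maxreq - 1, nsrc)
     n = last - first + 1
     do i = 1, n                    ! post a batch of receives ...
        call MPI_Irecv(rcvbuf(first+i-1), 1, MPI_INTEGER, first+i-2, 2, comm, req(i), ierr)
     end do
     do i = 1, n                    ! ... then release the matching senders
        call MPI_Send(tok, 0, MPI_INTEGER, first+i-2, 1, comm, ierr)
     end do
     call MPI_Waitall(n, req, MPI_STATUSES_IGNORE, ierr)
     first = last + 1
  end do
  deallocate(req)
end subroutine throttled_recv
```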
Closing this as we now have: https://github.com/ACME-Climate/ACME/issues/1578
Running on the KNL nodes of Cori, I noticed that the Init time for the F cases was very large. Note that I ran with a PE layout such that all components had the same number of MPI tasks, and I used all 68 cores per node on the KNL. I'm also trying running with 64 cores per node.
```
create_newcase -case /global/cscratch1/sd/ndk/acme_scratch/SMS.ne120_ne120.m20n1271p86400tb-pw -res ne120_ne120 -mach cori-knl -compiler intel -compset FC5AV1C-04P -project m1704
```
This is the "Init Time" as reported by the
timing/cesm_timing.SMS.ne1*
file. I also include the overall SYPD for these same runs (not related to Init Time).I tried a patch put together by Pat to address some Init slowness, but it didn't help. To try this, I replaced the following files from Pat directly:
```
Makefile  m_GlobalSegMap.F90  m_MatAttrVectMul.F90  m_Rearranger.F90  m_Router.F90  m_SPMDutils.F90
```
I also increased the amount of timing data collected.
And I'm attaching some profile data after adding more detailed timers. This is for the 86400-way run. Pat is giving me more timers to add.
cesm_timing.00000.txt