SPECFEM / specfem2d

SPECFEM2D simulates forward and adjoint seismic wave propagation in two-dimensional acoustic, (an)elastic, poroelastic or coupled acoustic-(an)elastic-poroelastic media, with Convolution PML absorbing conditions.
GNU General Public License v3.0

the 2D code now scales poorly on many processors / MPI #562

Closed: komatits closed this issue 7 years ago

komatits commented 8 years ago

The 2D code was carefully optimized for speed and for MPI scaling back in 2007 and 2008 (see http://komatitsch.free.fr/preprints/Vecpar_2008_Roland.pdf ), but it has changed so much since then that it is now slow again and exhibits poor scaling. Vadim @vmont will investigate this when he has time (i.e., not any time soon ;-). Cc'ing people who use the 2D code for real applications and could thus be interested: @vmont @bottero @pcristini

komatits commented 8 years ago

Here is what should be done to make the 2D code fast again:

1/ inline the inner loops in compute_forces(), as suggested by Intel; there is no Deville trick in 2D, only in 3D, but inlining would work; see the attached 3D demo code optimized by Intel, the same should be done in 2D; the expected speedup is about 2x

2/ simplify and clean up the PML code, which works but (at least in the 3D code) is very complex (indirect addressing, local arrays)

3/ remove all "if" statements from compute_forces(), writing a dedicated version of each case instead (i.e. compute_forces_acoustic_no_PML, compute_forces_acoustic_with_PML, and the same for elastic, viscoelastic and poroelastic; thus about 10 instances of the routine); also consider handling SH waves in separate compute_forces routines, for the same reason. Without this, the code will always be slow because the compiler cannot efficiently vectorize the loops (see the sketch just after this list)

3b/ this should also be done for all the "if (AXISYM) then" statements in compute_forces, which also prevent efficient vectorization

4/ once all of this is done, measure the time spent in each of the compute_forces routines and use it as a weight in the SCOTCH domain decomposition, so that SCOTCH compensates for the cost differences; SCOTCH needs integer weights, thus if you measure say 1.2 milliseconds, use a weight of 12, i.e. remove the decimal point (see the second sketch below)

5/ (already done in 3D, not in 2D) switch to non-blocking MPI instead of blocking, as suggested by Daniel
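
For illustration, here is a minimal sketch of what point 3/ would look like; the routine and variable names (demo_forces, is_PML, field, ...) are hypothetical, not the actual SPECFEM2D code. The element-type test is hoisted out of the GLL loops, so each dedicated routine has branch-free inner loops that the compiler can vectorize:

```fortran
! Minimal sketch of point 3/: hoist the PML test out of the GLL loops so
! that each inner loop is branch-free and can be vectorized.
! All names (demo_forces, is_PML, field, ...) are illustrative only,
! not the actual SPECFEM2D routines or arrays.
module demo_forces
  implicit none
  integer, parameter :: NGLLX = 5, NGLLZ = 5
contains
  ! branch-free inner loops: the compiler can vectorize over i
  subroutine compute_forces_acoustic_no_PML(field)
    real, intent(inout) :: field(NGLLX,NGLLZ)
    integer :: i, j
    do j = 1, NGLLZ
      do i = 1, NGLLX
        field(i,j) = 0.5 * field(i,j)   ! placeholder for the regular update
      enddo
    enddo
  end subroutine compute_forces_acoustic_no_PML

  subroutine compute_forces_acoustic_with_PML(field)
    real, intent(inout) :: field(NGLLX,NGLLZ)
    integer :: i, j
    do j = 1, NGLLZ
      do i = 1, NGLLX
        field(i,j) = 0.25 * field(i,j)  ! placeholder for the PML update
      enddo
    enddo
  end subroutine compute_forces_acoustic_with_PML
end module demo_forces

program dispatch_demo
  use demo_forces
  implicit none
  integer, parameter :: nspec = 100
  logical :: is_PML(nspec)
  real :: field(NGLLX,NGLLZ,nspec)
  integer :: ispec

  is_PML = .false.
  is_PML(1:10) = .true.     ! say the first ten elements lie in the PML
  field = 1.0

  ! the "if" is now tested once per element, outside the hot loops,
  ! instead of once per GLL point inside them
  do ispec = 1, nspec
    if (is_PML(ispec)) then
      call compute_forces_acoustic_with_PML(field(:,:,ispec))
    else
      call compute_forces_acoustic_no_PML(field(:,:,ispec))
    endif
  enddo
end program dispatch_demo
```

The same per-element dispatch would then extend to the elastic, viscoelastic and poroelastic variants.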

With all of this, the code would be very fast and efficient again (and also much clearer, easier to maintain, and much closer to the 3D code).
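
Similarly, a tiny sketch of the weight conversion in point 4/ (hypothetical names and timings, not the actual SPECFEM2D or SCOTCH interface): the measured per-element times are simply scaled and rounded to the integer weights that SCOTCH expects.

```fortran
! Minimal sketch of point 4/: convert measured per-element times into
! integer weights for SCOTCH. Names and values are illustrative only.
program scotch_weight_demo
  implicit none
  double precision, parameter :: SCALE = 10.d0  ! keep one decimal digit
  double precision :: t_acoustic_ms, t_elastic_pml_ms
  integer :: w_acoustic, w_elastic_pml

  ! measured time per element of each type, in milliseconds
  t_acoustic_ms    = 1.2d0
  t_elastic_pml_ms = 4.7d0

  ! SCOTCH needs integer weights, so scale and round
  w_acoustic    = nint(t_acoustic_ms    * SCALE)   ! -> 12
  w_elastic_pml = nint(t_elastic_pml_ms * SCALE)   ! -> 47

  print *, 'acoustic element weight      = ', w_acoustic
  print *, 'elastic + PML element weight = ', w_elastic_pml
end program scotch_weight_demo
```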

EtienneBachmann commented 8 years ago

Hi Dimitri,

Are you sure 5/ is not done? I have always seen non-blocking communications in the 2D version (MPI_Isend, MPI_Irecv, MPI_Wait).

By the way, I am not sure about conventional CPU runs, but for GPU runs in 2D, communications can be more costly than computations (removing the MPI_Wait improves the computation time, but the wave propagation is then meaningless), which has never been the case in 3D in my experience so far. This might change with the next generation of Nvidia GPUs, which will have more global memory. The difference between 2D and 3D comes from the cost of computing the internal forces, which is lower in 2D (complexity in O(NGLL^Ndim)). That consideration might be a real bottleneck for strong scaling, as you mention.

Best regards,

Etienne

komatits commented 8 years ago

Hi Etienne,

Thanks! For sure in 2008 the 2D code was non-blocking (see http://komatitsch.free.fr/preprints/Vecpar_2008_Roland.pdf ) but this morning we got an email from Daniel (cc'ed) saying that some of the 2D routines were blocking; that is why I added this.

Daniel, could you clarify which routines are affected?

To all: even in a non-blocking code there is an upper limit to strong scaling when partitions become too small, see http://komatitsch.free.fr/preprints/Vecpar_2008_Roland.pdf Figures 2b and 5b.

Thanks, Dimitri.


danielpeter commented 8 years ago

hi all,

:) yes, we’re both right.

Etienne is right that the 2D code uses asynchronous MPI calls MPI_ISEND and MPI_IRECV, but the scheme is still blocking. you can look at e.g. the elastic assembly, which is done in the routine assemble_MPI_vector_el(). you will see that it sends/receives buffers and then waits until everything is received, and only then continues. so the scheme is not overlapping the stiffness computation of inner elements with MPI communication (in fact, you could use blocking MPI send/recv calls and the runtime results would likely be similar).

using an overlapping scheme might improve scaling, but we would have to see how much it really helps. again, Etienne is right that the MPI communication becomes more dominant in 2D than in 3D compared to the stiffness computations. so don’t expect too much of an improvement…
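
for reference, here is a minimal sketch of what such an overlapping scheme could look like; the routine and variable names are hypothetical, not the actual SPECFEM2D interfaces. the idea is to post the sends/receives for the partition-boundary buffers, compute the inner elements while the messages are in flight, and only then wait and assemble the outer elements:

```fortran
! minimal sketch of an overlapping assembly scheme (hypothetical names,
! not the actual SPECFEM2D routines): communication of the boundary
! buffers is hidden behind the stiffness computation of inner elements.
subroutine assemble_overlapping(buffer_send, buffer_recv, num_neighbors, &
                                neighbor_rank, nvalues)
  use mpi
  implicit none
  integer, intent(in) :: num_neighbors, nvalues
  integer, intent(in) :: neighbor_rank(num_neighbors)
  real, intent(inout) :: buffer_send(nvalues,num_neighbors)
  real, intent(inout) :: buffer_recv(nvalues,num_neighbors)
  integer :: requests(2*num_neighbors), ier, i

  ! 1) post all receives and sends for the partition-boundary buffers
  do i = 1, num_neighbors
    call MPI_IRECV(buffer_recv(:,i), nvalues, MPI_REAL, neighbor_rank(i), &
                   0, MPI_COMM_WORLD, requests(i), ier)
    call MPI_ISEND(buffer_send(:,i), nvalues, MPI_REAL, neighbor_rank(i), &
                   0, MPI_COMM_WORLD, requests(num_neighbors + i), ier)
  enddo

  ! 2) while the messages are in flight, compute the stiffness terms of
  !    the inner elements (those not touching a partition boundary)
  ! call compute_forces_inner_elements()          ! placeholder

  ! 3) only now wait for all communication to complete
  call MPI_WAITALL(2*num_neighbors, requests, MPI_STATUSES_IGNORE, ier)

  ! 4) add the received contributions for the outer (boundary) elements
  ! call assemble_outer_elements(buffer_recv)     ! placeholder
end subroutine assemble_overlapping
```

whether the extra bookkeeping pays off in 2D is exactly the open question above, since the inner-element work per partition is small.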

well, Figure 5 in the 2008 paper below shows that blocking scales up to about 16 procs while non-blocking scales up to about 32 procs for the example chosen. to my surprise, there is not much difference in total runtimes. maybe i'm not quite sure how to read that figure, i would have expected shorter runtimes for the non-blocking example. anyway, it certainly depends on how you partition the problem: the bigger the partitions, the better you can hide the network latency (oh well, figure 2 in that paper with 80 domains would be a very bad choice for partitioning ;) i'm not sure what the reason was to drop the overlapping communication part later on again?

best wishes, daniel


komatits commented 8 years ago

From Daniel @danielpeter: looking at the 2D code, i think point 3/ becomes overkill.

i agree that in terms of performance, putting all if-statements out of the loop would make it faster. in terms of maintenance, it is a nightmare to keep 10+ versions of almost the same routine up-to-date when there is a small change somewhere. so considering the tradeoff between speed and maintenance, in this case i would opt for maintenance.

let me see how we can simplify the compute_forces routine a bit and still have it fast enough, while keeping the main stiffness computation in a single routine...

komatits commented 8 years ago

Daniel @danielpeter is working on this; Daniel if you are done please let us know and/or close this issue. Thanks. Dimitri.

komatits commented 7 years ago

More or less done.