SPECFEM / specfem3d_globe

SPECFEM3D_GLOBE simulates global and regional (continental-scale) seismic wave propagation.
GNU General Public License v3.0

GPU version of adjoint runs could be made faster #585

Closed: komatits closed this issue 6 years ago

komatits commented 7 years ago

From Etienne @EtienneBachmann, after already fixing most of the GPU slowdown for very large adjoint runs, here are some new ideas:

To further improve the GPU code, and maybe go below the 2:1 ratio on adjoint runs, here are a few suggestions, based on my work in 2D:

In the Newmark update routine:

- The kernels that update accel and update veloc (which uses accel for the update) can be merged; this will reduce memory accesses (see the sketch below).
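For illustration, here is a minimal CUDA sketch of this merge, assuming scalar (acoustic-style) fields; the names are illustrative, not the actual SPECFEM kernel signatures:

```cuda
// Merged Newmark corrector: one kernel reads accel once, scales it by the
// inverse mass matrix, and immediately uses the result to update veloc,
// instead of two kernels that each stream accel through global memory.
__global__ void update_accel_veloc_merged(float *accel, float *veloc,
                                          const float *rmass_inv,
                                          float deltatover2, int nglob) {
  int iglob = blockIdx.x * blockDim.x + threadIdx.x;
  if (iglob < nglob) {
    float a = accel[iglob] * rmass_inv[iglob]; // divide by the mass matrix
    accel[iglob] = a;                          // single write of accel
    veloc[iglob] += deltatover2 * a;           // reuse a from a register
  }
}
```

Compared to two back-to-back launches, accel is streamed through global memory once instead of twice, and one kernel-launch overhead is saved.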

Also, for the compute_forces_acoustic routine, it looks like on an NVIDIA P100 GPU there is a significant speedup, around 40%, from merging the calls for the direct and adjoint wavefields (sketched below). But if the acoustic elements represent only 5% of the computing time, I am not sure it is worth applying.
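A minimal sketch of what such a merge could look like; the names are illustrative and the update is simplified to a single indirect accumulation rather than the full spectral-element stencil:

```cuda
// One launch updates both wavefields: the mesh indirection (ibool) and the
// material value are loaded once and reused for both fields, instead of
// two launches that each reload them from global memory.
__global__ void acoustic_update_merged(const int *ibool, const float *rho_inv,
                                       const float *contrib, float *pdd,
                                       const float *b_contrib, float *b_pdd,
                                       int nloc) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nloc) {
    int ig  = ibool[i];    // indirection read once instead of twice
    float r = rho_inv[i];  // material value read once instead of twice
    atomicAdd(&pdd[ig],   r * contrib[i]);   // forward (direct) field
    atomicAdd(&b_pdd[ig], r * b_contrib[i]); // backward (adjoint) field
  }
}
```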

EtienneBachmann commented 6 years ago

Hi all,

Another idea to improve adjoint run speed is to merge the GPU kernels in the compute_kernels routine, where the rho kernels and the other kernels are computed separately; this should not affect the readability of the code. It could also be wise to introduce a flag COMPUTE_RHO_KERNELS, to avoid computing the rho kernels when they are not needed. Their cost is significant in the acoustic case (here, the outer core) because of the calls to the compute_gradient routines: to give an idea, I obtain a 25% speedup on my purely acoustic adjoint simulation just by commenting out the calculation of the acoustic rho kernel.

I am not a specialist of large runs on clusters, but I suspect that even in a perfectly balanced mesh, calibrated to run with more acoustic elements than elastic ones because of the compute_forces routines, the acoustic rho kernel calculation becomes an important bottleneck that slows down the whole simulation once the kernel computation starts. (A sketch of the proposed merge follows below.)
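A hedged sketch of the proposed merge for the acoustic case; the names and kernel expressions are illustrative, not the exact SPECFEM formulas:

```cuda
// One kernel accumulates both the kappa and (optionally) the rho sensitivity
// kernels, so the wavefields are streamed through global memory once instead
// of twice; a COMPUTE_RHO_KERNELS flag skips the expensive gradient-based
// rho contribution when it is not needed.
__global__ void compute_acoustic_kernels_merged(
    const float *potential_dot_dot, const float *b_potential,
    const float *grad, const float *b_grad,   // precomputed gradients (3*nglob)
    float *kappa_kl, float *rho_kl,
    float dt, int compute_rho_kernels, int nglob) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nglob) {
    // kappa kernel: interaction of the forward and backward potentials
    kappa_kl[i] += dt * potential_dot_dot[i] * b_potential[i];
    if (compute_rho_kernels) {
      // rho kernel: dot product of the two gradients -- the part measured
      // above at roughly 25% of a purely acoustic adjoint run
      rho_kl[i] += dt * (grad[3*i]   * b_grad[3*i] +
                         grad[3*i+1] * b_grad[3*i+1] +
                         grad[3*i+2] * b_grad[3*i+2]);
    }
  }
}
```

In practice the flag could be checked once on the host, so the compute_gradient calls are skipped entirely rather than branched on inside the kernel.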

Best regards,

Etienne

komatits commented 6 years ago

Hi Etienne, Hi all,

Thanks a lot. It would be great to implement that.

See also this other suggestion you had made a while ago:
https://github.com/geodynamics/specfem3d/issues/1069

By the way, there is also this one in the Git issues; could you let me know if I can close it?
https://github.com/geodynamics/specfem3d/issues/1078

Thanks, Best, Dimitri.


komatits commented 6 years ago

Hi Etienne,

PS: I have opened https://github.com/geodynamics/specfem3d/issues/1146

Thanks, Dimitri.


EtienneBachmann commented 6 years ago

Hi Dimitri,

Indeed, these remarks are common to all the specfem versions. I am planning to add the features described in this topic in the coming weeks, for the Cartesian version at least.

As for issue 1078, it has not been done yet, but I'll do it when I need to add absorbing conditions in my 3D experiments. We should leave it open, because it will be important for people using specfem3D for goals other than exploration geophysics.

Best,

Etienne

komatits commented 6 years ago

Hi Etienne, Hi all,

Thanks a lot! No rush at all. Let us just remember to go over the Git issues from time to time and close those that have been implemented.

Regarding boundary conditions (Stacey, PML) and time reversal, Vadim is going to implement UNDO_ATTENUATION in 3D_Cartesian this week (cutting and pasting it from 3D_GLOBE). This will work for PMLs as well, and thus there is no need to write any new routine for PML in the adjoint case: we will just see the PML as an absorbing material, which UNDO_ATTENUATION will reverse without any problem (without even knowing it :-) I thus closed https://github.com/geodynamics/specfem3d/issues/312

Best wishes, Dimitri.


EtienneBachmann commented 6 years ago

Yes, it will be a very nice and useful contribution! Note that the name of the flag UNDO_ATTENUATION could be more general, since this option can also serve to use PMLs in adjoint runs, regardless of attenuation.

komatits commented 6 years ago

Yes, and that option is useful even in the case of elastic runs (no attenuation), because when computing sensitivity kernels / adjoint runs it accumulates numerical dispersion over NSTEP time steps instead of 2*NSTEP (see the paragraph right above Section 3 at http://komatitsch.free.fr/preprints/GJI_undo_attenuation_2016.pdf ). For large runs or for fluid runs such a difference can matter, in particular in the case of the Newmark time scheme, which is only second-order accurate; see for instance https://github.com/geodynamics/specfem3d/issues/1018 (a rough estimate is sketched below).
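As a back-of-the-envelope estimate, assuming the standard result that a second-order scheme such as Newmark has a relative phase-velocity error of order $(\omega\,\Delta t)^2$, the accumulated phase error grows linearly with the number of steps $N$:

```latex
\[
  \phi_{\mathrm{err}}(N) \;\approx\; C \, N \, \omega\Delta t \,(\omega\Delta t)^{2},
  \qquad\text{hence}\qquad
  \frac{\phi_{\mathrm{err}}(2\,\mathrm{NSTEP})}{\phi_{\mathrm{err}}(\mathrm{NSTEP})} \;\approx\; 2 .
\]
```

So the classical adjoint scheme, in which the forward field travels NSTEP steps forward and then NSTEP steps backward, picks up roughly twice the dispersion of the UNDO_ATTENUATION reconstruction, which never integrates the forward field over more than NSTEP steps.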

Cheers, Dimitri.


danielpeter commented 6 years ago

hi Etienne,

concerning d_b_rmass_outer_core, that's a pointer copy for calling backward routines. the idea is to have a clearer code separation for forward and adjoint fields: anything related to backward/reconstructed fields has a b_ prefix. (the d_ comes from the array being on the device (GPU), as opposed to h_ for arrays on the host (CPU), although d_ is not used entirely consistently.)

i guess that removing this pointer and using d_rmass_outer_core instead would lead to a mess in some routine arguments and would thus likely introduce bugs. anyway, it doesn't need more memory since it's just a pointer.

the pointer copy is done in prepare_mesh_constants_gpu.c: mp->d_b_rmass_outer_core = gpuTakeRef(mp->d_rmass_outer_core);
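for illustration, a minimal sketch of the aliasing idea in plain C (gpuTakeRef's actual definition in the SPECFEM sources may differ; here it is assumed to simply hand back the same pointer):

```c
#include <stdio.h>

typedef float realw;

static realw *gpuTakeRef(realw *d_array) {
  return d_array;   /* no allocation, no copy: both names alias one buffer */
}

int main(void) {
  realw rmass[3] = {1.f, 2.f, 3.f};          /* stand-in for the device array */
  realw *d_rmass_outer_core   = rmass;
  realw *d_b_rmass_outer_core = gpuTakeRef(d_rmass_outer_core);

  /* backward routines can take a d_b_* argument everywhere, keeping the
     forward/backward naming separation at zero memory cost */
  printf("%d\n", d_b_rmass_outer_core == d_rmass_outer_core);  /* prints 1 */
  return 0;
}
```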

best, daniel

komatits commented 6 years ago

Hi Daniel and Etienne, Hi all,

Thanks a lot! Let me thus close this issue in Git.

Thanks, Best regards, Dimitri.
