SPECFEM / specfem3d_globe

SPECFEM3D_GLOBE simulates global and regional (continental-scale) seismic wave propagation.
GNU General Public License v3.0

slowdown for runs with a very large number of receivers (even worse if they are unevenly distributed, which is almost always the case) #580

Closed by komatits 7 years ago

komatits commented 7 years ago

From Etienne @EtienneBachmann, from https://github.com/geodynamics/specfem3d/issues/1008

Perfect, I'll work on it. Regarding splitting the arrays, the job is done in acoustics; I am working on the elastic case now. Results look encouraging, with a run 25% faster in a small single-MPI-slice case (in acoustics, that is). Note that my new implementation will not decrease disk I/O that much (we still need to read all the data). But aside from disk I/O, proceeding like this also avoids large CPU <==> GPU transfers at each time step, which are also quite costly, and even more so when it comes to sets of receivers unequally distributed among MPI slices. Much less data has to be transferred (NGLL3 times less). This may be the actual bottleneck that @ebrubozdag is experiencing.
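For a rough sense of the data-volume reduction Etienne describes, here is a back-of-the-envelope sketch. The NGLL3 = 125 and NDIM = 3 values correspond to the usual 5x5x5 GLL points and three components; the 2000 receivers per slice is a made-up illustration, not a number from this thread.

```c
/* Back-of-envelope estimate of the per-time-step host-to-device traffic
 * for adjoint sources, before and after the NGLL3 reduction discussed
 * above.  The receiver count is an illustrative assumption only. */
#include <stdio.h>

int main(void) {
  const long ngll3      = 5 * 5 * 5; /* GLL points per spectral element  */
  const long ndim       = 3;         /* components                       */
  const long nrec_local = 2000;      /* hypothetical receivers per slice */
  const long bytes      = 4;         /* single precision                 */

  long per_step_old = nrec_local * ndim * ngll3 * bytes; /* full adj_sourcearrays slice */
  long per_step_new = nrec_local * ndim * bytes;         /* one value per component     */

  printf("old: %ld bytes/step, new: %ld bytes/step (%.0fx smaller)\n",
         per_step_old, per_step_new, (double)per_step_old / per_step_new);
  return 0;
}
```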

komatits commented 7 years ago

Great, thanks! The improvements will be useful.

Regarding Ebru's @ebrubozdag runs, there is also the imbalance of the receiver distribution between MPI slices; see the attached histogram I created using her input files (basically dense arrays in some parts of the Earth, almost nothing in the oceans, while SPECFEM3D_GLOBE decomposes the surface of the cubed sphere evenly...). She is probably seeing a combination of both slowdowns. The second is not easy to fix (Malte was working on it).

Best regards, Dimitri. @danielpeter @mpbl @schirwon

[Attached figure: confirmation that there is very significant imbalance in the number of receivers per slice]
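For anyone who wants to reproduce such a histogram on their own runs, a minimal MPI diagnostic could look roughly like the following; this is not code from the repository, and nrec_local is only a stand-in for the real per-slice receiver count.

```c
/* Minimal diagnostic sketch (not SPECFEM code): gather the number of
 * receivers owned by each MPI slice and report the imbalance.
 * nrec_local is a placeholder for the real per-slice receiver count. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  srand(12345 + rank);
  int nrec_local = rand() % 100; /* placeholder: receivers in this slice */

  int *counts = NULL;
  if (rank == 0) counts = malloc(size * sizeof(int));
  MPI_Gather(&nrec_local, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    int min = counts[0], max = counts[0];
    for (int i = 1; i < size; i++) {
      if (counts[i] < min) min = counts[i];
      if (counts[i] > max) max = counts[i];
    }
    printf("receivers per slice: min %d, max %d\n", min, max);
    free(counts);
  }
  MPI_Finalize();
  return 0;
}
```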

komatits commented 7 years ago

To all: Note that Etienne @EtienneBachmann will improve that in SPECFEM3D_Cartesian; someone will then need to cut and paste it into SPECFEM3D_GLOBE to fix the slowdowns in Ebru's runs. @schirwon

EtienneBachmann commented 7 years ago

Hi all,

I am finally working on fixing this problem in SPECFEM3D_GLOBE too. The work consists of 1) splitting the array adj_sourcearrays into smaller ones, and 2) adding GPU support for seismograms.

The goal is, in both cases, to reduce the amount of data to transfer between CPU and GPU, and to free up some RAM / GPU global memory.

I will first focus on 1). As I don't have experience in running massively parallel jobs (I usually launch jobs with up to 4 MPI slices), your help/suggestions regarding the way to implement it are welcome!

One remark regarding the future implementation: I saw that, compared to the Cartesian version, asynchronous transfers are used at each time step to load adj_sourcearrays_slice into GPU memory. In my future commit, the new array (source_adjoint) will be NGLL3 times smaller. This array is loaded entirely before the time loop in the 2D version, and every NTSTEP_BETWEEN_READ_ADJSRC time steps in 3D Cartesian. My guess is that in the 3D GLOBE version we should still prefer a load at every time step?
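For reference, the per-time-step option asked about here would look roughly like the sketch below; the routine name and argument layout are hypothetical and only illustrate the pattern, not the actual SPECFEM interface.

```c
/* Illustrative sketch of the per-time-step option: copy a small
 * source_adjoint buffer to the GPU asynchronously on a dedicated stream.
 * Names and the data layout are placeholders, not the SPECFEM code. */
#include <cuda_runtime.h>

void push_adjoint_sources(const float *h_source_adjoint, /* pinned host buffer */
                          float *d_source_adjoint,       /* device buffer      */
                          size_t nrec_local, size_t ndim,
                          cudaStream_t stream) {
  size_t nbytes = nrec_local * ndim * sizeof(float);
  /* The copy is only truly asynchronous if the host buffer is pinned
   * (cudaHostAlloc / cudaHostRegister). */
  cudaMemcpyAsync(d_source_adjoint, h_source_adjoint, nbytes,
                  cudaMemcpyHostToDevice, stream);
  /* Kernels that consume d_source_adjoint must be launched on the same
   * stream, or only after a cudaStreamSynchronize(stream). */
}
```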

As an aside, the BOAST version I have (2.0.2) is slightly newer than the one used to generate the kernels in the devel version. Should I update all the generated kernels? And do I also have to update the reference CUDA kernels (in boast/references)?

Thanks, Best regards,

Etienne

komatits commented 7 years ago

Hi Etienne,

Thanks! Yes, I think it is a good idea to upgrade to the latest version of BOAST, but Daniel (cc'ed) can confirm.

Regarding how often seismograms are transferred, I guess the same implementation as in 3D_Cartesian would probably be best, i.e. every NTSTEP_BETWEEN_READ_ADJSRC time steps (again, Daniel can let us know; he has worked on that in the past). Usually we try to have very similar (ideally identical) implementations between 3D_Cartesian and 3D_GLOBE.
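For comparison, the batched variant suggested here (one upload every NTSTEP_BETWEEN_READ_ADJSRC time steps, then indexing into the chunk on the device) could be sketched as follows; again the names are illustrative, not the actual SPECFEM3D routines.

```c
/* Sketch of the batched alternative: upload a chunk covering
 * NTSTEP_BETWEEN_READ_ADJSRC time steps in a single transfer; each time
 * step then reads its slice from the device-resident chunk.
 * All names are placeholders. */
#include <cuda_runtime.h>

void push_adjoint_chunk(const float *h_chunk, float *d_chunk,
                        size_t nrec_local, size_t ndim, size_t nsteps_in_chunk,
                        cudaStream_t stream) {
  size_t nbytes = nrec_local * ndim * nsteps_in_chunk * sizeof(float);
  cudaMemcpyAsync(d_chunk, h_chunk, nbytes, cudaMemcpyHostToDevice, stream);
  /* A consuming kernel at local step `it` would start reading at
   * d_chunk + it * nrec_local * ndim. */
}
```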

Thanks, Best wishes, Dimitri.

EtienneBachmann commented 7 years ago

Hi Dimitri,

Alright, but I guess that for the sake of performance plus the extra memory, it is better to transfer the adjoint sources (no seismograms for the moment) at every time step. But I am not sure about this, and the potential speedup may well depend on it; I haven't experimented with asynchronous transfers so far. If so, I can update the 3D Cartesian version accordingly (this is very easy to do).

Best regards,

Etienne

komatits commented 7 years ago

Hi Etienne, Hi all,

Thanks a lot. Let me cc Daniel, Max, Peter, Brice and also Vadim, who are the best experts on the current structure of the GPU part of 3D_Cartesian and of 3D_GLOBE (depending on their respective projects: Brice knows 3D_GLOBE very well but does not use 3D_Cartesian; for Vadim it is the opposite). I suggest you call at least Daniel on Skype to discuss what is best to do (I am not sure transferring at each time step is optimal, but I am not an expert).

Thanks! Best wishes, Dimitri.

komatits commented 7 years ago

Hi all,

Let me also cc Malte, who is a GPU expert (but I am not sure if he has tested that adjoint source part already).

Cheers, Dimitri.

komatits commented 7 years ago

Hi again all,

Regarding the very significant imbalance, shown in the figure above, in the distribution of receivers and thus of adjoint sources between MPI slices in Ebru's runs (also true for anyone running at the global scale), I see two possible options:

1/ drastically speeding up the recording of seismograms (already done I guess, or close to done?), and then drastically speeding up the transfer and use of adjoint sources (what Etienne is trying to do right now).

or

2/ we could do these calculations (if expensive) on other processors, i.e.:

  • balance the receivers evenly between MPI slices, ignoring their real physical location

  • create a dedicated MPI communicator based on that

  • every time we have to compute something at the receivers and/or use adjoint sources, send the calculations to the balanced set of stations using that dedicated communicator, and then get the result back

  • this would balance that part of the calculations, if we cannot make its cost close to negligible based on 1/

Of course 1/ would be much better; 2/ is a significant modification and a bit painful to implement I guess (at least not straightforward), and it leads to numerous additional MPI communications (of small volumes, though). A rough sketch of the communicator idea in 2/ is given below.
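The sketch below uses entirely hypothetical names and omits the redistribution bookkeeping; it is only meant to show where the dedicated communicator and the extra exchanges would sit, not how SPECFEM3D_GLOBE would actually implement option 2/.

```c
/* Very rough sketch of option 2/ (not an implementation): do the
 * receiver-related work on an evenly balanced set of ranks, reached
 * through a dedicated communicator, and return the results afterwards.
 * All names are hypothetical and the counts/displacements bookkeeping
 * for the exchange is omitted. */
#include <mpi.h>

void balance_receiver_work(MPI_Comm world) {
  /* 1) A dedicated communicator for receiver work, here simply a
   *    duplicate of the world communicator, so that receiver traffic is
   *    kept separate from the wavefield halo exchanges. */
  MPI_Comm recv_comm;
  MPI_Comm_dup(world, &recv_comm);

  /* 2) Even redistribution of receiver data, ignoring physical location;
   *    MPI_Alltoallv over recv_comm would be the general form. */

  /* 3) Compute seismograms / apply adjoint sources on the balanced set,
   *    then reverse the exchange to return the results to the slices
   *    that own the receivers geometrically. */

  MPI_Comm_free(&recv_comm);
}
```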

Best wishes, Dimitri.

komatits commented 7 years ago

Using the right email address for Max.

Dimitri.

EtienneBachmann commented 7 years ago

Hi all,

I have finished implementing the splitting of the adjoint arrays. I'll commit it once I am sure about what should be committed. I have started adding GPU support for seismograms. If I understood the code correctly, the current asynchronous transfer is not useful: the async copy is directly followed by a stream synchronize (in the next subroutine), with no other operations between the copy and the sync. To simplify my job, I'll only implement it for sim_type = 1 or 3.
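The pattern described here can be illustrated with the small fragment below (not the actual SPECFEM routines): an asynchronous copy that is immediately followed by a stream synchronize gains nothing over a plain blocking copy, because no independent work is enqueued in between.

```c
/* Illustration only: async copy + immediate synchronize is effectively
 * a blocking copy, since nothing else runs between the two calls. */
#include <cuda_runtime.h>

void effectively_blocking_copy(float *d_buf, const float *h_buf,
                               size_t nbytes, cudaStream_t stream) {
  cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream); /* waits right away, same as cudaMemcpy */
}

/* Overlap only pays off if independent kernels or another transfer are
 * issued between the MemcpyAsync and the synchronize, or if the consumer
 * kernel is simply launched on the same stream with no explicit sync. */
```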

Etienne

komatits commented 7 years ago

Hi Etienne, Hi all,

Great! I also seem to remember that asynchronous transfers are not useful in that case, because we use the information right after posting the transfer, i.e. it becomes blocking in practice; Daniel is more familiar with that part of the code than I am and can thus confirm (or not).

Thanks, Best wishes,

Dimitri.

danielpeter commented 7 years ago

hi Dimitri & Etienne,

i'm not sure if the async is followed by a sync right after. it might depend on the user setting choices, but i'll check when back at work (in 1 week). my experience was that the async did help with the adjoint reading part.

best, daniel

komatits commented 7 years ago

Hi Daniel,

Thanks! Please keep us posted when you are back at work.

Thanks, Best,

Dimitri.

komatits commented 7 years ago

From Youyi:

Just a quick update: the latest test runs of SPECFEM3D_GLOBE with your modifications, compared with the version we are currently using for our global adjoint tomography, did not show a performance improvement in the forward simulation (new 8 min 38 s vs old 8 min 05 s), but I am positive that the results are correct. However, the adjoint simulation showed a SIGNIFICANT improvement: 17 min 19 s (new) vs 26 min 39 s (old), almost a 35% reduction. The forward-to-adjoint run-time ratio is now down to about 1:2, which seems close to what we expected ideally. I will keep you posted when the correctness check of the ADIOS kernels is done.

Best,

Youyi
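A quick check of Youyi's numbers, with the quoted times converted to seconds:

```c
/* Quick arithmetic check of the timings quoted above. */
#include <stdio.h>

int main(void) {
  const double fwd_new = 8 * 60 + 38;  /*  518 s */
  const double adj_old = 26 * 60 + 39; /* 1599 s */
  const double adj_new = 17 * 60 + 19; /* 1039 s */

  printf("adjoint reduction: %.0f%%\n",
         100.0 * (adj_old - adj_new) / adj_old);              /* ~35% */
  printf("adjoint/forward ratio: %.1f\n", adj_new / fwd_new); /* ~2.0 */
  return 0;
}
```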

komatits commented 7 years ago

Hi Youyi and Etienne, Hi all,

This is really excellent news, and a great contribution by Etienne. As you say, since the adjoint/forward cost ratio is now close to 2, we are all set and we cannot do much better than that (it will never be exactly 2 because we do a few more things: reading and using adjoint sources, computing kernels on the fly, etc.).

This also means that issue https://github.com/geodynamics/specfem3d_globe/issues/580, about the very uneven distribution of receivers at the surface of the Earth and the resulting very significant imbalance, no longer plays any significant role, now that handling adjoint sources has become almost instantaneous. This is good news, because fixing that would have been difficult, since SPECFEM3D_GLOBE uses a static geometrical decomposition of the cubed sphere. Let me close that Git issue as well then, and Etienne can commit his nice contribution.

Thanks, Dimitri.

komatits commented 7 years ago

Solved by Etienne @EtienneBachmann (indirectly, see above)!