E3SM-Project / HOMMEXX

Clone of ACME for CMDV-SE project to convert HOMME to C++

Caar parallel scan #295

Closed bartgol closed 6 years ago

bartgol commented 6 years ago

This PR introduces the use of Kokkos::parallel_scan with ThreadVectorRange. It does so only on GPU: on CPU/KNL the scan would not actually run in parallel, so the extra complexity would bring no benefit there.
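To make the scan pattern concrete, here is a minimal sketch of the kind of body that a Kokkos::parallel_scan over a ThreadVectorRange takes. It is not the code from this PR: `serial_scan` is a hypothetical serial stand-in so that the example compiles without Kokkos, and `column_running_sum` and `dp` are illustrative names for a running sum over the levels of one column.

```cpp
#include <vector>

// Hypothetical serial stand-in for a Kokkos-style scan (not part of Kokkos
// or HOMMEXX). It calls body(k, accum, last) for k = 0..n-1 with a running
// accumulator. A single serial pass can always pass last == true.
template <typename Body>
void serial_scan(const int n, Body body) {
  double accum = 0.0;
  for (int k = 0; k < n; ++k) {
    body(k, accum, true);
  }
}

// Inclusive running sum over one column: p[k] = dp[0] + ... + dp[k].
std::vector<double> column_running_sum(const std::vector<double>& dp) {
  std::vector<double> p(dp.size());
  const int n = static_cast<int>(dp.size());

  // Scan body in the shape Kokkos expects: add this index's contribution,
  // then store the result only when 'last' is true. A parallel scan may
  // call the body with last == false on passes that only accumulate.
  auto body = [&](const int k, double& accum, const bool last) {
    accum += dp[k];
    if (last) {
      p[k] = accum;
    }
  };

  // On GPU the same body would be handed to Kokkos, roughly as
  //   Kokkos::parallel_scan(Kokkos::ThreadVectorRange(team, n), body);
  // where 'team' is the team member handle (assumed here, not shown).
  // On CPU/KNL a plain serial loop does the same work.
  serial_scan(n, body);
  return p;
}
```

For the input {1, 2, 3, 4} this returns {1, 3, 6, 10}. Guarding the store with `last` is what lets the same body serve both the serial loop and a multi-pass parallel scan.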

I'm submitting this PR now, but I still have to run the (perf) tests; so far I have only run the unit tests. I will build the baselines and run the tests tomorrow.

This addresses issue #288.

bartgol commented 6 years ago

Timings on a P100 (single node), ne=4, ndays=2. The runs were interleaved, but I grouped them by branch for readability.

master:

| routine | time, run 1 | time, run 2 | time, run 3 |
|---|---|---|---|
| prim_main_loop | 3.080 | 3.091 | 3.071 |
| tl-ae U3-5stage_timestep | 0.689 | 0.689 | 0.687 |
| tl-ae advance_hypervis_dp | 0.652 | 0.652 | 0.651 |
| tl-at prim_advec_tracers_remap_RK2 | 0.625 | 0.624 | 0.623 |
| tl-sc vertical_remap | 0.080 | 0.081 | 0.081 |

branch:

| routine | time, run 1 | time, run 2 | time, run 3 |
|---|---|---|---|
| prim_main_loop | 2.957 | 2.947 | 2.949 |
| tl-ae U3-5stage_timestep | 0.570 | 0.568 | 0.568 |
| tl-ae advance_hypervis_dp | 0.654 | 0.652 | 0.651 |
| tl-at prim_advec_tracers_remap_RK2 | 0.624 | 0.626 | 0.623 |
| tl-sc vertical_remap | 0.081 | 0.081 | 0.080 |
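
The master and branch timings above can be summarized by averaging the three runs of each timer and taking the relative reduction. The sketch below does that for the two timers that change; the helper names are illustrative, and the numbers are copied from the timings above. The remaining timers differ by at most a few thousandths between master and branch.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of a set of timings.
double mean(const std::vector<double>& v) {
  return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

// Fractional reduction of the mean time when going from master to branch.
double relative_reduction(const std::vector<double>& master,
                          const std::vector<double>& branch) {
  const double m = mean(master);
  return (m - mean(branch)) / m;
}

// Timings copied from the master and branch results above.
const std::vector<double> main_loop_master = {3.080, 3.091, 3.071};
const std::vector<double> main_loop_branch = {2.957, 2.947, 2.949};
const std::vector<double> u3_master = {0.689, 0.689, 0.687};
const std::vector<double> u3_branch = {0.570, 0.568, 0.568};
```

With these inputs, `relative_reduction(main_loop_master, main_loop_branch)` comes out at about 0.042 (roughly 4%) and `relative_reduction(u3_master, u3_branch)` at about 0.174 (roughly 17%), so nearly all of the gain in prim_main_loop comes from U3-5stage_timestep.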