mfdeakin-sandia closed this pull request 6 years ago
Skylake performance, 1 node, 48 processes, 96 elements, 40 tracers:
master.1:prim_main_loop 1.979
master.1:tl-ae U3-5stage_timestep 0.151
master.1:tl-ae advance_hypervis_dp 0.092
master.1:tl-at prim_advec_tracers_remap_RK2 1.683
master.1:tl-sc vertical_remap 0.091
master.2:prim_main_loop 1.996
master.2:tl-ae U3-5stage_timestep 0.164
master.2:tl-ae advance_hypervis_dp 0.091
master.2:tl-at prim_advec_tracers_remap_RK2 1.701
master.2:tl-sc vertical_remap 0.093
master.3:prim_main_loop 2.002
master.3:tl-ae U3-5stage_timestep 0.157
master.3:tl-ae advance_hypervis_dp 0.093
master.3:tl-at prim_advec_tracers_remap_RK2 1.708
master.3:tl-sc vertical_remap 0.091
master.4:prim_main_loop 1.992
master.4:tl-ae U3-5stage_timestep 0.140
master.4:tl-ae advance_hypervis_dp 0.096
master.4:tl-at prim_advec_tracers_remap_RK2 1.696
master.4:tl-sc vertical_remap 0.092
master.5:prim_main_loop 1.998
master.5:tl-ae U3-5stage_timestep 0.152
master.5:tl-ae advance_hypervis_dp 0.093
master.5:tl-at prim_advec_tracers_remap_RK2 1.702
master.5:tl-sc vertical_remap 0.092
remap_update.1:prim_main_loop 2.001
remap_update.1:tl-ae U3-5stage_timestep 0.168
remap_update.1:tl-ae advance_hypervis_dp 0.093
remap_update.1:tl-at prim_advec_tracers_remap_RK2 1.706
remap_update.1:tl-sc vertical_remap 0.092
remap_update.2:prim_main_loop 1.991
remap_update.2:tl-ae U3-5stage_timestep 0.146
remap_update.2:tl-ae advance_hypervis_dp 0.093
remap_update.2:tl-at prim_advec_tracers_remap_RK2 1.693
remap_update.2:tl-sc vertical_remap 0.092
remap_update.3:prim_main_loop 2.001
remap_update.3:tl-ae U3-5stage_timestep 0.158
remap_update.3:tl-ae advance_hypervis_dp 0.092
remap_update.3:tl-at prim_advec_tracers_remap_RK2 1.705
remap_update.3:tl-sc vertical_remap 0.093
remap_update.4:prim_main_loop 1.998
remap_update.4:tl-ae U3-5stage_timestep 0.150
remap_update.4:tl-ae advance_hypervis_dp 0.091
remap_update.4:tl-at prim_advec_tracers_remap_RK2 1.705
remap_update.4:tl-sc vertical_remap 0.092
remap_update.5:prim_main_loop 1.994
remap_update.5:tl-ae U3-5stage_timestep 0.146
remap_update.5:tl-ae advance_hypervis_dp 0.092
remap_update.5:tl-at prim_advec_tracers_remap_RK2 1.698
remap_update.5:tl-sc vertical_remap 0.093
P100 performance, 1 GPU, 1 process, 96 elements, 40 tracers:
master.1:prim_main_loop 2.466
master.1:tl-ae U3-5stage_timestep 0.347
master.1:tl-ae advance_hypervis_dp 0.325
master.1:tl-at prim_advec_tracers_remap_RK2 1.598
master.1:tl-sc vertical_remap 0.141
master.2:prim_main_loop 2.474
master.2:tl-ae U3-5stage_timestep 0.345
master.2:tl-ae advance_hypervis_dp 0.324
master.2:tl-at prim_advec_tracers_remap_RK2 1.595
master.2:tl-sc vertical_remap 0.141
master.3:prim_main_loop 2.458
master.3:tl-ae U3-5stage_timestep 0.345
master.3:tl-ae advance_hypervis_dp 0.323
master.3:tl-at prim_advec_tracers_remap_RK2 1.594
master.3:tl-sc vertical_remap 0.141
master.4:prim_main_loop 2.460
master.4:tl-ae U3-5stage_timestep 0.345
master.4:tl-ae advance_hypervis_dp 0.324
master.4:tl-at prim_advec_tracers_remap_RK2 1.595
master.4:tl-sc vertical_remap 0.141
master.5:prim_main_loop 2.478
master.5:tl-ae U3-5stage_timestep 0.346
master.5:tl-ae advance_hypervis_dp 0.326
master.5:tl-at prim_advec_tracers_remap_RK2 1.597
master.5:tl-sc vertical_remap 0.141
remap_update.1:prim_main_loop 2.478
remap_update.1:tl-ae U3-5stage_timestep 0.345
remap_update.1:tl-ae advance_hypervis_dp 0.324
remap_update.1:tl-at prim_advec_tracers_remap_RK2 1.596
remap_update.1:tl-sc vertical_remap 0.141
remap_update.2:prim_main_loop 2.460
remap_update.2:tl-ae U3-5stage_timestep 0.344
remap_update.2:tl-ae advance_hypervis_dp 0.324
remap_update.2:tl-at prim_advec_tracers_remap_RK2 1.594
remap_update.2:tl-sc vertical_remap 0.141
remap_update.3:prim_main_loop 2.481
remap_update.3:tl-ae U3-5stage_timestep 0.346
remap_update.3:tl-ae advance_hypervis_dp 0.326
remap_update.3:tl-at prim_advec_tracers_remap_RK2 1.597
remap_update.3:tl-sc vertical_remap 0.141
remap_update.4:prim_main_loop 2.463
remap_update.4:tl-ae U3-5stage_timestep 0.345
remap_update.4:tl-ae advance_hypervis_dp 0.325
remap_update.4:tl-at prim_advec_tracers_remap_RK2 1.596
remap_update.4:tl-sc vertical_remap 0.141
remap_update.5:prim_main_loop 2.478
remap_update.5:tl-ae U3-5stage_timestep 0.345
remap_update.5:tl-ae advance_hypervis_dp 0.325
remap_update.5:tl-at prim_advec_tracers_remap_RK2 1.596
remap_update.5:tl-sc vertical_remap 0.141
So performance looks unchanged between master and remap_update, on both Skylake and the P100.
This passes all tests on Skylake, so I think this is ready to merge.
Please excuse another possibly irrelevant, out-of-context comment, but in E3SM/CESM proper, explicit typing of constants was at one time a requirement, and switching precision was implemented by redefining model-specific types. An example from shr_reprosum_mod.F90 follows. (I can't find an example that switches from single- to double-precision floating point at the moment.)
! Workaround for when shr_kind_i8 is not supported.
use shr_kind_mod, only: r8 => shr_kind_r8, i8 => shr_kind_i4
use shr_kind_mod, only: r8 => shr_kind_r8, i8 => shr_kind_i8
...
use_ddpdd_sum = use_ddpdd_sum .or. (radix(0._r8) /= radix(0_i8))
...
Pat
On 4/25/18 2:56 PM, onguba wrote:
@oksanaguba commented on this pull request.
In components/homme/src/share/vertremap_mod_base.F90 https://github.com/E3SM-Project/HOMMEXX/pull/294#discussion_r184171344:
rslt( 4,j) = dx(j) / ( dx(j) + dx(j+1) )
- rslt( 5,j) = 1. / sum( dx(j-1:j+2) )
- rslt( 6,j) = ( 2. * dx(j+1) * dx(j) ) / ( dx(j) + dx(j+1 ) )
- rslt( 7,j) = ( dx(j-1) + dx(j ) ) / ( 2. * dx(j ) + dx(j+1) )
- rslt( 8,j) = ( dx(j+2) + dx(j+1) ) / ( 2. * dx(j+1) + dx(j ) )
- rslt( 9,j) = dx(j ) * ( dx(j-1) + dx(j ) ) / ( 2. * dx(j ) + dx(j+1) )
- rslt(10,j) = dx(j+1) * ( dx(j+1) + dx(j+2) ) / ( dx(j ) + 2. * dx(j+1) )
+ rslt( 5,j) = 1.D0 / sum( dx(j-1:j+2) )
I was told not to use the .D format. Sometimes models are run in single precision, and 2. will be converted to double or single as needed.
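For illustration only, here is a minimal sketch of the kind-parameter approach discussed in this thread, where constants carry a model-defined kind suffix instead of a hard-coded D exponent. The module name kinds_sketch, the parameter rk, and the sample dx values are hypothetical and are not taken from HOMMEXX or shr_kind_mod.

```fortran
! Hypothetical sketch: none of these names come from HOMMEXX or E3SM.
module kinds_sketch
  implicit none
  ! Assumed single switch point for precision: change 12 to 6 to request
  ! a single-precision kind on typical compilers.
  integer, parameter :: rk = selected_real_kind(12)
end module kinds_sketch

program literal_kinds
  use kinds_sketch, only: rk
  implicit none
  real(rk) :: dx(4), a, b, c

  dx = [0.1_rk, 0.2_rk, 0.3_rk, 0.4_rk]

  a = 1.     / sum(dx)   ! default-real literal; converted to dx's kind if that is wider
  b = 1.D0   / sum(dx)   ! always a double precision literal, whatever rk is
  c = 1.0_rk / sum(dx)   ! literal follows rk

  print *, a, b, c

  ! 0.1 is not exactly representable in binary, so here the kind of the
  ! literal can change the value, unlike 1. or 2.
  print *, real(0.1, rk) == 0.1_rk
end program literal_kinds
```

Because 1. and 2. are exactly representable in single precision, widening them to double in a mixed expression is exact, so the three spellings should give the same values for the lines in this diff when dx is double precision. The kind suffix matters mainly for constants such as 0.1, and for keeping one place where the working precision is chosen.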
This implements the new vertical remap boundary conditions from E3SM's version of Homme. It goes a bit ahead of Homme by completely removing the old compute_ppm_grids subroutine from the Fortran, which is supposed to be a non-BFB (not bit-for-bit) change. No performance comparisons yet, but I'm not expecting any change. Fixes #279