Enable GPU execution of atm_recover_large_step_variables via OpenACC

MPAS-Dev / MPAS-Model

Repository for MPAS models and shared framework releases.

238 stars 317 forks source link

Enable GPU execution of atm_recover_large_step_variables via OpenACC #1220

Open gdicker1 opened 1 month ago

gdicker1 commented 1 month ago

This PR enables GPU calculation of post-acoustic step variable reconstitution by adding OpenACC directives to the atm_recover_large_step_variables_work routine.

Timing information for the OpenACC data transfers in this routine is captured in the log file by a new timer: atm_recover_large_step_variables [ACC_data_xfer]

Invariant fields used in the work routine are copied in during mpas_atm_dynamics_init and deleted in mpas_atm_dynamics_finalize by building off the fields already handled in previous PRs.

gdicker1 commented 1 month ago

NOTE: the changes in this PR slightly change answers. The changes seem to come from a couple of loops. In my regional test case, if the calculations of rho_p, rho_zz, rtheta_p, exner, and pressure_p are done on the CPU instead, no differences occur.

First this loop

         do k = 1, nVertLevels
            rho_p(k,iCell) = rho_p_save(k,iCell) + rho_pp(k,iCell)
            rho_zz(k,iCell) = rho_p(k,iCell) + rho_base(k,iCell)
         end do

And in this loop

            do k = 1, nVertLevels
               rtheta_p(k,iCell) = rtheta_p_save(k,iCell) + rtheta_pp(k,iCell) &
                                 - dt * rho_zz(k,iCell) * rt_diabatic_tend(k,iCell)
               theta_m(k,iCell) = (rtheta_p(k,iCell) + rtheta_base(k,iCell))/rho_zz(k,iCell)
               exner(k,iCell) = (zz(k,iCell)*(rgas/p0)*(rtheta_p(k,iCell)+rtheta_base(k,iCell)))**rcv
               ! pressure_p is perturbation pressure
               pressure_p(k,iCell) = zz(k,iCell) * rgas * (exner(k,iCell)*rtheta_p(k,iCell)+rtheta_base(k,iCell)  &
                                                          * (exner(k,iCell)-exner_base(k,iCell)))
            end do

mgduda commented 1 month ago

Although there isn't anything in particular that looks suspect to me, it does seem like the differences in results after the first timestep of a 12-km regional simulation comparing with the current develop branch are a bit larger than I would have expected (based on the differences we see before and after the merge of #1216).

I also get a core dump at the end of the simulation on Derecho when using the following modules:

module --force purge
ml ncarenv/23.09    
ml craype/2.7.23    
ml nvhpc/24.3
ml ncarcompilers/1.0.0
ml cray-mpich/8.1.27
ml parallel-netcdf/1.12.3
ml cuda/12.2.1

It may be worth taking a second look at the changes in this PR to see if we can find anything that's subtly off.

gdicker1 commented 2 weeks ago

@mgduda, this is ready for another round of review!

There are slight answer differences, but isolated just to exner and pressure_p from what I can tell. No answer differences occur if the code below is added after the loop to calculate rtheta_p, theta_m, exner, and pressure_p (near line 2892 in this PR).

         ! Calculate exner and pressure_p on CPU
         !$acc update host(rtheta_p)
         do iCell=cellStart,cellEnd
            do k = 1, nVertLevels
               exner(k,iCell) = (zz(k,iCell)*(rgas/p0)*(rtheta_p(k,iCell)+rtheta_base(k,iCell)))**rcv
               ! pressure_p is perturbation pressure
               pressure_p(k,iCell) = zz(k,iCell) * rgas * (exner(k,iCell)*rtheta_p(k,iCell)+rtheta_base(k,iCell)  &
                                                          * (exner(k,iCell)-exner_base(k,iCell)))
            end do
         end do
         !$acc update device(exner,pressure_p)