MPAS-Dev / MPAS-Model

Repository for MPAS models and shared framework releases.
238 stars 317 forks source link

Enable GPU exection of atm_rk_integration_setup via OpenACC #1223

Open abishekg7 opened 1 month ago

abishekg7 commented 1 month ago

This PR enables the GPU execution of atm_rk_integration_setup subroutine

An initial OpenACC port of the array assignments in atm_rk_integration_setup. Needed to split up ACC loops with gang vector collapse(2) into separate ACC loop statements to achieve correct results.

gdicker1 commented 5 days ago

@abishekg7 I tried the changes in this PR, everything works as expected. Though I wonder if we could still use collapse statements?

I think mixing the levels of parallelism and the collapse statements causes issues. I was able to get things to reproduce results by just using acc loop collapse(2) (collapse(3) in one place). From the best practices guide I also think it might be fine to do something like acc loop gang collapse(2) OR acc loop vector collapse(2). My preference is not to state the level of parallelism so the compiler gets it right for me.

mgduda commented 1 day ago

@gdicker1 I don't quite think I follow your comment. If my understanding is correct, we found that collapsing loops with different levels of parallelism leads to incorrect results, and the only place where we could collapse vector loops already has a collapse(2) clause. In which instances, specifically, would you suggest not explicitly stating the level of parallelism? Perhaps I'm distrustful of compilers, but it seems to me that being explicit about our intent as developers should lead to better outcomes than hoping the compiler will make good decisions.

mgduda commented 1 day ago

I think it might be worth taking a fresh look at the commit message

    initial OpenACC port of atm_rk_integration_setup

    - splitting up the loop gang vector collapse(2) into two separate loops, as it leads
      to erroneous results otherwise

and PR description. The text about splitting up the loop might not make sense to anyone who doesn't know the history of the porting of the atm_rk_integration_setup routine, and so that remark could either be omitted or rewritten. Perhaps we could consider it a true unless stated otherwise, but even still it might also be good to state in the commit message that there are no changes to results.

gdicker1 commented 1 day ago

@gdicker1 I don't quite think I follow your comment...

@mgduda, I agree that being proscriptive with the parallelism should be best practice.

With what I'd call a fully collapse-able loop (e.g. a 2-level loop, with code only in the innermost loop body, and not inside another loop) I am simply unsure what level of parallelism is assigned by the compiler when collapsed. So, that's where my preference to let "the compiler get it right for me" comes from - and it applies just to the fully collapse-able loops.

I think the following example would be fine:

      !$acc loop collapse(2)
      do iEdge = edgeStart,edgeEnd
         do k = 1,nVertLevels
            ru_save(k,iEdge) = ru(k,iEdge)
            u_2(k,iEdge) = u_1(k,iEdge)
         end do
      end do

( I also would have expected !$acc loop gang vector collapse(2) to be correct before this PR, since I think it matches with the build output I got for the code example above: "1663, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x" )