amrvac / AGILE-experimental

MPI-AMRVAC: A Parallel Adaptive Mesh Refinement Framework
https://amrvac.org/
GNU General Public License v3.0

experimentation with pragmas #12

Open oporth opened 1 month ago

oporth commented 1 month ago

Add acc pragmas to KH test case and profile.

oporth commented 4 weeks ago

What I've tried so far: in branch acc-exp (https://github.com/amrvac/AGILE-experimental/tree/acc-exp) I put a gang loop around the call to advect1_grid:

    !$acc parallel loop gang private(block,{dxlevel(^D)})
    do iigrid=1,igridstail_active; igrid=igrids_active(iigrid);
       ! per-gang pointer to the current block and its grid spacing
       block=>ps(igrid)
       ^D&dxlevel(^D)=rnode(rpdx^D_,igrid);

       ! the whole call tree below advect1_grid executes on the device
       call advect1_grid(method(block%level),qdt,dtfactor,ixG^LL,idim^LIM,&
         qtC,psa(igrid),qt,psb(igrid),fC,fE,rnode(rpdx1_:rnodehi,igrid),ps(igrid)%x)
       ! ... (rest of the loop body and the closing "end do" omitted)

(Sorry, there are many debug prints in the current commit.)
Thus everything called from within advect1_grid (e.g. finite_volume, the reconstruction, the fluxes, the whole hydro update) runs as a GPU kernel, and the GPU works on several blocks in parallel, just like we did with the OpenMP parallelization.
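
For this to work, the whole call tree below advect1_grid also has to be compiled for the device; in OpenACC that is done with a routine directive on every callee. A minimal self-contained sketch of the pattern (generic names, not the AGILE interfaces):

    module gang_demo
      implicit none
    contains
      subroutine update_block(n, w)
        ! every routine reached from inside a gang loop needs a routine directive
        !$acc routine seq
        integer, intent(in) :: n
        double precision, intent(inout) :: w(n)
        integer :: i
        do i = 1, n
           w(i) = 2.0d0*w(i)
        end do
      end subroutine update_block

      subroutine sweep(nb, n, w)
        integer, intent(in) :: nb, n
        double precision, intent(inout) :: w(n, nb)
        integer :: ib
        !$acc parallel loop gang copy(w)
        do ib = 1, nb
           ! the whole call tree inside this loop executes on the device
           call update_block(n, w(:, ib))
        end do
      end subroutine sweep
    end module gang_demo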

Since a lot happens downstream of advect1_grid, this is a very fat kernel. I ran into issues when subroutines allocate temporary data and had to increase the heap size to ~4 GB. That fixed subroutine finite_volume, but the code now fails to allocate memory for automatic arrays in reconstructLR, one subroutine further down the call stack, regardless of stack and heap size.
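
For what it's worth, the automatic arrays are exactly what drives the device heap usage: every device-side call allocates its own temporaries from the heap at routine entry. A minimal sketch of the pattern I mean (illustrative names, not the actual reconstructLR interface):

    subroutine reconstruct_like(imin, imax, jmin, jmax, w)
      !$acc routine seq
      integer, intent(in)             :: imin, imax, jmin, jmax
      double precision, intent(inout) :: w(imin:imax, jmin:jmax)
      ! automatic array: allocated from the device heap on every call,
      ! one copy per calling gang, which exhausts the heap even when the
      ! per-call size looks modest on the host
      double precision :: tmp(imin:imax, jmin:jmax)
      tmp = w
      w   = 0.5d0*(tmp + w)
    end subroutine reconstruct_like

The usual workarounds are fixed-size local arrays, or workspace that is allocated once outside the gang loop and passed in, rather than relying on an ever bigger heap.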

While this worked very well for OpenMP, I'm beginning to understand that it might not be the way to go for GPU kernels... I might play with it a little more, but I'm not overly optimistic.

oporth commented 2 weeks ago

Posting an update here. My second attempt was to move the kernels to the leaves of the call stack, e.g.

subroutine get_Riemann_flux_hll_gpu(iws,iwe)
      integer, intent(in) :: iws,iwe
      integer :: ix1,ix2
      ! iw, ii, idims and the flux/speed arrays (fC, fLC, fRC, wLC, wRC,
      ! cminC, cmaxC) come from the enclosing host scope

      !$acc kernels present(fC)
      do iw=iws,iwe
         do ix2=ixCmin2,ixCmax2
         do ix1=ixCmin1,ixCmax1
         if(cminC(ix1,ix2,ii) >= zero) then
            fC(ix1,ix2,iw,idims)=fLC(ix1,ix2,iw)
         else if(cmaxC(ix1,ix2,ii) <= zero) then
            fC(ix1,ix2,iw,idims)=fRC(ix1,ix2,iw)
         else
            ! Add hll dissipation to the flux
            fC(ix1,ix2,iw,idims)=(cmaxC(ix1,ix2,ii)*fLC(ix1,ix2, iw)-cminC(ix1,&
               ix2,ii)*fRC(ix1,ix2,iw)+cminC(ix1,ix2,ii)*cmaxC(ix1,ix2,&
               ii)*(wRC(ix1,ix2,iw)-wLC(ix1,ix2,iw)))/(cmaxC(ix1,ix2,&
               ii)-cminC(ix1,ix2,ii))
         end if
         end do
         end do
      end do
      !$acc end kernels

    end subroutine get_Riemann_flux_hll_gpu

This is the HLL update, which turns out to be the most costly kernel. Happy to report that all array calculations within advect1_grid (for cada3 and hll) now happen on the GPU.
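
For context, the present(fC) clause only asserts that fC already lives on the device; the arrays have to be placed there once, outside the leaf kernels, e.g. with a data region around the sweep. Directive placement and clause choice below are illustrative only, not taken from the branch:

    ! sketch: put the work arrays on the device once per sweep, so the
    ! leaf kernels (hll flux, limiter, ...) only touch device-resident data
    !$acc data copyin(wLC, wRC, fLC, fRC, cminC, cmaxC) copy(fC)
    call get_Riemann_flux_hll_gpu(iws, iwe)
    !$acc end data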

There is still some data transfer happening for reasons I don't understand. According to Nsight Systems, this kernel from mod_limiter (the cada3 limiter)

         !$acc kernels present(rdw, dwC, tmp, ldwA, ldwB, tmp2)
          tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=dwC(ixOmin1:ixOmax1,&
             ixOmin2:ixOmax2)/(dwC(hxOmin1:hxOmax1,hxOmin2:hxOmax2) + sign(eps,&
              dwC(hxOmin1:hxOmax1,hxOmin2:hxOmax2)))
          ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=(two+tmp(ixOmin1:ixOmax1,&
             ixOmin2:ixOmax2))*third
          where(tmpeta(ixOmin1:ixOmax1,ixOmin2:ixOmax2)<=one-cadepsilon)
             rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=ldwA(ixOmin1:ixOmax1,&
                ixOmin2:ixOmax2)
          elsewhere(tmpeta(ixOmin1:ixOmax1,ixOmin2:ixOmax2)>=one+cadepsilon)
             ldwB(ixOmin1:ixOmax1,ixOmin2:ixOmax2)= max(zero,&
                min(ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
                 max(-cadalfa*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
                 min(cadbeta*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
                 ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2), cadgamma))))
             rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=ldwB(ixOmin1:ixOmax1,&
                ixOmin2:ixOmax2)
          elsewhere
             ldwB(ixOmin1:ixOmax1,ixOmin2:ixOmax2)= max(zero,&
                min(ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
                 max(-cadalfa*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
                 min(cadbeta*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
                 ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2), cadgamma))))
             tmp2(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=(tmpeta(ixOmin1:ixOmax1,&
                ixOmin2:ixOmax2)-one)*invcadepsilon
             rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=half*( &
                (one-tmp2(ixOmin1:ixOmax1,&
                ixOmin2:ixOmax2))*ldwA(ixOmin1:ixOmax1,&
                ixOmin2:ixOmax2) +(one+tmp2(ixOmin1:ixOmax1,&
                ixOmin2:ixOmax2))*ldwB(ixOmin1:ixOmax1,ixOmin2:ixOmax2))
          endwhere
          rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=rdw(ixOmin1:ixOmax1,&
             ixOmin2:ixOmax2) * dwC(hxOmin1:hxOmax1,hxOmin2:hxOmax2)
          !$acc end kernels

transfers ldwB back and forth between host and device.
Furthermore, array bounds such as hxOmin2 are computed on the host and transferred to the GPU; I'm not sure whether that is an issue.
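
One thing that might be behind the ldwB traffic (an assumption, not a confirmed diagnosis): the where/elsewhere construct and the full-section array assignments can make the compiler introduce temporaries and split the kernels region into several kernels, and tmpeta is not listed in the present clause either. A sketch of the same cada3 logic as one explicit loop nest with scalar temporaries, which needs no ldwA/ldwB/tmp/tmp2 work arrays on the device at all (the _s scalars and the constant ixO-to-hxO offsets dix1/dix2 are introduced here purely for illustration):

    double precision :: tmp_s, ldwA_s, ldwB_s, tmp2_s, rdw_s
    integer          :: ix1, ix2, dix1, dix2

    ! constant offset between the ixO and hxO ranges (same extent, shifted)
    dix1 = ixOmin1 - hxOmin1
    dix2 = ixOmin2 - hxOmin2

    !$acc parallel loop collapse(2) present(rdw, dwC, tmpeta) &
    !$acc& private(tmp_s, ldwA_s, ldwB_s, tmp2_s, rdw_s)
    do ix2 = ixOmin2, ixOmax2
       do ix1 = ixOmin1, ixOmax1
          tmp_s  = dwC(ix1,ix2) / (dwC(ix1-dix1,ix2-dix2) &
                   + sign(eps, dwC(ix1-dix1,ix2-dix2)))
          ldwA_s = (two + tmp_s)*third
          ldwB_s = max(zero, min(ldwA_s, max(-cadalfa*tmp_s, &
                   min(cadbeta*tmp_s, ldwA_s, cadgamma))))
          if (tmpeta(ix1,ix2) <= one - cadepsilon) then
             rdw_s = ldwA_s
          else if (tmpeta(ix1,ix2) >= one + cadepsilon) then
             rdw_s = ldwB_s
          else
             tmp2_s = (tmpeta(ix1,ix2) - one)*invcadepsilon
             rdw_s  = half*((one - tmp2_s)*ldwA_s + (one + tmp2_s)*ldwB_s)
          end if
          rdw(ix1,ix2) = rdw_s * dwC(ix1-dix1,ix2-dix2)
       end do
    end do

Here the scalar temporaries are explicitly privatized on the loop directive, so apart from rdw, dwC and tmpeta nothing needs to be resident on the device.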

As an optimistic scenario I benchmarked the 4096^2 KH test with a single block. On my Titan V I get 7.5e6 cell updates per second; on a single Skylake 6700K core I get 1.2e6, so the GPU runs about 6x faster :-) All things considered, I think that's not a bad start.