Open oporth opened 1 month ago
So what I tried so far: In branch acc-exp https://github.com/amrvac/AGILE-experimental/tree/acc-exp , I put a gang-loop around the call to advect1_grid
!$acc parallel loop gang private(block,{dxlevel(^D)})
do iigrid=1,igridstail_active; igrid=igrids_active(iigrid);
block=>ps(igrid)
^D&dxlevel(^D)=rnode(rpdx^D_,igrid);
call advect1_grid(method(block%level),qdt,dtfactor,ixG^LL,idim^LIM,&
qtC,psa(igrid),qt,psb(igrid),fC,fE,rnode(rpdx1_:rnodehi,igrid),ps(igrid)%x)
(sorry there are many debug prints in the current commit).
Thus everything called from within advect1_grid (e.g. finite_volume, reconstruction, fluxes, all the hydro update) is ran as GPU kernel and the GPU will work on several blocks in parallel, just like we did with the openMP parallelization.
Since a lot is happening downstream of advect1_grid, I guess this is a very fat kernel. I face issues when subroutines allocate temporary data and I needed to increase the heapsize to ~4GB. This fixed it for subroutine finite_volume, but it fails to allocate memory for automatic arrays now at reconstructLR which is one subroutine downstream (regardless of stack and heapsize).
While this worked very well for openMP, I'm beginning to understand that this might not be the way for GPU kernels... I might play with this a little more, but I'm not extremely optimistic.
Posting an update here. So my second try was moving the kernels to the end of the call stack, e.g.
subroutine get_Riemann_flux_hll_gpu(iws,iwe)
integer, intent(in) :: iws,iwe
integer :: ix1,ix2
!$acc kernels present(fC)
do iw=iws,iwe
do ix2=ixCmin2,ixCmax2
do ix1=ixCmin1,ixCmax1
if(cminC(ix1,ix2,ii) >= zero) then
fC(ix1,ix2,iw,idims)=fLC(ix1,ix2,iw)
else if(cmaxC(ix1,ix2,ii) <= zero) then
fC(ix1,ix2,iw,idims)=fRC(ix1,ix2,iw)
else
! Add hll dissipation to the flux
fC(ix1,ix2,iw,idims)=(cmaxC(ix1,ix2,ii)*fLC(ix1,ix2, iw)-cminC(ix1,&
ix2,ii)*fRC(ix1,ix2,iw)+cminC(ix1,ix2,ii)*cmaxC(ix1,ix2,&
ii)*(wRC(ix1,ix2,iw)-wLC(ix1,ix2,iw)))/(cmaxC(ix1,ix2,&
ii)-cminC(ix1,ix2,ii))
end if
end do
end do
end do
!$acc end kernels
end subroutine get_Riemann_flux_hll_gpu
This is the hll update, which turns out to be the most costly kernel. Happy to report that now all array calculations within advect1_grid (for cada3 and hll) happen on the gpu.
There is still some data transfer happening for reasons I don't understand. According to nsight systems, this kernel from mod_limiter (the cada3 limiter)
!$acc kernels present(rdw, dwC, tmp, ldwA, ldwB, tmp2)
tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=dwC(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2)/(dwC(hxOmin1:hxOmax1,hxOmin2:hxOmax2) + sign(eps,&
dwC(hxOmin1:hxOmax1,hxOmin2:hxOmax2)))
ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=(two+tmp(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2))*third
where(tmpeta(ixOmin1:ixOmax1,ixOmin2:ixOmax2)<=one-cadepsilon)
rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=ldwA(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2)
elsewhere(tmpeta(ixOmin1:ixOmax1,ixOmin2:ixOmax2)>=one+cadepsilon)
ldwB(ixOmin1:ixOmax1,ixOmin2:ixOmax2)= max(zero,&
min(ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
max(-cadalfa*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
min(cadbeta*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2), cadgamma))))
rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=ldwB(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2)
elsewhere
ldwB(ixOmin1:ixOmax1,ixOmin2:ixOmax2)= max(zero,&
min(ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
max(-cadalfa*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
min(cadbeta*tmp(ixOmin1:ixOmax1,ixOmin2:ixOmax2),&
ldwA(ixOmin1:ixOmax1,ixOmin2:ixOmax2), cadgamma))))
tmp2(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=(tmpeta(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2)-one)*invcadepsilon
rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=half*( &
(one-tmp2(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2))*ldwA(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2) +(one+tmp2(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2))*ldwB(ixOmin1:ixOmax1,ixOmin2:ixOmax2))
endwhere
rdw(ixOmin1:ixOmax1,ixOmin2:ixOmax2)=rdw(ixOmin1:ixOmax1,&
ixOmin2:ixOmax2) * dwC(hxOmin1:hxOmax1,hxOmin2:hxOmax2)
!$acc end kernels
transfers ldwB back and forth between host and device.
Furthermore, array ranges e.g. hxOmin2 are computed on host and transferred to the gpu, not sure if that is an issue.
As an optimistic scenario I've benchmarked the 4096^2 KH test with a single block. On my Titan V I get 7.5e6 cellupdates per second. On a single Skylake 6700K core I get 1.2e6. So the GPU runs 6x faster :-) I think all things considered that's not a bad start.
Add acc pragmas to KH test case and profile.