Optimize the maximum and minimum operations on the GPU

earth-system-radiation / rte-rrtmgp

RTE+RRTMGP is a set of codes for computing radiative fluxes in planetary atmospheres.

BSD 3-Clause "New" or "Revised" License

74 stars 65 forks source link

Optimize the maximum and minimum operations on the GPU #269

Closed sjsprecious closed 5 months ago

sjsprecious commented 5 months ago

According to my tests of CAM+RRTMGP on NVIDIA A100 GPU, the OpenACC code runs slowly due to the current implementation of maximum and minimum operations in https://github.com/earth-system-radiation/rte-rrtmgp/blob/develop/rte-frontend/mo_rte_util_array_validation.F90 (see an example below):

!$acc kernels copyin(array)
minValue = minval(array)
maxValue = maxval(array)
!$acc end kernels

Nsight profiling result shows that the implementation above leads to one threadblock and one thread per threadblock only on the GPU, which basically runs in serial and is not good for GPU performance.

These max and min operations could be revised into a parallelized version that utilizes the GPU resource more efficiently. These changes improve the performance of my new CAM+RRTMGP GPU runs.

I did not test the OpenMP offload code in CAM+RRTMGP though and just revised them by copying the directives from similar existing examples.

RobertPincus commented 5 months ago

Paging @alexeedm ... any thoughts about this?

alexeedm commented 5 months ago

@sjsprecious I think for the production runs you anyway want to turn off all the verification. Set these values to .false. before calling any RTE-RRTMGP routines: https://github.com/earth-system-radiation/rte-rrtmgp/blob/1e087216d3d31f608e33b17491e2334ede8c5d5a/rte-frontend/mo_rte_config.F90#L25-L26

I'll have to check why our compiler doesn't produce good kernels in this case, it might be because two function calls are combined in a single kernels region. We'll have to fix it in the compiler

Robert, in case you want to just switch to explicit loops like suggested, we could simply remove ifdef _OPENMP and add OpenACC pragmas on top of OMP ones. Right now ifdef blocks in the PR contain exactly the same code, which doesn't seem ideal :)

sjsprecious commented 5 months ago

Hi @alexeedm , thanks for your clarification. If those subroutines are only called for verification purpose and should not be turned on during a production run, I am happy to do what you suggest and leave the code in the current format. @RobertPincus , could you confirm this?

In addition, since I am not familiar with the RTE-RRTMGP code, can you or Robert tell me what other settings should be turned on/off for a production run?

RobertPincus commented 5 months ago

@sjsprecious Dmitry is suggesting to call rte_config_checks(.false.) for production runs, which isn't a bad idea and makes the code changes moot. It would mean that array extents would not be checked at run time, nor would values be checked for correctness.

If not and you want to proceed with the PR, please revise it as @alexeedm suggests to stack the OpenMP and OpenACC directives and avoid the IFDEF

sjsprecious commented 5 months ago

Hi @RobertPincus , thanks for your quick reply and explanation. In this case, I am fine to go ahead with the current code and do what you and Dmitry suggested. I now do not see a strong motivation to proceed with this PR if there will be a compiler fix in the future.

Could you advise if there are any other settings I should turn on/off in RTE-RRTMGP for a production run?

RobertPincus commented 5 months ago

@sjsprecious Perhaps @alexeedm has thoughts but there aren't other obvious changes to make on the RTE or RRTMGP side. Note that GEOS5 sees some advantage to running in single precision.

I expect you will find that managing the memory in the code that couples CESM to RTE+RRTMGP to ensure that all calculations occur on the GPU with transferring fields back and forth to the CPU between steps will improve your performance by a lot.

If you don't want to pursue the PR please mark this as closed.

sjsprecious commented 5 months ago

Thanks @RobertPincus for your suggestions. Managing the data transfer in the CESM interface to RTE+RRTMGP is exactly what we plan to do next and hope we could see a GPU performance boost later.

I will close this PR as suggested but @alexeedm feels free to leave what other suggestions you have in mind.