EarthWorksOrg / EarthWorks


Performance and Value of PCOLS for Earthworks GPU physics #56

Closed supreethms1809 closed 1 month ago

supreethms1809 commented 2 months ago

The drastic effect of Physics Columns (PCOLS) on GPU performance

Additional details: PCOLS sets the number of columns an MPI rank processes during the run. When running multiple physics on GPUs, set PCOLS to a larger value.

Resolutions affected: all supported resolution/level combinations that use a combination of GPUs, cam_dev physics, and rrtmgp_gpu radiation.

Workaround: Change PCOLS using xmlchange during the setup of a case. For example, for a case just created, use this command to request rrtmgp_gpu and set a valid PCOLS value:

./xmlchange --append CAM_CONFIG_OPTS="-rad rrtmgp_gpu -pcols 2048"

NOTE: 2048 is the largest PCOLS value we can use on Derecho with NVHPC. Any value greater than 2048 causes a build error. Values below 2048 result in worse performance in our test cases.
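To make the workaround easier to repeat with different PCOLS values, here is a minimal sketch of a shell helper that builds the option string passed to xmlchange. The function name cam_gpu_opts is hypothetical and not part of CIME or EarthWorks. It only checks that the value is a positive integer and deliberately does not enforce an upper bound, since the usable maximum depends on the compiler flags in use.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of CIME or EarthWorks):
# prints the CAM_CONFIG_OPTS fragment that requests rrtmgp_gpu radiation
# and the given PCOLS value.
cam_gpu_opts() {
    local pcols="$1"

    # Reject empty input and anything containing a non-digit character.
    case "$pcols" in
        ''|*[!0-9]*)
            echo "pcols must be a positive integer, got: '${pcols}'" >&2
            return 1
            ;;
    esac

    # Reject zero.
    if [ "$pcols" -lt 1 ]; then
        echo "pcols must be at least 1" >&2
        return 1
    fi

    printf '%s\n' "-rad rrtmgp_gpu -pcols ${pcols}"
}
```

From a freshly created case directory, this could be used as ./xmlchange --append CAM_CONFIG_OPTS="$(cam_gpu_opts 2048)", which expands to the same command shown in the workaround.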

gdicker1 commented 2 months ago

This is labeled as "bug" since setting PCOLS too high (somewhere above 2048) causes the GPU builds to fail.

jedwards4b commented 2 months ago

You need to add the flag -mcmodel=medium in ccs_config/machines/cmake_macros/

+string(APPEND FFLAGS " -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none -mcmodel=medium ")
+
+string(APPEND LDFLAGS "-mcmodel=medium")

For Intel, add it in intel.cmake; for GNU, add it in gnu.cmake. I have confirmed that this solves the link issue for both compilers.
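Since the flag has to appear in both FFLAGS and LDFLAGS, a quick check of the edited macro file can catch a missed spot before rebuilding. The following is a rough sketch only: check_mcmodel is a hypothetical helper, and it assumes each flag is added through a single-line string(APPEND ...) call like the ones in the diff above. It will not recognize flags set in other ways.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of ccs_config or CIME):
# succeeds only if FILE appends -mcmodel=medium to both FFLAGS and LDFLAGS
# using single-line string(APPEND <VAR> "...") calls.
check_mcmodel() {
    local file="$1" var missing=0

    for var in FFLAGS LDFLAGS; do
        # Match lines such as: string(APPEND FFLAGS " ... -mcmodel=medium ")
        if ! grep -Eq "string\([[:space:]]*APPEND[[:space:]]+${var}[[:space:]].*-mcmodel=medium" "$file"; then
            echo "${file}: -mcmodel=medium not appended to ${var}" >&2
            missing=1
        fi
    done

    return "$missing"
}
```

For example, check_mcmodel gnu.cmake run from the cmake_macros directory returns a nonzero status and names the variable that is missing the flag.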

supreethms1809 commented 2 months ago

@jedwards4b Thanks for this information. Setting PCOLS is mainly needed for the nvhpc compiler when we are running on GPUs. The flag corresponding to -mcmodel=medium in the nvhpc compiler is -Mlarge_arrays. I tried appending this flag, but I still ran into the same issue.

jedwards4b commented 2 months ago

I suspect that you did not add it in both places (FFLAGS and LDFLAGS) as I did above. I'll give it a try.

supreethms1809 commented 2 months ago

I added them in both FFLAGS and LDFLAGS, but in a different way than you showed above. Maybe that was the issue. Let me retry.

briandobbins commented 2 months ago

Supreeth, it looks like it's actually '-mcmodel=medium' even for NVHPC now, at least in newer versions of the compiler: https://docs.nvidia.com/hpc-sdk//compilers/hpc-compilers-user-guide/index.html#freq-used-options

I've tested this with NVHPC as well on PCOLS=4096.

supreethms1809 commented 2 months ago

@jedwards4b @briandobbins Adding -mcmodel=medium worked for me. I am able to increase the pcols and set it to the correct number needed for GPU execution. Now we only have 1 radiation call/timestep.

gdicker1 commented 1 month ago

Fixed by #59