ESCOMP / PUMAS

Parameterization for Unified Microphysics Across Scales

Add regression test suite for GPU-enabled PUMAS run on Casper #31

Closed sjsprecious closed 2 years ago

sjsprecious commented 2 years ago

As suggested by @johnmauff, we need to add a regression test suite for the GPU-enabled PUMAS code so that we can better maintain its GPU compatibility and catch changes in the CAM code (less likely) or the CIME code (more likely) that may break the GPU run.

Based on the discussion during the AMP SE WG meeting on 09/28, we will probably focus for now on a GPU test suite for the PUMAS GitHub repo only. A test suite configuration will be added to https://github.com/PUMASDevelopment/CAM, and the regression test will be run whenever the PUMAS code is revised.

It was also suggested during the meeting that we should run a regression test for each individual CAM parameterization once it is ported to GPU. These GPU test suites could later be integrated into the standard CAM test suite once Derecho, which has both CPUs and GPUs available on the same machine, is online.

andrewgettelman commented 2 years ago

@sjsprecious, @johnmauff indicated that GPU testing is currently broken for CAM. Is that the case? If so, maybe an issue needs to be raised in the ESCOMP/CAM repository.

I am a bit nervous about where to put the testing for GPUs (or for PUMAS in general). I'm not exactly sure of the flow between ESCOMP/PUMAS, PUMASDevelopment/CAM, and ESCOMP/CAM. We might want a more general solution that works for any parameterization?

sjsprecious commented 2 years ago

Thanks @andrewgettelman for your comment. The current GPU testing on Casper works fine for CAM. But we did realize that the testing could fail if we updated the nvhpc compiler and CUDA module versions in CIME. This is one of our motivations for pushing regression testing of the GPU run, so that such changes can be caught easily.

I have already added a test suite for Casper in ESCOMP/CAM (https://github.com/ESCOMP/CAM/blob/cam_development/cime_config/testdefs/testlist_cam.xml), but it is not in PUMASDevelopment/CAM yet. I am not sure whether this looks like a general solution to you and others.

sjsprecious commented 2 years ago

Sorry @andrewgettelman, I just realized that my previous statement was incorrect. GPU testing has been broken since PUMAS v1.17, when PPE and implicit sedimentation were introduced without being GPU-enabled. I am now working on porting that code to GPU.

andrewgettelman commented 2 years ago

Hi @sjsprecious, so does that mean the issue is not CIME but just the new PUMAS code? That would be a relief to know. If it is CIME, then we definitely should figure out a way to do better testing for GPUs. But it would also be good to figure out GPU testing so we know when we have broken things. It might enable other PUMAS developers like me to try to implement GPU directives in the code (usually copy and paste from a similar part of the code works...)

sjsprecious commented 2 years ago

Hi @andrewgettelman, yes, you are right. The current issue comes only from the new PUMAS code that is not GPU-enabled, and I am working on it now.

Regarding the CIME code, we have already found that updating the nvhpc compiler to nvhpc/21.7 and using some advanced compiler flags on Casper breaks the GPU test, even with a GPU-enabled PUMAS version such as v1.16. We are currently working with NVIDIA colleagues to check whether an upcoming nvhpc release will resolve this issue. But we definitely need to be cautious if someone updates the CIME code related to the Casper or Derecho (GPU) configuration in the future.

It would be extremely helpful for maintaining the GPU-enabled PUMAS code if PUMAS developers could add GPU directives as they develop new code. Feel free to reach out to us if you or others have any questions.

andrewgettelman commented 2 years ago

Hi @sjsprecious. Thanks for clarifying. I think we do need to get this test stood up for CIME updates, and probably elevate it to the CESM level.

Regarding adding code directives during development: we probably need a GPU test for PUMAS to know when we break stuff. I can try to copy directives when I add new loops or variables, but we don't yet have the skills to really maintain them or to know what we are doing. This will be an ongoing issue.

Not sure how we solve it. But probably need to raise it again at the AMP development meeting.

sjsprecious commented 2 years ago

A similar issue has been opened at https://github.com/ESCOMP/CAM/issues/512, and a PR to address it has been opened at https://github.com/ESCOMP/CAM/pull/577.