earth-system-radiation / rte-rrtmgp

RTE+RRTMGP is a set of codes for computing radiative fluxes in planetary atmospheres.
BSD 3-Clause "New" or "Revised" License

Inconsistent RRTMGP results on CPU with different code paths #211

Closed: sjsprecious closed this issue 1 year ago

sjsprecious commented 1 year ago

Hi,

I was running RRTMGP (tag v1.5) in CAM on Cheyenne (an NCAR supercomputer, CPU nodes only) and observed different results when I compiled with different RRTMGP code paths (same Intel compiler, compset, grid, and other settings).

To be specific, case 1 used the following code paths from this repo:

rrtmgp/kernels
rte/kernels

while case 2 used the following code paths from this repo:

rrtmgp/kernels-openacc
rte/kernels-openacc

Is the difference in the CPU results between these two cases expected? Or is there a compiler flag I should include in my CAM build?

RobertPincus commented 1 year ago

@sjsprecious The kernels and kernels-openacc directories are independent implementations. They are not guaranteed to be bitwise reproducible but are consistent to within a relative difference of roughly 10^-5 (see the various compare-to-reference.py scripts). This will naturally lead to simulations diverging modestly over time.

There is no advantage to using the OpenACC kernels if you're running on CPUs.

Please let me know if this answers your question.

sjsprecious commented 1 year ago

Hi @RobertPincus, thanks for your quick reply. I have run CAM+RRTMGP on GPUs using OpenACC, but the GPU result differs substantially from the CPU result. I have been trying to determine whether the issue comes from the code itself or from settings on my end, which is why I ran the two CPU cases described here, to isolate the effect of the source-code difference. Your explanation is very helpful and clear.

In addition, I am not yet applying any of the suggested compiler flags from Compiler-flags.md to my CAM build. Are they intended for correctness, efficiency, or other concerns? Thanks.

RobertPincus commented 1 year ago

@sjsprecious The RTE+RRTMGP library is tested each time it's modified using clear-sky and all-sky examples. The answers to these problems differ from reference results by less than 10^-5 in a relative sense; that applies both to the CPU and GPU implementations run on CPUs and to the GPU implementations run on GPUs.

If your instantaneous, single-time-step results from two configurations differ by more than a very small amount for some particular implementation, we would want to understand why.

The compiler flags are suggestions, not prescriptions.

sjsprecious commented 1 year ago

Thanks @RobertPincus for the additional information.

Here are the details about my simulation:

Model: CAM + RRTMGP
Compset: F2000climo
Grid: ne30
Compiler: intel/nvhpc for CPU and nvhpc for GPU
Simulation length: 2 days

By the end of the 2-day simulation, the global mean heating rate is 0.27041736993343682E-05 from the CPU simulation (intel compiler) and 0.22976834120723802E-05 from the GPU simulation.

If I run the CPU simulation with the nvhpc compiler, the global mean heating rate is 0.63468821552244069E-05. If I run the CPU simulation with the intel compiler and the code under kernels-openacc, it is 0.38185042809439744E-05.

All the relative differences here are much larger than 1e-5.

Thank you and let me know what you think or if you have any further questions.

RobertPincus commented 1 year ago

@sjsprecious The 10^-5 threshold I'm referring to applies to instantaneous differences given the same atmospheric state. My interpretation of your results is that the heating rate is quite sensitive to the precise configuration, since the model states have already diverged after just two days of simulation. The difference between the two code paths with the same compiler is 0.381/0.270 - 1, or about 40%; the difference between the Nvidia and Intel compilers is 0.634/0.270 - 1, or about 135%. In other words, changing compilers has an impact of roughly the same order of magnitude as changing code paths.
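For reference, a standalone sketch (not part of RTE+RRTMGP or CAM) that reproduces this arithmetic from the heating rates quoted above:

 program rel_diff_check
   ! Standalone sketch: relative differences between the global-mean
   ! heating rates quoted earlier in this thread.
   implicit none
   integer, parameter :: wp = selected_real_kind(15)
   real(wp), parameter :: hr_intel_cpu = 0.27041736993343682e-05_wp  ! intel, kernels
   real(wp), parameter :: hr_gpu       = 0.22976834120723802e-05_wp  ! nvhpc, GPU
   real(wp), parameter :: hr_nvhpc_cpu = 0.63468821552244069e-05_wp  ! nvhpc, kernels
   real(wp), parameter :: hr_intel_acc = 0.38185042809439744e-05_wp  ! intel, kernels-openacc

   print '(a, f6.1, a)', 'intel kernels-openacc vs intel kernels: ', &
         100._wp * (hr_intel_acc / hr_intel_cpu - 1._wp), ' %'   ! ~41 %
   print '(a, f6.1, a)', 'nvhpc kernels         vs intel kernels: ', &
         100._wp * (hr_nvhpc_cpu / hr_intel_cpu - 1._wp), ' %'   ! ~135 %
   print '(a, f6.1, a)', 'nvhpc GPU             vs intel kernels: ', &
         100._wp * (hr_gpu / hr_intel_cpu - 1._wp), ' %'         ! ~-15 %
 end program rel_diff_check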

Do you have a different interpretation?

sjsprecious commented 1 year ago

Thanks @RobertPincus for the additional clarifications and detailed analysis. They are very helpful, and your interpretation makes sense to me, though I am still surprised by the large difference caused by switching compilers on the CPU side.

The reason I am concerned about the difference here is that my GPU simulation fails the CAM ensemble consistency test (https://gmd.copernicus.org/articles/11/697/2018/gmd-11-697-2018.pdf), which means that changing the platform (from CPU to GPU) and compiler (from intel to nvhpc) likely causes a statistically significant change in the climatology. This did not happen when I ran other CAM components on the GPU (such as the PUMAS cloud microphysics). Therefore, I would like to make sure that I am using the RRTMGP code correctly and am aware of any special settings that should be included in the CAM build.

Before I perform a climatological run for further diagnosis, I wonder whether the RRTMGP code has any implementation details that could be compiler- or platform-dependent, such as random number generation or complex-number arithmetic. Thanks.

RobertPincus commented 1 year ago

@sjsprecious Is it possible to compute fluxes and heating rates in CESM on two platforms (e.g. Intel/CPU and Nvidia/GPU) for a single time step in which the conditions are guaranteed to be identical, i.e. without any time integration? We expect these to be quite similar (i.e. fluxes agreeing to within 10^-5 or so), but there is a lot in the CESM implementation that we don't test as part of RTE+RRTMGP.

sjsprecious commented 1 year ago

Thanks @RobertPincus for your suggestion. I am not sure if it is possible to compute those quantities on different platforms but with identical conditions. I will ask around for more information.

sjsprecious commented 1 year ago

Hi @RobertPincus, after taking a closer look at the source code, I realized that the GPU code contains a few if statements that do not exist in the corresponding CPU code.

For example, compare the delta_scale_2str_f_k subroutine in the GPU code (https://github.com/earth-system-radiation/rte-rrtmgp/blob/main/rte/kernels-openacc/mo_optical_props_kernels.F90#L73) and the CPU code (https://github.com/earth-system-radiation/rte-rrtmgp/blob/main/rte/kernels/mo_optical_props_kernels.F90#L62).

I think these if statements contribute to some extent to the divergence between the CPU and GPU answers. Is there a reason these if statements were only added to the GPU code? Would it be better to add them to the CPU code as well, to make the two code paths more consistent?

RobertPincus commented 1 year ago

In offline discussions with @sjsprecious we've identified one possible reason the CESM ensembles diverge between the CPU and GPU implementations, namely that the random number seeds used in the McICA treatment of clouds are determined by pressure, making the implementation very stiff (quick to diverge). @sjsprecious is checking whether this explains the issue.
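To illustrate why a state-dependent seed makes the ensembles so quick to diverge, here is a standalone sketch; the seeding formula is hypothetical and is not the CAM/RRTMGP McICA code:

 program seed_sensitivity
   ! Illustrative sketch only: if the random seed is derived from a state
   ! variable such as pressure, a perturbation in the last bits of that
   ! state selects a different random stream, so the sampled cloud states
   ! in two otherwise-close runs decorrelate immediately.
   implicit none
   integer, parameter :: wp = selected_real_kind(15)
   real(wp) :: pmid_a, pmid_b, r_a(4), r_b(4)
   integer :: n
   integer, allocatable :: seed(:)

   call random_seed(size=n)
   allocate(seed(n))

   pmid_a = 85000.0_wp                          ! layer pressure in run A [Pa]
   pmid_b = 85000.0_wp * (1.0_wp + 1.0e-12_wp)  ! run B, perturbed in the last bits

   seed = int(mod(pmid_a * 1.0e9_wp, 2147483647.0_wp))  ! hypothetical pressure-derived seed
   call random_seed(put=seed)
   call random_number(r_a)

   seed = int(mod(pmid_b * 1.0e9_wp, 2147483647.0_wp))
   call random_seed(put=seed)
   call random_number(r_b)

   print *, 'run A draws:', r_a
   print *, 'run B draws:', r_b   ! typically completely different from run A
 end program seed_sensitivity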

Meanwhile he notes the following exhaustive list of differences between CPU and GPU implementations:

Masked assignment vs. use of a minimum to avoid divide-by-zero (illustrated in the sketch below), in rte/kernels/mo_optical_props_kernels.F90 vs. rte/kernels-openacc/mo_optical_props_kernels.F90, affecting the following kernels: delta_scale_2str_f_k, delta_scale_2str_k, increment_2stream_by_1scalar, increment_2stream_by_2stream, increment_2stream_by_nstream, increment_nstream_by_1scalar, increment_nstream_by_2stream, increment_nstream_by_nstream, inc_1scalar_by_1scalar_bybnd, inc_1scalar_by_2stream_bybnd, inc_1scalar_by_nstream_bybnd, inc_2stream_by_1scalar_bybnd, inc_2stream_by_2stream_bybnd, inc_2stream_by_nstream_bybnd, inc_nstream_by_1scalar_bybnd, inc_nstream_by_2stream_bybnd, inc_nstream_by_nstream_bybnd
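To make that distinction concrete, here is a generic, self-contained illustration (not quoted from the repository) of how a masked assignment and a minimum/epsilon guard can give different answers when a denominator is effectively zero:

 program mask_vs_floor
   ! Illustrative only; not the actual RTE kernels. Shows how a masked
   ! assignment and an epsilon floor differ when a denominator is near zero.
   implicit none
   integer, parameter :: wp = selected_real_kind(15)
   real(wp) :: numer(3), denom(3), r_mask(3), r_floor(3)

   numer  = [1._wp, 1._wp, 1._wp]
   denom  = [2._wp, 0._wp, 1.e-30_wp]
   r_mask = 0._wp

   ! Masked assignment: elements with a tiny denominator keep their previous value.
   where (denom > epsilon(1._wp)) r_mask = numer / denom

   ! Epsilon floor: every element is assigned, with the denominator clipped from below.
   r_floor = numer / max(denom, epsilon(1._wp))

   print *, 'masked :', r_mask    ! 0.5, 0.0, 0.0
   print *, 'floored:', r_floor   ! 0.5, ~4.5e15, ~4.5e15
 end program mask_vs_floor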

In addition, I have two clarification questions, which do not necessarily mean the code is wrong but would help me better understand the radiation calculations. First, in the source code there is an OpenACC directive for the variable top_at_1, but I did not see a data copy directive for the variable play. Is that done somewhere else in the code, or do you expect the compiler to handle it automatically? Does play change during the simulation?

Second, regarding the compute_Planck_source subroutine, the CPU version has a loop over nbnd but the GPU version does not, and the overall subroutine structure also looks different. Are these changes expected?
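On the first question about data movement: if an array is not named in any data clause or enclosing data region, an OpenACC compiler falls back on its implicit rules, which for arrays in a parallel region is normally an implicit copy; whether play is handled explicitly elsewhere is exactly what is being asked. Here is a schematic of the explicit form, with placeholder arithmetic rather than RRTMGP code:

 program acc_data_demo
   ! Schematic only: explicit OpenACC data clauses for array arguments.
   ! The names play/tlay mimic RRTMGP conventions (layer pressure and
   ! temperature) but the computation is a placeholder, not RRTMGP code.
   implicit none
   integer, parameter :: wp = selected_real_kind(15)
   integer, parameter :: ncol = 8, nlay = 4
   real(wp) :: play(ncol,nlay), tlay(ncol,nlay), heating(ncol,nlay)
   integer  :: icol, ilay

   play = 50000._wp   ! placeholder layer pressures [Pa]
   tlay = 250._wp     ! placeholder layer temperatures [K]

   ! With explicit clauses nothing is left to implicit rules: both inputs
   ! are copied to the device and the result is copied back to the host.
   !$acc parallel loop collapse(2) copyin(play, tlay) copyout(heating)
   do ilay = 1, nlay
     do icol = 1, ncol
       heating(icol,ilay) = tlay(icol,ilay) / play(icol,ilay)  ! placeholder arithmetic
     end do
   end do

   print *, 'sum of placeholder result:', sum(heating)
 end program acc_data_demo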

sjsprecious commented 1 year ago

Hi @RobertPincus, with the help of Brian Medeiros and Brian Eaton at NCAR, we were able to calculate the random number seeds used in the McICA treatment independently of any state variables such as pressure. We performed a new ensemble consistency test with the revised code, but unfortunately the GPU runs still failed the test.

Brian Eaton is helping me set up the PORT driver in CAM for RRTMGP, which allows us to compare instantaneous CPU/GPU results generated from identical inputs (what you proposed earlier in this thread). I will post the PORT results here once they are ready.

RobertPincus commented 1 year ago

Thanks for the update @sjsprecious

sjsprecious commented 1 year ago

Hi @RobertPincus , Brian Eaton has helped me set up the PORT simulations for CAM+RRTMGP. Below is a summary of what we have found out:

Therefore, based on the results above, we think the SW calculations in the GPU code are probably problematic compared to the corresponding CPU version. It would be great if you could run the GPU code on the CPU in the rte-rrtmgp unit tests and see whether you observe similar behavior. Thank you, and let me know if you have any questions.

brian-eaton commented 1 year ago

@RobertPincus, reading back through this issue I notice you stated that the GPU implementation run on CPUs does pass your unit testing. Are you testing with nvhpc? And if so, what's the compiler version?

RobertPincus commented 1 year ago

@brian-eaton I was going to mention this. With the introduction of single precision to the develop branch in commit e51d10ad we are now focusing CI on the configurations we expect people to use, meaning that the GPU kernels are no longer routinely tested on CPUs. Until quite recently, though, the CI was indeed testing the GPU kernels on the CPU with nvhpc (here's the most recent example of passing results). On CPUs, nvhpc is run in a container, and we update the containers to the most recent release whenever we notice a new one.

Note also that we test the GPU kernels on GPUs using nvfortran at the Swiss Supercomputing Center (example) and that these implementations are held to the same level of agreement with reference results as any other implementation (typically 1e-5 W/m2 for fluxes).

In other words, the results from @sjsprecious surprise me because they're inconsistent with our continuous testing. Your tests are no doubt more wide-ranging: do results from the GPU and CPU kernels diverge everywhere, or only for some particular set of conditions? If the latter, it would be great if you could open a PR introducing a simple test that fails; I'd make changes until that test passed.

brian-eaton commented 1 year ago

Thanks for the feedback @RobertPincus. Our testing has been at the system level, using PORT to do offline radiation calculations on a 2-degree global grid driven (I believe) by realistic atmospheric conditions. So far we have focused on difference statistics of 2D and 3D output fields for fluxes and heating rates, without trying to pick out the columns containing the worst results. That's something we can do, but it probably makes sense first to verify whether the unit tests fail on our platform (Casper) using the GPU code version on CPUs. We'll keep you updated.

RobertPincus commented 1 year ago

@brian-eaton The CI is simply make libs tests check in the top-level directory...

brian-eaton commented 1 year ago

Hi @RobertPincus. All the current unit tests run with both the CPU and GPU kernels pass on our platforms, as advertised. However, I have found that the differences between the CPU and GPU kernels can be demonstrated by the following simple change in examples/rrtmgp_allsky.F90. In code version v1.6, at line 262, replace

 sfc_alb_dir = 0.06_wp

by

 do icol = 1, ncol
    sfc_alb_dir(:,icol) = 0.06_wp + icol * 0.8_wp/ncol
 end do

This provides direct surface albedo values that vary across the test columns (from just above 0.06 up to 0.86). When I run the rrtmgp_allsky test with the CPU kernels, then again with the GPU kernels, and use the compare-to-reference.py script with the --ref_dir argument to compare the two output files, I get the following output:

 Variable lw_flux_up: No diffs
 Variable lw_flux_dn: No diffs
 Variable sw_flux_up differs (max abs difference: 6.532832e+02; max percent difference: 1.721535e+02%)
 Variable sw_flux_dn differs (max abs difference: 3.582726e+01; max percent difference: 3.987433e+00%)
 Variable sw_flux_dir: No diffs

I ran both the CPU and GPU kernels on my Linux desktop, on CPUs with the GNU compiler. Let me know if there's more I can do to help.

RobertPincus commented 1 year ago

@brian-eaton Thanks very much for this careful sleuthing. It pointed directly to an error in the code, which I'll fix (though likely not until September, when I'm back from holiday). CESM might be the only instance I've heard of in which the direct and diffuse surface albedos differ! It's great to know that someone is taking advantage of this.

In case you're curious, the error is in passing sfc_alb_dif to sw_dif_and_source() in the GPU kernels; the argument should be sfc_alb_dir.
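A toy numerical illustration of why that argument mix-up shows up in sw_flux_up once the direct albedo varies by column (all values below are made up; this is not the solver code):

 program albedo_mixup_demo
   ! Toy illustration only, not the RTE SW solver: if the diffuse albedo is
   ! used where the direct albedo belongs, the reflected direct beam (and
   ! hence sw_flux_up) is wrong wherever the two albedos differ, which fits
   ! the observation that SW differences appear once sfc_alb_dir varies.
   implicit none
   integer, parameter :: wp = selected_real_kind(15)
   real(wp) :: flux_dir_dn, sfc_alb_dir, sfc_alb_dif

   flux_dir_dn = 500._wp    ! W/m2, direct beam reaching the surface (made-up value)
   sfc_alb_dir = 0.70_wp    ! hypothetical direct-beam albedo
   sfc_alb_dif = 0.06_wp    ! hypothetical diffuse albedo

   print *, 'reflected direct beam, correct albedo:', flux_dir_dn * sfc_alb_dir
   print *, 'reflected direct beam, swapped albedo:', flux_dir_dn * sfc_alb_dif
 end program albedo_mixup_demo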

brian-eaton commented 1 year ago

Thanks @RobertPincus. I am now getting bit-for-bit identical results from the CPU and GPU kernels when both are run on CPUs. These results are from using CAM's PORT functionality to do offline radiation calculations.

RobertPincus commented 1 year ago

The main message I take away is that differing direct and diffuse surface albedos are not covered by tests in the current framework; adding such a test would be worth doing.

sjsprecious commented 1 year ago

Thanks @brian-eaton and @RobertPincus for sorting this issue out. I can confirm that with Robert's fix, the RRTMGP GPU code now passes the ensemble consistency test. We just need a new tag that includes the fix.

sjsprecious commented 1 year ago

Fixed via #225