E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Build error in rrtmgp when trying to use newer cudatoolkit 12.4 #2953

Open ndkeen opened 2 months ago

ndkeen commented 2 months ago

We currently use cuda 12.2 on pm-gpu. When trying with 12.4, I see this build error:

/global/cfs/cdirs/e3sm/ndk/repos/ndk_scream_hackathon2024-pm-gpu/components/eamxx/../eam/src/physics/rrtmgp/external/cpp/rrtmgp/mo_gas_optics_rrtmgp.h(1146): error: identifier "rrtmgp_constants<double> ::m_dry" is undefined in device code

/global/cfs/cdirs/e3sm/ndk/repos/ndk_scream_hackathon2024-pm-gpu/components/eamxx/../eam/src/physics/rrtmgp/external/cpp/rrtmgp/mo_gas_optics_rrtmgp.h(1132): error: identifier "rrtmgp_constants<double> ::grav" is undefined in device code

2 errors detected in the compilation of "/global/cfs/cdirs/e3sm/ndk/repos/ndk_scream_hackathon2024-pm-gpu/components/eam/src/physics/rrtmgp/external/cpp/examples/mo_load_coefficients.cpp".

mahf708 commented 2 months ago

@ndkeen, here's a fix for this (the YAKL_SCOPE thing isn't working for some reason ... :/)

diff --git a/cpp/rrtmgp/mo_gas_optics_rrtmgp.h b/cpp/rrtmgp/mo_gas_optics_rrtmgp.h
index 0768d44..e98eea7 100644
--- a/cpp/rrtmgp/mo_gas_optics_rrtmgp.h
+++ b/cpp/rrtmgp/mo_gas_optics_rrtmgp.h
@@ -1127,7 +1127,7 @@ public:
       });
     } else {
       // do icol = 1, ncol
-      YAKL_SCOPE( grav , const_t::grav );
+      const auto grav = const_t::grav;
       parallel_for( YAKL_AUTO_LABEL() , SimpleBounds<1>(ncol) , YAKL_LAMBDA (int icol) {
         g0(icol) = grav;
       });
@@ -1136,9 +1136,9 @@ public:
     real2d col_dry("col_dry",size(plev,1),size(plev,2)-1);
     // do ilev = 1, nlev-1
     //   do icol = 1, ncol
-    YAKL_SCOPE( m_dry , const_t::m_dry );
-    YAKL_SCOPE( m_h2o , const_t::m_h2o );
-    YAKL_SCOPE( avogad , const_t::avogad );
+    const auto m_dry = const_t::m_dry;
+    const auto m_h2o = const_t::m_h2o;
+    const auto avogad = const_t::avogad;
     parallel_for( YAKL_AUTO_LABEL() , SimpleBounds<2>(nlev-1,ncol) , YAKL_LAMBDA (int ilev , int icol) {
       real delta_plev = std::abs(plev(icol,ilev) - plev(icol,ilev+1));
       // Get average mass of moist air per mole of moist air
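
The pattern in the diff above (copy a static class constant into a plain local `const` before the lambda, so the lambda captures the local by value) can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: `constants_sketch`, `for_each_index`, and `fill_g0` are hypothetical stand-ins, not the rrtmgp or YAKL API, the constant's value is illustrative, and a serial loop replaces the device `parallel_for`, so it does not reproduce the CUDA 12.4 error itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for rrtmgp_constants<T>; the value is illustrative only.
template <class T>
struct constants_sketch {
  static constexpr T grav = T(9.80665);
};

// Hypothetical stand-in for a parallel_for: calls f(i) for each i in [0, n).
template <class F>
void for_each_index(std::size_t n, F f) {
  for (std::size_t i = 0; i < n; ++i) f(i);
}

// Fills a column array with the gravity constant, mirroring g0(icol) = grav.
std::vector<double> fill_g0(std::size_t ncol) {
  std::vector<double> g0(ncol);
  double* g0_ptr = g0.data();  // raw pointer so the lambda can capture by value

  // Local copy of the class constant, as in the diff. The lambda below
  // captures this local by value and never names the static member itself.
  const auto grav = constants_sketch<double>::grav;

  for_each_index(ncol, [=](std::size_t icol) { g0_ptr[icol] = grav; });
  return g0;
}
```

Calling `fill_g0(ncol)` returns a vector of length `ncol` with every entry set to the local copy of the constant.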

mahf708 commented 2 months ago

~Alternatively, you could just switch from YAKL to KOKKOS (since we are not really interested in profiling radiation at this stage, right?) by setting RRTMGP_ENABLE_KOKKOS (i.e., -DSCREAM_RRTMGP_ENABLE_KOKKOS='TRUE' -DSCREAM_RRTMGP_ENABLE_YAKL='FALSE' ???)~ Nope this is dysfunctional

mahf708 commented 2 months ago

PR #2954 fixes RRTMGP_ENABLE_KOKKOS...

ndkeen commented 2 months ago

I finally verified that this workaround not only avoids the build error, but also lets the case build the executable and run. So this may be the only issue with using CUDA 12.4 (found so far).

mrnorman commented 1 month ago

Agreed about the eventual kokkos target. Also, CUDA's stochastic inability to capture locally scoped variables by value in lambdas is infuriating. This seems like a regression for their compiler.

ndkeen commented 1 month ago

How best to get this into scream? It looks like components/eam/src/physics/rrtmgp/external is part of a submodule? So I don't think I can just make a branch and open a PR?

mahf708 commented 1 month ago

The specific fix will have to be in yakl, so mrnorman/yakl repo :)

Do you recommend a PR there, @mrnorman? If so, do you have a preference for how to rework YAKL_SCOPE?

mahf708 commented 4 weeks ago

done in https://github.com/E3SM-Project/rte-rrtmgp/pull/38

ndkeen commented 2 weeks ago

I still see the same build error

ndkeen commented 1 day ago

This is still a build error.

bartgol commented 1 day ago

> How best to get this into scream? It looks like components/eam/src/physics/rrtmgp/external is part of a submodule? So I don't think I can just make a branch and open a PR?

Perhaps @jgfouca has some thoughts, given that he did some work on rrtmgp lately.

mahf708 commented 1 day ago

We just need to update the submodule. It is part of the rrtmgp-k switch PR https://github.com/E3SM-Project/scream/pull/3030/files