E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Radiation is too slow on CPU in cime cases #1585

Open ndkeen opened 2 years ago

ndkeen commented 2 years ago

Noting this is a general issue. It's not clear if this is simply a performance issue; there may be something else wrong (and it may only be on CPU). Additionally, it makes it a little more difficult to test on Cori as the model runs so slowly. Can provide details, but the slowness is ~20x or more what we might expect (radiation is ~90% of the total time). The radiation time does respond to using more MPI ranks, and there is a known issue with using threads (only using 1 in these cases). Verified I still see this with master as of April 27th.

bartgol commented 2 years ago

Are you saying that running with N and 2N ranks yields the same exec time (for RRTMGP)?

ndkeen commented 2 years ago

No, the opposite -- using more MPI ranks speeds up radiation. But it's still too slow.

bartgol commented 2 years ago

Ok, that is at least not messed up.

@brhillman might have some thoughts about rad performance.

ndkeen commented 2 years ago

Adding some timers into our radiation interface, I see that a majority of the time spent in both SW and LW is in gas_optics call.

in rrtmgp_sw:
54% in gas_optics
28% in rte_sw

in rrtmgp_lw:
70% in gas_optics
15% in rte_lw
cori08% grep gas_optics components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp
#include "cpp/rrtmgp/mo_gas_optics_rrtmgp.h"
            k_dist.gas_optics(nday, nlay, top_at_1, p_lay_day, p_lev_day, t_lay_day, gas_concs_day, optics, toa_flux);
            k_dist.gas_optics(ncol, nlay, top_at_1, p_lay, p_lev, t_lay, t_sfc, gas_concs, optics, lw_sources, real2d(), t_lev);

And the routines being called are located here:

cori08% grep 'void gas_optics' components/eam/src/physics/rrtmgp/external/cpp/rrtmgp/mo_gas_optics_rrtmgp.h
  void gas_optics(const int ncol, const int nlay,
  void gas_optics(const int ncol, const int nlay,

I need to figure out how to add timers here.

Are these just where time would be spent in YAKL kernels?

bartgol commented 2 years ago

@brhillman Naive question: is it possible RRTMGP is not using the memory pool, and doing alloc/free every time it needs a temporary?
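
(In case it helps to see the pattern I mean, here is a hypothetical routine -- not a claim about what RRTMGP actually does. A temporary device Array is created and destroyed on every call; with the pool enabled that construction is a cheap sub-allocation, without it each call pays for a raw allocate/free.)

#include "YAKL.h"

// Hypothetical routine called every timestep (names are illustrative only).
void compute_with_temporary(int ncol, int nlay) {
  // Construction allocates: from the Gator pool if enabled, otherwise a raw allocation.
  yakl::Array<double,2,yakl::memDevice,yakl::styleFortran>
      scratch("scratch", ncol, nlay);
  // ... fill and use scratch in kernels ...
}   // scratch is freed again here when it goes out of scope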

PeterCaldwell commented 2 years ago

Perhaps the next step is for Ben or Noel to ask Matt Norman to comment on the situation?

mrnorman commented 2 years ago

@ndkeen , feel free to tag me in YAKL-related issues at the start. To turn on OpenMP CPU threading in YAKL when you want loop-level OpenMP inside the individual components, you need -DYAKL_ARCH="OPENMP" -DYAKL_OPENMP_FLAGS="-fopenmp -O3 ...". If you've had trouble with that, please let me know the specific trouble you're having along with a reproducer, and I'll look into it. For more info on the build process in general, please see:

https://github.com/mrnorman/YAKL/wiki/Using-and-Compiling-with-YAKL

If you find places where the documentation can be improved, please let me know, and I'm happy to improve it.

For non-threaded cases, you need to make sure -DYAKL_CXX_FLAGS="-O3 ..." is set; otherwise optimizations aren't turned on for YAKL-compiled files.

Can you post the verbose make output lines from your scream build for the RRTMGP C++ files so we can make sure the flags are appropriate? Also, the CMake configure will tell you, upon add_subdirectory(yakl), which flags YAKL is applying to YAKL files as a CMake status message.

Even without optimizations, a 20x slowdown is pretty severe without threading coming into play.

Please let me know if you have other questions.

mrnorman commented 2 years ago

@bartgol regarding the memory pool, when the memory pool is in use, "Create pool" is printed to stdout for every pool creation. It is on by default unless the user sets export GATOR_DISABLE=1.

mrnorman commented 2 years ago

Regarding hooking up YAKL's timers to the existing E3SM GPTL, please see the following documentation:

https://github.com/mrnorman/YAKL/wiki/YAKL-Timers

With this, you can get finer timers within the RRTMGP library. If you have trouble with this, please let me know, and I'll help get it working.
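
As a minimal sketch (assuming the yakl::timer_start / yakl::timer_stop calls described on that wiki page; the label and placement here are just illustrative), wrapping a suspect region looks roughly like this:

#include "YAKL.h"

int main() {
  yakl::init();
  {
    yakl::timer_start("gas_optics_region");   // hypothetical label
    // ... the expensive call, e.g. k_dist.gas_optics(...), would go here ...
    yakl::timer_stop("gas_optics_region");
  }
  yakl::finalize();
  return 0;
}

Built with -DYAKL_PROFILE, the labels should then show up in YAKL's timer report (or in GPTL once the two are hooked together).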

ndkeen commented 2 years ago

This is all without threads.

Hmm, OK, might this be an issue:

-- ** YAKL_ARCH not set. Building YAKL for a serial CPU backend **
-- ** YAKL is using the following C++ flags:  **

or maybe that's OK

./components/cmake/build_model.cmake:70:      # YAKL_ARCH can be CUDA, HIP, SYCL, OPENMP45, or empty
./components/cmake/build_model.cmake:73:        set(YAKL_ARCH "CUDA")
./components/cmake/build_model.cmake:79:        set(YAKL_ARCH "")

We use CUDA for GPU runs, but it should be empty for CPU? I guess the serial backend means MPI-only.

Here are the typical flags for a YAKL F90 file, I think:

cd
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/bld/cmake-bld/externals/YAKL
&& python3
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/Tools/e3sm_compile_wrap.py
/opt/cray/pe/craype/2.7.15/bin/ftn -DMPICH_SKIP_MPICXX
-DSCREAM_CONFIG_IS_CMAKE
-I/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src
-fallow-argument-mismatch -g -cpp -ffree-line-length-none -Wall -O3
-DNDEBUG -O3 -O3 -w -c
/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src/YAKL_gator_mod.F90
-o CMakeFiles/yakl.dir/src/YAKL_gator_mod.F90.o
cd
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/bld/cmake-bld/externals/YAKL
&& python3
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/Tools/e3sm_compile_wrap.py
/opt/cray/pe/craype/2.7.15/bin/CC -DMPICH_SKIP_MPICXX
-DSCREAM_CONFIG_IS_CMAKE
-I/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src
-DTHRUST_IGNORE_CUB_VERSION_CHECK -g1 -Wall -O3 -DNDEBUG -O3
-fopenmp-simd -w -std=gnu++14 -MD -MT
externals/YAKL/CMakeFiles/yakl.dir/src/YAKL.cpp.o -MF
CMakeFiles/yakl.dir/src/YAKL.cpp.o.d -o
CMakeFiles/yakl.dir/src/YAKL.cpp.o -c
/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src/YAKL.cpp

ndkeen commented 2 years ago

I'm chatting with Matt on Slack, but just wanted to note here that I did add some quick/temporary YAKL timers to dive down a little more. Most of the slow time is in gas_optics; for this ne30 case that's about 33 seconds. Of that, 26.4 seconds are in compute_tau_absorption and 6.3 sec in source. Within compute_tau_absorption, almost all the time is in gas_optical_depths_major and gas_optical_depths_minor (which are just large parallel_fors).

Matt asked for directions to reproduce. To test, clone scream (git clone git@github.com:E3SM-Project/scream.git), and:

cd scream
cd cime/scripts
create_test SMS_Ln24_P64x1.ne30_ne30.F2000SCREAMv1

where the compiler should be chosen such that it builds for CPU (depending on machine). This runs for 24 steps, but I'm sure only 2 or 4 are needed. The 64x1 should ensure no threads.

I don't yet have more detailed timers in the code, and could make a branch with them if it helps.

On just the AMD CPU nodes of Perlmutter, that runs in 6 minutes (shorter with fewer steps).

bartgol commented 2 years ago
-- ** YAKL_ARCH not set. Building YAKL for a serial CPU backend **
-- ** YAKL is using the following C++ flags:  **

Does this mean C++ files for YAKL are built without any code optimization (that is, as if -O0 was passed)? It might be worth checking if rrtmgp timers are the same as in DEBUG=ON. If that's the case, maybe the issue is that we are not getting opt flags to YAKL?

Edit: nvm, bld/cmake-bld/externals/YAKL/CMakeFiles/yakl.dir/flags.make shows the correct CXX flags.

mrnorman commented 2 years ago

His verbose make output shows -O3, which makes me think that scream is setting CXXFLAGS to pass flags into CMake for C++ source files. That's common practice for many Kokkos codes, I think, so not too surprising. I am a bit concerned about the -g. For GNU, does that turn off optimizations?

Edit: OK, so maybe it's the flags.make rather than CXXFLAGS.

bartgol commented 2 years ago

It should not. -g only adds debug symbols to the executable; I don't think it has any implication for code optimizations (which is why CMake adds both -g and -O0 for debug builds).

mrnorman commented 2 years ago

Even opt flags wouldn't account for a 20x slowdown, though. I think it's something else that's up. I'll work on Noel's reproducer on Summit tomorrow and dig in further.

bartgol commented 2 years ago

Thanks Matt!

mrnorman commented 2 years ago

@whannah1 volunteered to run a simple CPU compset on Summit and Cori to see if we see a similar slowdown in E3SM. I haven't gotten to running create_test SMS_Ln24_P64x1.ne30_ne30.F2000SCREAMv1 on Summit. It's down today, but hopefully I can do it tomorrow. I don't have access to Cori, so I cannot reproduce the scream situation there.

@ndkeen , your 20x slowdown for rrtmgpxx on cori, was that with KNL or a more traditional CPU? Also, have you tried compiling with the Intel compiler? It's not clear what version of GNU you're using, but older versions do not vectorize C++ very well. Intel always seems to be faster across the board for me.

I'm hoping to give more information soon.

ndkeen commented 2 years ago

I see the slowdown on any CPU I've tried -- which is only cori-knl and pm-cpu (the CPU-only nodes of Perlmutter, which are AMDs). On Cori, I've tried with Intel and GNU (v9). And with PM, it's with GNU (v10 and v11).

mrnorman commented 2 years ago

Noel, in the meantime, can you e-mail me or slack me a GPTL timer file to look at that shows this problem?

mrnorman commented 2 years ago

Working past my typical time, but I was able to reproduce a slowdown in standalone rte-rrtmgp. There are two things at play here:

  1. The Fortran code you're probably running is the original Fortran code, not the OpenACC version. YAKL was patterned after the OpenACC code. The OpenACC Fortran code runs 2.25x slower than the original Fortran code.
  2. After zeroing in on the most expensive components, I'm fairly certain the slowdown is occurring in the interpolate2D and interpolate3D routines and that it is due to overheads in YAKL's array slicing. Fortran can probably short-circuit a lot of that arithmetic, but I'm not sure if C++ can.

Currently rrtmgpxx is about 3x slower in standalone than OpenACC rrtmgp in Fortran. Neither I nor @whannah1 have seen a 20x slowdown so far.

My plan is to inline the interpolation routines into the routines that call them in order to avoid the need for slicing. Hopefully that will bring the performance back up to something close to the Fortran OpenACC code.
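
Roughly, the change amounts to the generic schematic below (hypothetical names and plain C++, not the actual YAKL/RRTMGP code): the helper-on-a-slice version constructs a small temporary view for every (column, layer) iteration, while the inlined version just does index arithmetic on the full table.

#include <cstddef>
#include <vector>

// Stand-in for a per-call array slice/view (illustrative only).
struct View1D {
  const double *data;
  double operator()(std::size_t i) const { return data[i]; }
};

// "Before": an interpolation helper called once per (col,lay) with a freshly built view.
inline double interp_on_slice(const View1D &v, double w) {
  return (1.0 - w) * v(0) + w * v(1);
}

// "After": the same arithmetic inlined, indexing the full table directly.
void gas_optics_like_loop(const std::vector<double> &table, std::size_t ncol,
                          std::size_t nlay, double w, std::vector<double> &out) {
  for (std::size_t icol = 0; icol < ncol; ++icol) {
    for (std::size_t ilay = 0; ilay < nlay; ++ilay) {
      const std::size_t base = (icol * nlay + ilay) * 2;
      // Sliced version: out[icol * nlay + ilay] = interp_on_slice(View1D{&table[base]}, w);
      out[icol * nlay + ilay] = (1.0 - w) * table[base] + w * table[base + 1];
    }
  }
}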

ndkeen commented 2 years ago

Hmm, well that may point to a different issue, as I'm seeing a lot of time in gas_optical_depths_major and gas_optical_depths_minor.

I can send timing info, but note we don't currently have many timers in the code. The ones I've added are temporary, with names that wouldn't make sense unless you had my local changes (which you are also welcome to see). And finally, the YAKL timers I added are not in the timing files at all, as they are not yet hooked up to GPTL. So they are just in e3sm.log.

mrnorman commented 2 years ago

gas_optical_depths_major and gas_optical_depths_minor make many calls to interpolate3D and interpolate2D, respectively, so that corroborates well. I don't think I'll need any timers from here. Hopefully my hunch is right and it's an easy fix.

mrnorman commented 2 years ago

https://github.com/E3SM-Project/rte-rrtmgp/pull/21 fixes the performance for rte-rrtmgp. After inlining interpolate2D and interpolate3D, standalone tests indicate the C++ LW is now 17% more expensive than the OpenACC Fortran code and the C++ SW is now 38% more expensive than the OpenACC Fortran. This is with GNU 9.x on my laptop, and I think it will give similar results in other contexts. I'll look into SW performance later, but for now, this should be a good improvement. This PR will require the latest YAKL main branch.

ndkeen commented 2 years ago

I updated YAKL to main and updated radiation to use Matt's branch above. I ran an f30 case on cori-knl with GNU & Intel before this change as well as GNU & Intel after this change. I can show raw data, but with this change radiation runs about 1.4x faster than before. However, the radiation is still WAY too slow. If I compare the speed of the radiation to the speed of the dynamics, we would probably expect radiation to be faster than dynamics. But after this change, radiation is about 30x slower than dynamics (for this case on this machine).

This is my local repo location to verify the changes have been included /global/cfs/cdirs/e3sm/ndk/newyakl-se03-may6

mrnorman commented 2 years ago

Hmm, that's not in line with standalone results. I don't see a scream branch or fork with your github username. Can you push the branch you created to run that test and mention it here?

ndkeen commented 2 years ago

I will have to create one. But note you can run essentially the same thing with something like create_test SMS_Ln24_P64x1.ne30_ne30.F2010-SCREAMv1 (the compset name has changed in recent scream -- and of course set the MPI rank count to whatever works best on the machine of choice).

mrnorman commented 2 years ago

I'm not sure scream is set up on Summit at the moment:

[imn@login5:~/scream/cime/scripts] 8-) ./create_test SMS_Ln24_P64x1.ne30_ne30.F2010-SCREAMv1.summit_gnu
...
    Case dir: /gpfs/alpine/cli115/proj-shared/imn/e3sm_scratch/SMS_Ln24_P64x1.ne30_ne30.F2010-SCREAMv1.summit_gnu.20220523_161354_ku9cwj
    Errors were:
        WARNING: Found directory /ccs/home/imn/.cime but no cmake macros within, set env variable CIME_NO_CMAKE_MACRO to use deprecated config_compilers method
        ERROR: Error! Could not find version 5.1 for package yaml.

I'll need to look into how to change the modules.

whannah1 commented 2 years ago

@mrnorman just do a "module load python" and it will build, but there are other problems that prevent it from running on Summit at the moment. See: https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3386015745/How+To+Run+SCREAMv1

mrnorman commented 2 years ago

Scream seems to be looking for CUDA even with a "gnu" build.

CMake Error at /autofs/nccs-svm1_sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/cmake-3.20.2-vtrkiq4jcfz6wnt5z3yiht3igee2zkk6/share/cmake-3.20/Modules/CMakeDetermineCUDACompiler.cmake:174 (message):
  Failed to find nvcc.

I think I can work around it, but I'm not sure if this issue is a known one or not.

whannah1 commented 2 years ago

@mrnorman yeah, the CPU-only build isn't set up at all.

ndkeen commented 2 years ago

I made a branch ndk/radyp which includes adding some timers and adding YAKL_PROFILE to the cori-knl build. (just pushed a commit to add a few more radiation timers in scream)

And if I did it right, it also includes updating radiation to use the branch you noted above as well as using main of YAKL.

With that ndk/radyp on cori-knl, I tried SMS_Ln24_P675x1.ne30_ne30.F2010-SCREAMv1.cori-knl_gnu

GNU compiler, 11 nodes, 64x1

cori05% .
/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/SMS_Ln24_P675x1.ne30_ne30.F2010-SCREAMv1.cori-knl_gnu.20220523_132849_nkdls8

cori05% grep -i ATM_RUN run/HommeTime_stats 
"CPL:ATM_RUN"                                        -        675      675 1.620000e+04   5.787309e+04   121.149 (   181      0)    59.528 (   375      0)

cori05% grep EAMxx::Radiation run/HommeTime_stats
"a_i:EAMxx::Radiation::init"                         -        675      675 6.750000e+02   2.364617e+03     4.721 (   627      0)     0.678 (   177      0)
"a:EAMxx::Radiation::run"                            -        675      675 1.620000e+04   3.590516e+04    85.107 (   177      0)    32.390 (   361      0)

Note I think this ~85 seconds is low, as there is a large MPI imbalance for radiation, but you can see it's ~70% of ATM.

And here is a key YAKL timer in the gas optics:

cori05% zgrep "gop1 comp" run/e3sm.log.59324600.220523-144544.gz
  0: gop1 compute_gas_taus                             24          1.372411e+01   5.690169e-01   6.043691e-01   

mrnorman commented 2 years ago

I ran a test on Summit with vanilla E3SM, and I'm seeing rrtmgpxx as 17% more expensive than rrtmgp. Reproducer on E3SM current master (git hash 31b474f) with latest YAKL branch and the mrnorman/improve-cxx-performance branch on E3SM's fork of rte-rrtmgp:

./create_test SMS_Ld1_P42x1.ne4pg2_ne4pg2.F2010-CICE.summit_gnu.eam-rrtmgp
./create_test SMS_Ld1_P42x1.ne4pg2_ne4pg2.F2010-CICE.summit_gnu.eam-rrtmgpxx

Timers from timing/model_timing_stats:

rrtmgp  : "a:radiation" - 42 42 1.050000e+03   1.954533e+02  5.019 (2  0)  2.821 (20 0)
rrtmgpxx: "a:radiation" - 42 42 1.050000e+03   2.291413e+02  6.048 (2  0)  3.353 (20 0)

I'm about to run the larger ne30np4 case to see how the larger per-node workload affects things.

ndkeen commented 2 years ago

I don't think there is a problem with radiation performance in those types of tests. So it might be something peculiar with how scream is interfacing with radiation.

In the past, I also saw a slight slowdown with rrtmgpxx compared to rrtmgp -- so 17% sounds reasonable. But I suspect the time in radiation is smaller than the time in dynamics?

If it helps, I could reproduce the issue on chrysalis.

mrnorman commented 2 years ago

In this particular case, radiation consumed 20.8% of the total ATM runtime, and dynamics consumed 19.8% of the total ATM runtime.

mrnorman commented 2 years ago

OK, for ne30np4 on a single node, I'm seeing rrtmgpxx being 2.05x slower than rrtmgp, so it does get worse with larger per-node workloads. It's still a far cry from the 20-30x Noel's seeing on KNL, but it is concerning. I can't imagine why the per-node workload would affect things at the moment. If anyone has ideas, let me know.

PCOLS is 16 in both cases, so no individual call should have more work in the ne30np4 case. I confirmed in the rrtmgpxx ne30 case that only one pool per MPI task is created.

@whannah1 @brhillman do you think it's possible that ne30np4 will have different physical properties on the whole that would lead to certain aspects of rrtmgp running more than we see at ne4pg2? E.g., more aerosols, more clouds or something like that?

mrnorman commented 2 years ago

More detailed timers for rrtmgpxx SW in the ne30 run:

                                        Called  Wallclock   max        min        UTR Overhead
"a:radiation"                           - 219 - 27.803162   0.434538   0.000318   0.000026 
  "a:radheat_tend"                      - 219 -  0.001510   0.000017   0.000005   0.000026 
  "a:rad_heating_rate"                  - 219 -  0.028892   0.000201   0.000110   0.000026 
  "a:rrtmgp_check_temperatures"         -  73 -  0.000784   0.000013   0.000009   0.000009 
  "a:rad_cld_optics_sw"                 -  73 -  0.741915   0.014557   0.008836   0.000009 
  "a:rad_gas_concentrations_sw"         - 146 -  0.007638   0.000065   0.000044   0.000017 
  "a:rad_aer_optics_sw"                 -  73 -  2.718852   0.043657   0.033695   0.000009 
  "a:rad_rrtmgp_run_sw"                 -  73 -  8.186311   0.142902   0.080164   0.000009 
    "a:rrtmgpxx_run_sw_gas_optics"      -  73 -  1.884980   0.033894   0.017520   0.000009 
    "a:rrtmgpxx_run_sw_aerosol_optics"  -  73 -  0.501275   0.008935   0.004781   0.000009 
    "a:rrtmgpxx_run_sw_rte_sw_clearsky" -  73 -  2.607085   0.046776   0.026114   0.000009 
    "a:rrtmgpxx_run_cloud_optics"       -  73 -  0.390708   0.007257   0.003945   0.000009 
    "a:rrtmgpxx_run_sw_rte_sw_allsky"   -  73 -  2.572721   0.046005   0.025174   0.000009 
  "a:rad_expand_fluxes_sw"              -  73 -  0.030419   0.000625   0.000336   0.000009 
  "a:rad_heating_rate_sw"               -  73 -  0.001403   0.000026   0.000016   0.000009 
  "a:rad_cld_optics_lw"                 -  73 -  0.647640   0.010997   0.008236   0.000009 
  "a:rad_aer_optics_lw"                 -  73 -  1.682485   0.025318   0.021413   0.000009 
  "a:rrtmgp_run_lw"                     -  73 - 12.924088   0.184283   0.154209   0.000009 

mrnorman commented 2 years ago

I made some changes to the rte-rrtmgp branch (mostly cleanup). The timings are better, but not by too much. In standalone, I have the timings below.

I mentioned this earlier, but there are two Fortran versions: the original (faster) version, and the version refactored for efficiency in OpenACC on GPUs. The C++ code is patterned after the refactored Fortran code, so the appropriate comparison is the refactored Fortran code.

As you can see below, the original Fortran code is more than 2x faster than the refactored Fortran code. The C++ code for LW is actually faster than the refactored OpenACC Fortran code but is 2.3x slower than the original Fortran code. The C++ code for SW is 1.45x slower than the refactored OpenACC Fortran code and 2.3x slower than the original Fortran code. This is for 128 columns in standalone.

The 2.3x slowdown compared to the original Fortran code in standalone agrees pretty well with the 2x slowdown seen in ne30np4 on summit. Most of the slowdown seems to be due to the refactoring for GPU performance rather than due to using rrtmgpxx instead of rrtmgp. I'm assuming the rrtmgp option in E3SM uses the original Fortran code and not the refactored code. So I feel like the timings are now well-explained.

Regarding the 20-30x slowdown Noel is seeing, this could be either a scream configuration / integration issue or a KNL-specific issue. I don't have access to a KNL for testing, and scream doesn't run on the CPU on summit with default cime configurations, so someone else may need to look into Noel's results from here.

LW (128 cols, GNU 11)
  C++                         : 0.1898857
  Fortran (original)          : 0.0840758756
  Fortran (OpenACC refactored): 0.191884473

SW (128 cols, GNU 11)
  C++                         : 0.2358371
  Fortran (original)          : 0.100466847
  Fortran (OpenACC refactored): 0.162026659

@ndkeen , you may want to try setting the shell environment variable GATOR_DISABLE to the value 1 in the config_machines.xml file and see how it affects performance on KNL for your large per-node workload with scream. The effect of turning off the pool allocator on the CPU is typically negative for performance, but it's worth looking at just to see.

mrnorman commented 2 years ago

Another observation is that Cori's documentation shows the KNL nodes have "96 GB (DDR4), 16 GB (MCDRAM)". If the "MCDRAM" is the KNL's high-bandwidth memory, then we have a problem running 64 MPI tasks on a single node. Each MPI task will create its own memory pool of 1GB at yakl::init(), meaning 64GB is allocated from the start. This might mean the C++ code is thrashing memory to slower DDR4 because of the pool allocator. So turning it off might be the right choice for that architecture.

ndkeen commented 2 years ago

This very slow behavior is present on the AMD nodes as well. I assume it will be the case on any CPU.

I tried with GATOR_DISABLE=1 and the performance was slightly worse.

mrnorman commented 2 years ago

I suppose that it's a scream-specific issue then.

ndkeen commented 2 years ago

I was also assuming it was specific to scream, or at least to the way scream is calling the radiation.

One thing I noticed is that I'm not getting the YAKL timer output when I run on GPUs. I'm building with -DYAKL_PROFILE -DHAVE_MPI -- it does work on CPU.

mrnorman commented 2 years ago

Did you set those flags in the CMake variable YAKL_CUDA_FLAGS? https://github.com/mrnorman/YAKL/wiki/Using-and-Compiling-with-YAKL#table-of-backend-options

ndkeen commented 2 years ago

Hmm, I was adding those flags to the C++ compiler. We aren't doing anything with YAKL_CUDA_FLAGS.

mrnorman commented 2 years ago

The documentation is pretty detailed regarding CMake integration of YAKL and codes that use it. It's highly recommended to follow it. Here's an example from E3SM: https://github.com/E3SM-Project/E3SM/blob/master/components/cmake/build_model.cmake#L68 https://github.com/E3SM-Project/E3SM/blob/master/components/eam/src/physics/rrtmgp/cpp/CMakeLists.txt

PeterCaldwell commented 2 years ago

Given the challenges we're having with CIME not using the options we think it is, I wonder whether the "SCREAM is 20x slower but standalone isn't" behavior is due to problems setting flags...

ndkeen commented 2 years ago

I am just guessing here, but in components/eam/src/physics/rrtmgp/external, a global grep yields ~76 places in C++ where a YAKL_LAMBDA is used. And there are only two locations that have YAKL_DEVICE_LAMBDA, including in the routine I've narrowed down as being suspiciously slow on CPU (gas_optical_depths_minor).

./cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp:241:  parallel_for( Bounds<3>(nlay,ncol,{0,max_gpt_diff}) , YAKL_DEVICE_LAMBDA (int ilay, int icol, int igpt0) {
./cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp:28:  parallel_for( Bounds<3>({2,ngpt},nlev,ncol) , YAKL_DEVICE_LAMBDA (int igpt, int ilev, int icol) {
brhillman commented 2 years ago

I believe the YAKL_DEVICE_LAMBDA is needed in kernels that use atomic reductions.

mrnorman commented 2 years ago

That has changed with the new YAKL btw. But either way, as mentioned earlier, there is no difference between YAKL_LAMBDA and YAKL_DEVICE_LAMBDA on the CPU, nor do atomic operations actually do anything different on the CPU:

https://github.com/mrnorman/YAKL/blob/main/src/YAKL_defines.h#L59

https://github.com/mrnorman/YAKL/blob/main/src/YAKL_atomics.h#L11
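
A minimal sketch of the pattern in question (a hypothetical kernel, not taken from rrtmgpxx): with the serial CPU backend, YAKL_LAMBDA and YAKL_DEVICE_LAMBDA expand to the same thing, and yakl::atomicAdd is just a plain add, so this should behave identically either way.

#include "YAKL.h"
using yakl::fortran::parallel_for;
using yakl::fortran::Bounds;

int main() {
  yakl::init();
  {
    int constexpr ncol = 8, ngpt = 16;
    yakl::Array<double,1,yakl::memDevice,yakl::styleFortran> colsum("colsum", ncol);
    parallel_for( Bounds<1>(ncol), YAKL_LAMBDA (int icol) { colsum(icol) = 0; });
    parallel_for( Bounds<2>(ngpt, ncol), YAKL_LAMBDA (int igpt, int icol) {
      // Many igpt hit the same icol, so this needs an atomic on GPU; on CPU it is a plain add.
      yakl::atomicAdd( colsum(icol), 1.0 );
    });
  }
  yakl::finalize();
  return 0;
}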

ndkeen commented 2 years ago

I'm going to close this issue, as the radiation for SCREAM cases on CPUs is no longer such a large percentage of the total time. I had wanted to show some measurements to make things more transparent, but there are now several factors and it's better to move forward.

I think one of the largest issues affecting why I originally thought it was slow was a combination of two things: a) I had been unable to use more MPI ranks per node, as there was (and still is) an issue with the Slingshot network that we have a workaround for, and b) a known issue of RRTMGPXX not making use of threads. This meant that other parts of the ATM were seeing a benefit from threading while the radiation was not.

For an ne30 case on pm-cpu with 43 nodes (F2010-SCREAMv1-noAero.ne30_ne30), the radiation is about 47% of the ATM run time, which may be expected. This case used 128 MPI ranks per node and no threading. We also know that the radiation in particular has poor load balancing, as we've not implemented some of the techniques used in vanilla E3SM to load balance over columns.