ndkeen opened this issue 2 years ago
Are you saying that running with N and 2N ranks yields the same exec time (for RRTMGP)?
No, the opposite -- using more MPI's speeds up radiation. But it's still too slow.
Ok, that is at least not messed up.
@brhillman might have some thoughts about rad performance.
Adding some timers into our radiation interface, I see that a majority of the time spent in both SW and LW is in the gas_optics call.
in rrtmgp_sw:
54% in gas_optics
28% in rte_sw
in rrtmgp_lw:
70% in gas_optics
15% in rte_lw
cori08% grep gas_optics components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp
#include "cpp/rrtmgp/mo_gas_optics_rrtmgp.h"
k_dist.gas_optics(nday, nlay, top_at_1, p_lay_day, p_lev_day, t_lay_day, gas_concs_day, optics, toa_flux);
k_dist.gas_optics(ncol, nlay, top_at_1, p_lay, p_lev, t_lay, t_sfc, gas_concs, optics, lw_sources, real2d(), t_lev);
And the routines being called are located here:
cori08% grep 'void gas_optics' components/eam/src/physics/rrtmgp/external/cpp/rrtmgp/mo_gas_optics_rrtmgp.h
void gas_optics(const int ncol, const int nlay,
void gas_optics(const int ncol, const int nlay,
I need to figure out how to add timers here.
Are these just where time would be spent in YAKL kernels?
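For intuition, a scoped timer along these lines can be dropped around a suspect call site. This is a minimal sketch using std::chrono as a stand-in for the YAKL/GPTL timers discussed in this thread; the routine name and the gas_optics_stub function are illustrative, not the actual RRTMGP code.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Minimal RAII timer, a stand-in for a timer_start/timer_stop pair.
// The real YAKL timers can be hooked into E3SM's GPTL instead of printing.
struct ScopedTimer {
  std::string name;
  std::chrono::steady_clock::time_point t0;
  explicit ScopedTimer(std::string n)
      : name(std::move(n)), t0(std::chrono::steady_clock::now()) {}
  ~ScopedTimer() {
    double dt = std::chrono::duration<double>(
                    std::chrono::steady_clock::now() - t0).count();
    std::printf("%s: %.6f s\n", name.c_str(), dt);
  }
};

// Hypothetical hot routine: the timer prints its elapsed time on scope exit.
double gas_optics_stub(int n) {
  ScopedTimer t("gas_optics");
  double s = 0.0;
  for (int i = 1; i <= n; ++i) s += 1.0 / i;
  return s;
}
```

The RAII pattern guarantees the stop side fires even on early returns, which is handy when instrumenting deeply nested kernels.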
@brhillman Naive question: is it possible RRTMGP is not using the memory pool, and doing alloc/free every time it needs a temporary?
Perhaps the next step is for Ben or Noel to ask Matt Norman to comment on the situation?
@ndkeen , feel free to tag me in YAKL-related issues at the start. To turn on OpenMP CPU threading in YAKL when you want loop-level OpenMP inside the individual components, you need -DYAKL_ARCH="OPENMP" -DYAKL_OPENMP_FLAGS="-fopenmp -O3 ...". If you've had trouble with that, please let me know the specific trouble you're having along with a reproducer, and I'll look into it. For more info on the build process in general, please see:
https://github.com/mrnorman/YAKL/wiki/Using-and-Compiling-with-YAKL
If you find places where the documentation can be improved, please let me know, and I'm happy to improve it.
For non-threaded cases, you need to make sure -DYAKL_CXX_FLAGS="-O3 ..." is set; otherwise optimizations aren't turned on for YAKL-compiled files.
Can you post the verbose make output lines from your scream build for the RRTMGP C++ files so we can make sure the flags are appropriate? Also, the CMake configure will tell you, upon add_subdirectory(yakl), which flags YAKL is applying to YAKL files as a CMake status message.
Even without optimizations, a 20x slowdown is pretty severe without threading coming into play.
Please let me know if you have other questions.
@bartgol regarding the memory pool: when the memory pool is in use, "Create pool" is printed to stdout for every pool creation. It is on by default unless the user sets export GATOR_DISABLE=1.
Regarding hooking up YAKL's timers to the existing E3SM GPTL, please see the following documentation:
https://github.com/mrnorman/YAKL/wiki/YAKL-Timers
With this, you can get finer timers within the RRTMGP library. If you have trouble with this, please let me know, and I'll help get it working.
This is all without threads.
Hmm, OK, might this be an issue:
-- ** YAKL_ARCH not set. Building YAKL for a serial CPU backend **
-- ** YAKL is using the following C++ flags: **
or maybe that's OK
./components/cmake/build_model.cmake:70: # YAKL_ARCH can be CUDA, HIP, SYCL, OPENMP45, or empty
./components/cmake/build_model.cmake:73: set(YAKL_ARCH "CUDA")
./components/cmake/build_model.cmake:79: set(YAKL_ARCH "")
We use CUDA for GPU runs, but it should be empty for CPU? I guess the serial backend means MPI-only.
Here are typical flags for a YAKL f90 file, I think:
cd
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/bld/cmake-bld/externals/YAKL
&& python3
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/Tools/e3sm_compile_wrap.py
/opt/cray/pe/craype/2.7.15/bin/ftn -DMPICH_SKIP_MPICXX
-DSCREAM_CONFIG_IS_CMAKE
-I/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src
-fallow-argument-mismatch -g -cpp -ffree-line-length-none -Wall -O3
-DNDEBUG -O3 -O3 -w -c
/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src/YAKL_gator_mod.F90
-o CMakeFiles/yakl.dir/src/YAKL_gator_mod.F90.o
cd
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/bld/cmake-bld/externals/YAKL
&& python3
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se03-may6/f120cpu.F2000SCREAMv1.ne120_r0125_oRRS18to6v3.se03-may6.gnu.24s.n032b64x1.Hremap512.ekni.N576.ts75.12sb.s8.nospa.pmu/Tools/e3sm_compile_wrap.py
/opt/cray/pe/craype/2.7.15/bin/CC -DMPICH_SKIP_MPICXX
-DSCREAM_CONFIG_IS_CMAKE
-I/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src
-DTHRUST_IGNORE_CUB_VERSION_CHECK -g1 -Wall -O3 -DNDEBUG -O3
-fopenmp-simd -w -std=gnu++14 -MD -MT
externals/YAKL/CMakeFiles/yakl.dir/src/YAKL.cpp.o -MF
CMakeFiles/yakl.dir/src/YAKL.cpp.o.d -o
CMakeFiles/yakl.dir/src/YAKL.cpp.o -c
/global/cfs/cdirs/e3sm/ndk/se03-may6/externals/YAKL/src/YAKL.cpp
I'm chatting with Matt on slack, but just wanted to note here that I did add some quick/temporary yakl timers to dive down a little more. Most of the slow time is in gas_optics; for this ne30 case that's about 33 seconds. Of that, 26.4 seconds are in compute_tau_absorption and 6.3 sec in source. Within compute_tau_absorption, almost all the time is in gas_optical_depths_major and gas_optical_depths_minor (which are just large pfors).
Matt asked for directions to reproduce:
To test, clone scream (git clone git@github.com:E3SM-Project/scream.git), and:
cd scream
cd cime/scripts
create_test SMS_Ln24_P64x1.ne30_ne30.F2000SCREAMv1
where the compiler should be chosen such that it will be built for CPU (depending on machine). It runs for 24 steps, but I'm sure only 2 or 4 are needed. The 64x1 should ensure no threads.
I don't yet have more detailed timers in the code, and could make a branch with them if it helps.
On just the AMD CPUs of Perlmutter, that runs in 6 minutes (shorter with fewer steps).
-- ** YAKL_ARCH not set. Building YAKL for a serial CPU backend **
-- ** YAKL is using the following C++ flags: **
Does this mean C++ files for YAKL are built without any code optimization (that is, as if -O0 was passed)? It might be worth checking if the rrtmgp timers are the same as in DEBUG=ON. If that's the case, maybe the issue is that we are not getting opt flags to YAKL?
Edit: nvm, bld/cmake-bld/externals/YAKL/CMakeFiles/yakl.dir/flags.make shows the correct CXX flags.
His verbose make output shows -O3, which makes me think that scream is setting CXXFLAGS to pass flags into CMake for C++ source files. That's common practice for many Kokkos codes, I think, so not too surprising. I am a bit concerned about the -g. For GNU, does that turn off opts?
Edit: OK, so maybe it's the flags.make rather than CXXFLAGS.
It should not. -g should only add debug symbols to the exe file; I don't think it has any implication for code optimizations (which is why CMake adds both -g and -O0 for debug builds).
Even opt flags wouldn't account for a 20x slowdown, though. I think it's something else that's up. I'll work on Noel's reproducer on Summit tomorrow and dig in further.
Thanks Matt!
@whannah1 volunteered to run a simple CPU compset on Summit and Cori to see if we see a similar slowdown in E3SM. I haven't gotten to running create_test SMS_Ln24_P64x1.ne30_ne30.F2000SCREAMv1 on Summit. It's down today, but hopefully I can do it tomorrow. I don't have access to Cori, so I cannot reproduce the scream situation there.
@ndkeen , your 20x slowdown for rrtmgpxx on cori, was that with KNL or a more traditional CPU? Also, have you tried compiling with the Intel compiler? It's not clear what version of GNU you're using, but older versions do not vectorize C++ very well. Intel always seems to be faster across the board for me.
I'm hoping to give more information soon.
I see the slowdown on any CPU I've tried -- which is only cori-knl and pm-cpu (cpu-only nodes of Perlmutter, which are AMD's). On Cori, I've tried with Intel/GNU (v9). And with PM, it's with GNU (v10 and v11).
Noel, in the meantime, can you e-mail me or slack me a GPTL timer file to look at that shows this problem?
Working past my typical time, but I was able to reproduce a slowdown in standalone rte-rrtmgp. There are two things at play here:
Currently rrtmgpxx is about 3x slower in standalone than OpenACC rrtmgp in Fortran. Neither I nor @whannah1 have seen a 20x slowdown so far.
My plan is to inline the interpolation routines into the routines that call them in order to avoid the need for slicing. Hopefully that will bring the performance back up to something close to the Fortran OpenACC code.
Hmm, well that may point to a different issue, as I'm seeing a lot of time in gas_optical_depths_major and gas_optical_depths_minor.
I can send timing info, but note we don't currently have many timers in the code. The ones I've added are temporary, with names that wouldn't make sense unless you had my local changes (which you are also welcome to see). And finally, the YAKL timers I added are not in the timing files at all as they are not yet hooked up to GPTL; they are just in e3sm.log.
gas_optical_depths_major and gas_optical_depths_minor make many calls to interpolate3D and interpolate2D, respectively, so that corroborates well. I don't think I'll need any timers from here. Hopefully my hunch is right and it's an easy fix.
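As a rough illustration of why inlining can help here (a toy sketch with hypothetical array shapes and names, not the actual RRTMGP kernels): a helper that receives a freshly built slice per (layer, column) pair pays for a small allocation/copy on every call in the hot loop, while inlining the interpolation lets the compiler index the parent array directly.

```cpp
#include <vector>

// Hypothetical stand-in for interpolate2D: operates on a slice copied
// out of the parent array (one small allocation per call in the hot loop).
double interp_on_slice(const std::vector<double>& slice, double w) {
  return (1.0 - w) * slice[0] + w * slice[1];
}

// Slower pattern: build a temporary slice for every (lay, col) pair.
double sliced_version(const std::vector<double>& data, int nlay, int ncol, double w) {
  double sum = 0.0;
  for (int ilay = 0; ilay < nlay; ++ilay)
    for (int icol = 0; icol < ncol; ++icol) {
      std::vector<double> slice = { data[ilay*ncol + icol],
                                    data[((ilay+1) % nlay)*ncol + icol] };
      sum += interp_on_slice(slice, w);
    }
  return sum;
}

// Inlined pattern: identical arithmetic, indexing the parent array directly.
double inlined_version(const std::vector<double>& data, int nlay, int ncol, double w) {
  double sum = 0.0;
  for (int ilay = 0; ilay < nlay; ++ilay)
    for (int icol = 0; icol < ncol; ++icol) {
      double a = data[ilay*ncol + icol];
      double b = data[((ilay+1) % nlay)*ncol + icol];
      sum += (1.0 - w) * a + w * b;  // interpolation inlined, no temporary
    }
  return sum;
}
```

Both versions compute the same result; only the per-iteration temporary differs, which is the kind of overhead the inlining PR discussed here aims to remove.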
https://github.com/E3SM-Project/rte-rrtmgp/pull/21 fixes the performance for rte-rrtmgp. After inlining interpolate2D and interpolate3D, standalone tests indicate the C++ LW is now 17% more expensive than the OpenACC Fortran code and the C++ SW is now 38% more expensive than the OpenACC Fortran. This is with GNU 9.x on my laptop, and I think it will give similar results in other contexts. I'll look into SW performance later, but for now, this should be a good improvement. This PR will require the latest YAKL main branch.
I updated YAKL to main and updated radiation to use Matt's branch above. I ran an f30 case on cori-knl with GNU & Intel before this change as well as GNU & Intel after this change. I can show raw data, but with this change radiation runs about 1.4x faster than before. However, the radiation is still WAY too slow. If I compare the speed of the radiation to the speed of the dynamics, we would probably expect radiation to be faster than dynamics. But after this change, radiation is about 30x slower than dynamics (for this case on this machine).
This is my local repo location to verify the changes have been included /global/cfs/cdirs/e3sm/ndk/newyakl-se03-may6
Hmm, that's not in line with standalone results. I don't see a scream branch or fork with your github username. Can you push the branch you created to run that test and mention it here?
I will have to create one. But note you can run essentially the same thing with something like create_test SMS_Ln24_P64x1.ne30_ne30.F2010-SCREAMv1 (the compset name has changed in recent scream -- and of course set MPI's to be whatever will work best on the machine of choice).
I'm not sure scream is setup on Summit at the moment:
[imn@login5:~/scream/cime/scripts] 8-) ./create_test SMS_Ln24_P64x1.ne30_ne30.F2010-SCREAMv1.summit_gnu
...
Case dir: /gpfs/alpine/cli115/proj-shared/imn/e3sm_scratch/SMS_Ln24_P64x1.ne30_ne30.F2010-SCREAMv1.summit_gnu.20220523_161354_ku9cwj
Errors were:
WARNING: Found directory /ccs/home/imn/.cime but no cmake macros within, set env variable CIME_NO_CMAKE_MACRO to use deprecated config_compilers method
ERROR: Error! Could not find version 5.1 for package yaml.
I'll need to look into how to change the modules.
@mrnorman just do a "module load python" and it will build, but there's other problems that prevents it from running on Summit at the moment. see: https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3386015745/How+To+Run+SCREAMv1
Scream seems to be looking for CUDA even with a "gnu" build.
CMake Error at /autofs/nccs-svm1_sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/cmake-3.20.2-vtrkiq4jcfz6wnt5z3yiht3igee2zkk6/share/cmake-3.20/Modules/CMakeDetermineCUDACompiler.cmake:174 (message):
Failed to find nvcc.
I think I can work around it, but I'm not sure if this issue is a known one or not.
@mrnorman yea the CPU only build isn't set up at all.
I made a branch ndk/radyp which includes adding some timers and adding YAKL_PROFILE to the cori-knl build. (Just pushed a commit to add a few more radiation timers in scream.) And if I did it right, it also includes updating radiation to use the branch you noted above as well as using main of YAKL.
With that ndk/radyp on cori-knl, I tried:
SMS_Ln24_P675x1.ne30_ne30.F2010-SCREAMv1.cori-knl_gnu
GNU compiler, 11 nodes, 64x1
cori05% .
/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/SMS_Ln24_P675x1.ne30_ne30.F2010-SCREAMv1.cori-knl_gnu.20220523_132849_nkdls8
cori05% grep -i ATM_RUN run/HommeTime_stats
"CPL:ATM_RUN" - 675 675 1.620000e+04 5.787309e+04 121.149 ( 181 0) 59.528 ( 375 0)
cori05% grep EAMxx::Radiation run/HommeTime_stats
"a_i:EAMxx::Radiation::init" - 675 675 6.750000e+02 2.364617e+03 4.721 ( 627 0) 0.678 ( 177 0)
"a:EAMxx::Radiation::run" - 675 675 1.620000e+04 3.590516e+04 85.107 ( 177 0) 32.390 ( 361 0)
Note I think this ~85 seconds is low, as there is a large MPI imbalance for radiation, but you can see it's ~70% of ATM.
And here is a key YAKL timer in the gas optics:
cori05% zgrep "gop1 comp" run/e3sm.log.59324600.220523-144544.gz
0: gop1 compute_gas_taus 24 1.372411e+01 5.690169e-01 6.043691e-01
I ran a test on Summit with vanilla E3SM, and I'm seeing rrtmgpxx as 17% more expensive than rrtmgp. Reproducer on E3SM current master (git hash 31b474f) with the latest YAKL branch and the mrnorman/improve-cxx-performance branch on E3SM's fork of rte-rrtmgp:
./create_test SMS_Ld1_P42x1.ne4pg2_ne4pg2.F2010-CICE.summit_gnu.eam-rrtmgp
./create_test SMS_Ld1_P42x1.ne4pg2_ne4pg2.F2010-CICE.summit_gnu.eam-rrtmgpxx
Timers from timing/model_timing_stats:
rrtmgp : "a:radiation" - 42 42 1.050000e+03 1.954533e+02 5.019 (2 0) 2.821 (20 0)
rrtmgpxx: "a:radiation" - 42 42 1.050000e+03 2.291413e+02 6.048 (2 0) 3.353 (20 0)
I'm about to run the larger ne30np4 case to see how the larger per-node workload affects things.
I don't think there is a problem with radiation performance in those types of tests. So it might be something peculiar with how scream is interfacing with radiation.
In the past, I also saw a slight slowdown with rrtmgpxx compared to rrtmgp -- so 17% sounds reasonable. But I suspect the time in radiation is smaller than the time in dynamics?
If it helps, I could reproduce the issue on chrysalis.
In this particular case, radiation consumed 20.8% of the total ATM runtime, and dynamics consumed 19.8% of the total ATM runtime.
OK, for ne30np4 on a single node, I'm seeing rrtmgpxx being 2.05x slower than rrtmgp, so it does get worse with larger per-node workloads. It's still a far cry from the 20-30x Noel's seeing on KNL, but it is concerning. I can't imagine why the per-node workload would affect things at the moment. If anyone has ideas, let me know.
PCOLS is 16 in both cases, so no individual call should have more work in the ne30np4 case. I confirmed in the rrtmgpxx ne30 case that only one pool per MPI task is created.
@whannah1 @brhillman do you think it's possible that ne30np4 will have different physical properties on the whole that would lead to certain aspects of rrtmgp running more than we see at ne4pg2? E.g., more aerosols, more clouds or something like that?
More detailed timers for rrtmgpxx SW in the ne30 run:
Called Wallclock max min UTR Overhead
"a:radiation" - 219 - 27.803162 0.434538 0.000318 0.000026
"a:radheat_tend" - 219 - 0.001510 0.000017 0.000005 0.000026
"a:rad_heating_rate" - 219 - 0.028892 0.000201 0.000110 0.000026
"a:rrtmgp_check_temperatures" - 73 - 0.000784 0.000013 0.000009 0.000009
"a:rad_cld_optics_sw" - 73 - 0.741915 0.014557 0.008836 0.000009
"a:rad_gas_concentrations_sw" - 146 - 0.007638 0.000065 0.000044 0.000017
"a:rad_aer_optics_sw" - 73 - 2.718852 0.043657 0.033695 0.000009
"a:rad_rrtmgp_run_sw" - 73 - 8.186311 0.142902 0.080164 0.000009
"a:rrtmgpxx_run_sw_gas_optics" - 73 - 1.884980 0.033894 0.017520 0.000009
"a:rrtmgpxx_run_sw_aerosol_optics" - 73 - 0.501275 0.008935 0.004781 0.000009
"a:rrtmgpxx_run_sw_rte_sw_clearsky" - 73 - 2.607085 0.046776 0.026114 0.000009
"a:rrtmgpxx_run_cloud_optics" - 73 - 0.390708 0.007257 0.003945 0.000009
"a:rrtmgpxx_run_sw_rte_sw_allsky" - 73 - 2.572721 0.046005 0.025174 0.000009
"a:rad_expand_fluxes_sw" - 73 - 0.030419 0.000625 0.000336 0.000009
"a:rad_heating_rate_sw" - 73 - 0.001403 0.000026 0.000016 0.000009
"a:rad_cld_optics_lw" - 73 - 0.647640 0.010997 0.008236 0.000009
"a:rad_aer_optics_lw" - 73 - 1.682485 0.025318 0.021413 0.000009
"a:rrtmgp_run_lw" - 73 - 12.924088 0.184283 0.154209 0.000009
I made some changes to the rte-rrtmgp branch (mostly cleanup). The timings are better, but not by too much. In standalone, I have the timings below.
I mentioned this earlier, but there are two Fortran versions: the original (faster) version, and the version refactored for efficiency in OpenACC on GPUs. The C++ code is patterned after the refactored Fortran code, so the appropriate comparison is the refactored Fortran code.
As you can see below, the original Fortran code is more than 2x faster than the refactored Fortran code. The C++ code for LW is actually faster than the refactored OpenACC Fortran code but is 2.3x slower than the original Fortran code. The C++ code for SW is 1.45x slower than the refactored OpenACC Fortran code and 2.3x slower than the original Fortran code. This is for 128 columns in standalone.
The 2.3x slowdown compared to the original Fortran code in standalone agrees pretty well with the 2x slowdown seen in ne30np4 on summit. Most of the slowdown seems to be due to the refactoring for GPU performance rather than due to using rrtmgpxx instead of rrtmgp. I'm assuming the rrtmgp option in E3SM uses the original Fortran code and not the refactored code. So I feel like the timings are now well-explained.
Regarding the 20-30x slowdown Noel is seeing, this could be either a scream configuration / integration issue or a KNL-specific issue. I don't have access to a KNL for testing, and scream doesn't run on the CPU on summit with default cime configurations, so someone else may need to look into Noel's results from here.
LW (128 cols, GNU 11)
C++ : 0.1898857
Fortran (original) : 0.0840758756
Fortran (OpenACC refactored): 0.191884473
SW (128 cols, GNU 11)
C++ : 0.2358371
Fortran (original) : 0.100466847
Fortran (OpenACC refactored): 0.162026659
@ndkeen , you may want to try setting the shell environment variable GATOR_DISABLE to the value 1 in the config_machines.xml file and see how it affects performance on KNL for your large per-node workload with scream. The effect of turning off the pool allocator on the CPU is typically negative for performance, but it's worth looking at just to see.
Another observation is that Cori's documentation shows the KNL nodes have "96 GB (DDR4), 16 GB (MCDRAM)". If the "MCDRAM" is the KNL's high-bandwidth memory, then we have a problem running 64 MPI tasks on a single node. Each MPI task will create its own memory pool of 1GB at yakl::init(), meaning 64GB is allocated from the start. This might mean the C++ code is thrashing memory to slower DDR4 because of the pool allocator. So turning it off might be the right choice for that architecture.
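For intuition on what the pool buys, here is a toy bump-pointer arena: temporaries are carved out of one block reserved up front and "freed" by resetting an offset, so per-kernel malloc/free disappears, at the cost of holding the whole pool from the start (exactly the concern with 64 one-GB pools per node). This is only a sketch of the idea; YAKL's Gator allocator is more sophisticated (it tracks individual frees).

```cpp
#include <cstddef>
#include <vector>

// Toy bump-pointer pool: one up-front reservation, cheap allocations.
class Pool {
  std::vector<unsigned char> block;  // reserved up front, like the 1 GB pool
  std::size_t offset = 0;
public:
  explicit Pool(std::size_t bytes) : block(bytes) {}
  void* allocate(std::size_t bytes) {
    if (offset + bytes > block.size()) return nullptr;  // pool exhausted
    void* p = block.data() + offset;
    offset += bytes;
    return p;  // no malloc: just a pointer bump
  }
  void reset() { offset = 0; }          // release everything at once
  std::size_t used() const { return offset; }
};
```

The trade-off discussed above falls out directly: allocations are nearly free once the block exists, but the block's full footprint is paid whether or not the temporaries ever use it.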
This very slow behavior is present on the AMD's as well. I assume it will be the case on any CPU.
I tried with GATOR_DISABLE=1 and the performance was slightly worse.
I suppose that it's a scream-specific issue then.
I was also assuming it was specific to scream, or at least to the way scream is calling the radiation.
One thing I noticed is that I'm not getting the YAKL timer output when I run on GPU's. I'm building with -DYAKL_PROFILE -DHAVE_MPI -- it does work on CPU.
Did you set those flags in the CMake variable YAKL_CUDA_FLAGS?
https://github.com/mrnorman/YAKL/wiki/Using-and-Compiling-with-YAKL#table-of-backend-options
Hmm, I was adding those flags to the C++ compiler. We aren't doing anything with YAKL_CUDA_FLAGS.
The documentation is pretty detailed regarding CMake integration of YAKL and codes that use it. It's highly recommended to follow it. Here's an example from E3SM: https://github.com/E3SM-Project/E3SM/blob/master/components/cmake/build_model.cmake#L68 https://github.com/E3SM-Project/E3SM/blob/master/components/eam/src/physics/rrtmgp/cpp/CMakeLists.txt
Given the challenges we're having with CIME not using the options we think it is, I wonder whether the "SCREAM 20x slower/standalone not" is due to problems setting flags...
I am just guessing here, but in components/eam/src/physics/rrtmgp/external, a global grep yields ~76 places in C++ where a YAKL_LAMBDA is used. And there are only two locations that have YAKL_DEVICE_LAMBDA, including in the routine I've narrowed down as being suspiciously slow on CPU (gas_optical_depths_minor).
./cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp:241: parallel_for( Bounds<3>(nlay,ncol,{0,max_gpt_diff}) , YAKL_DEVICE_LAMBDA (int ilay, int icol, int igpt0) {
./cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp:28: parallel_for( Bounds<3>({2,ngpt},nlev,ncol) , YAKL_DEVICE_LAMBDA (int igpt, int ilev, int icol) {
I believe the YAKL_DEVICE_LAMBDA is needed in kernels that use atomic reductions.
That has changed with the new YAKL btw. But either way, as mentioned earlier, there is no difference between YAKL_LAMBDA and YAKL_DEVICE_LAMBDA on the CPU, nor do atomic operations actually do anything different on the CPU:
https://github.com/mrnorman/YAKL/blob/main/src/YAKL_defines.h#L59
https://github.com/mrnorman/YAKL/blob/main/src/YAKL_atomics.h#L11
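For context on why the distinction vanishes on the CPU: in a serial backend there is only one thread of execution, so an atomic update degenerates to a plain read-modify-write. A minimal sketch (my own illustration, not YAKL's actual macros):

```cpp
// In a serial CPU backend there are no concurrent writers, so an
// "atomicAdd" can safely be a plain +=. This mirrors why YAKL_LAMBDA and
// YAKL_DEVICE_LAMBDA, and atomics, compile to the same thing on a serial
// CPU build: the synchronization is only needed on parallel backends.
inline void atomic_add_serial(double& target, double value) {
  target += value;   // no synchronization needed in serial execution
}

double reduce_serial(const double* data, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) atomic_add_serial(sum, data[i]);
  return sum;
}
```

So on a serial CPU build, any cost difference between the two lambda macros or between atomic and plain updates should be zero by construction.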
I'm going to close this issue, as the radiation for SCREAM cases on CPU's is no longer such a large percent of the total time. I had wanted to show some measurements to make things more transparent, but there are now several factors and it's better to move forward.
I think one of the largest issues affecting why I originally thought it was slow was a combination of two things: a) I had been unable to use more MPI's per node, as there was (and still is) an issue with the slingshot network that we have a work-around for, and b) a known issue of RRTMGPXX not making use of threads, which meant that other parts of the ATM were seeing a benefit from threading while the radiation was not.
For a ne30 case on pm-cpu with 43 nodes (F2010-SCREAMv1-noAero.ne30_ne30), the radiation is about 47% of the ATM run time, which may be expected. This case used 128 MPI's per node and no threading. We also know that the radiation in particular has poor load balancing, as we've not implemented some of the techniques used in vanilla E3SM to load balance over columns.
Noting this is a general issue. It's not clear if this is simply a performance issue; there may be something else wrong (and it may only be on CPU). Additionally, it does make it a little more difficult to test on cori as the model runs so slowly. I can provide details, but the slowness is ~20x or more what we might expect (radiation is ~90% of the total time). The radiation time does respond to using more MPI's, and there is a known issue with using threads (only using 1 in these cases). Verified I still see this with master of April 27th.