oksanaguba closed this issue 4 months ago.
I have copied the aforementioned examples to http://users.nccs.gov/~sarat/e3sm/hip-cuda-examples.tgz for easy retrieval.
The raw test is very sensitive to compiler optimizations. Starting with the original codes basic2-flexibleType.cu and basic2-flexibleType.hip, I see results similar to @oksanaguba's on Summit and Spock.
Summit:
Timer local : start 1170216432, stop 1170216912, diff 8.8431682586669921875000
Timer with comm : start 1169941024, stop 1170215952, diff 10.4978237152099609375000
Spock:
Timer local : start 7842016, stop 7842208, diff 11.3833341598510742187500
Timer with comm : start 7841632, stop 7841824, diff 16.1143627166748046875000
Now consider the following small code change to both versions, .cu and .hip: adding the __restrict__ qualifier to the y argument.
< void saxpy(int n, real a, real *x, real *y)
---
> void saxpy(int n, real a, real *x, real *__restrict__ y)
Summit performance improves by 1.3x, while Spock improves by almost 20x.
Summit:
Timer local : start 824690560, stop 824691040, diff 6.7112321853637695312500
Timer with comm : start 824435232, stop 824690080, diff 8.3913602828979492187500
Spock:
Timer local : start 7842016, stop 7842208, diff 0.5828769803047180175781
Timer with comm : start 7841632, stop 7841824, diff 4.1423788070678710937500
What's going on? Consider the kernel being timed.
__global__ void saxpy(int n, real a, real *x, real *__restrict__ y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  //if (i < n) y[i] = a*x[i] + y[i];
  if (i < n) {
    for (int ll = 0; ll < 1000; ll++) {
      y[i] = a*cos(x[i]) + y[i];
    }
  }
}
Note that the term a*cos(x[i]) is constant across the loop over ll. With the __restrict__ on y, the compiler knows that it doesn't overlap with x. And, since x is never written, the compiler is free to hoist a*cos(x[i]) out of the ll loop. I suspect that hipcc (clang) figured this out for the Hip case, but nvcc did not for Cuda. There may be compiler options or directives for nvcc that could help.
I saw a similar situation a generation ago on the Cray X1. In that case, however, the compiler replaced the whole loop with a single multiplication, like this (but in Fortran, of course):
y[i] += 1000*a*cos(x[i]);
Another optimization with a disproportionate effect is switching from malloc to cudaMallocHost/hipHostMalloc. Here are results with this added on top of the existing __restrict__ optimization: minor speedup for Summit, 2x speedup for Spock.
Summit:
Timer local : start 959976960, stop 959977440, diff 6.8231678009033203125000
Timer with comm : start 957407776, stop 959976480, diff 7.3896641731262207031250
Spock:
Timer local : start 7844880, stop 7845072, diff 0.2550370097160339355469
Timer with comm : start 7844496, stop 7844688, diff 2.8379058837890625000000
Here are a couple of possible takeaways for using Kokkos with a Hip backend on AMD GPUs: a Restrict memory trait? HIPHostPinnedSpace or a HostMirror?

Thanks @twhite-cray for the rapid turnaround! Given the myriad constraints compilers work with, restrict definitely helps inform the application of optimizations for non-overlapping memory regions.
The impact of hipHostMalloc over malloc is interesting (pinned memory). Does hipHostMalloc support aligned allocations as well?
We are not using unified memory, but as I recall, that style was (is?) very sub-optimal on AMD GPUs.
Thanks @twhite-cray !
Is there an easy way we could determine whether either of these two issues is impacting our Kokkos code? I don't know enough to say whether the restrict keyword would apply to Kokkos, since I assume Kokkos will be generating all the code related to these loops.
@twhite-cray: one of our Sandia developers answered my question above. The restrict keyword is enabling the AMD compiler to move the cos() out of the loop (and thus defeating the purpose of this simple test). He thought the malloc() was also a red herring: that's just a host-side array that is somehow messing with the timing of the kernel. So these two issues probably won't help our Kokkos code.
Would you be able to look at the kokkos example also included above? It will probably be more challenging...
I had some success improving the performance of the Kokkos examples. They both have significant re-use of values that we can take advantage of by using Kokkos::TeamPolicy and shared memory (AKA LDS) to manually cache values within a team (AKA Cuda threadblock or Hip workgroup).
Assume the following definitions.
using Team = ko::TeamPolicy<>;
using Player = Team::member_type;
We can tune the F1 functor by loading values into shared memory for re-use and by computing divisions outside of the loop. Here is the resulting F1::operator().
KOKKOS_INLINE_FUNCTION void operator() (const Player &me) const {
  const int iLo = me.league_rank()*me.team_size();
  const int jLo = iLo-m;
  const int dk = me.team_size();
  const int nk = dk+m+m+1;
  Real *const at = reinterpret_cast<Real*>(me.team_shmem().get_shmem(nk*sizeof(Real)));
  const int kLo = me.team_rank();
  const Real divM = Real(1)/Real(m);
  for (int k = kLo; k < nk; k += dk) {
    const int j = jLo+k;
    const int jp = (j + n) % n;
    const Real f = 1 - Real(std::abs(j))*divM;
    at[k] = a(jp)*f;
  }
  me.team_barrier();
  const int i = iLo+me.team_rank();
  if (i < n) {
    Real accum = 0;
    const int kHi = kLo+m+m+1;
    for (int k = kLo; k < kHi; k++) accum += at[k];
    b(i) = accum;
  }
}
To get the greatest re-use, we make the team size as large as possible. F1::run() becomes the following.
void run () {
  constexpr int teamSize = 1024;
  ko::parallel_for(Team((n+teamSize-1)/teamSize,teamSize).set_scratch_size(0, ko::PerTeam((teamSize+m+m+1)*sizeof(Real))), *this);
}
Here is a plot of times for the various implementations for different values of N, where M = 256. The tuned version is much faster on both V100 and MI100, and MI100 is faster than V100 for larger values of N.
We can apply similar techniques to F1nodata::operator() to re-use the sines and divisions.
KOKKOS_INLINE_FUNCTION void operator() (const Player &me, Real& acc) const {
  const int iLo = me.league_rank()*me.team_size();
  const int jLo = iLo-m;
  const int dk = me.team_size();
  const int nk = dk+m+m+1;
  const Real divM = Real(1)/Real(m);
  const Real pi2dn = Real(2)*Real(M_PI)/Real(n);
  Real *const a = reinterpret_cast<Real*>(me.team_shmem().get_shmem(nk*sizeof(Real)));
  const int kLo = me.team_rank();
  for (int k = kLo; k < nk; k += dk) {
    const int j = jLo+k;
    const Real f = Real(1)-Real(std::abs(j))*divM;
    a[k] = std::sin(Real(j)*pi2dn)*f;
  }
  me.team_barrier();
  const int i = iLo+me.team_rank();
  Real accum = 0;
  if (i < n) {
    const int kHi = kLo+m+m+1;
    for (int k = kLo; k < kHi; k++) accum += a[k];
    acc += accum;
  }
}
void run () {
  Real acc = 0;
  constexpr int teamSize = 1024;
  ko::parallel_reduce(Team((n+teamSize-1)/teamSize,teamSize).set_scratch_size(0, ko::PerTeam((teamSize+m+m+1)*sizeof(Real))), *this, acc);
  accum = acc;
}
Here is a plot of times for the various implementations for different values of N, where M = 256. Again, the tuned version is faster on both V100 and MI100, and MI100 is a hair faster than V100 for larger values of N. In addition, the tuned F1nodata is almost as fast as the tuned F1 even though the former performs a global reduction.
If you have similar re-use of computations in E3SM, Kokkos teams and shared memory could be an effective tuning strategy.
Since the original example was too easy to optimize by moving the cos out of the loop, I changed it to the following (using restrict for x):
__global__
void saxpy(int n, real a, real * __restrict__ x, real * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  //if (i < n) y[i] = a*x[i] + y[i];
  if (i < n) {
    for (int ll = 0; ll < 1000; ll++) {
      y[i] = a*cos(x[i] + y[i] + ll);
    }
  }
}
Summit timers
Timer local : start 749058336, stop 749058816, diff 8.8207683563232421875000
Timer with comm : start 744250800, stop 744666064, diff 10.3797435760498046875000
Spock timers
Timer local : start 7837088, stop 7837280, diff 11.3655920028686523437500
Timer with comm : start 7836704, stop 7836896, diff 15.3295888900756835937500
So, restrict does not resolve the issue that Spock is slower when cosine cannot be optimized out of the loop?
One thing that I'm seeing is that the best tuning strategies can be very sensitive to detailed specifics of the example. I wonder if "real" E3SM code has very different characteristics in terms of unrolling and data re-use than the thousand-cosine loop above. @oksanaguba, could you point me to the E3SM code that you are trying to represent?
@twhite-cray thanks -- I am cleaning my branch and preparing scripts for you to use, will let you know as soon as I am done.
CLONE info:
E3SM repo: git@github.com:E3SM-Project/E3SM.git
branch oksanaguba/homme/spock
COMMIT:
commit 2f173cf4e0adb7e3abf846b3804518bead7b2c00 (HEAD -> oksanaguba/homme/spock, origin/oksanaguba/homme/spock)
Author: Oksana Guba <oksanaguba@gmail.com>
Date: Wed Nov 10 17:31:54 2021 -0500
set the last pair for hip manuall
Update submodules, too:
git submodule sync; git submodule update --recursive --init
FOR SPOCK:
myfol=/gpfs/alpine/cli115/scratch/onguba/for-support-nov09.2021/spock
ufol=your-build-spock
mkdir $ufol
cd $ufol
cp $myfol/build-hipcc.sh .
# in the build script, modify path to your clone
# like this:
# HOMME=${YOUR_CLONE}/components/homme
# where YOUR_CLONE is the root folder of the E3SM clone
source $myfol/load-spock-modules
source build-hipcc.sh
mkdir run1
cd run1
mkdir movies
cp $myfol/run1/acme* .
cp $myfol/xx*nl .
cp $myfol/run.sh .
source run.sh
grep prim_main time*
EXAMPLE OUTPUT SPOCK:
grep prim_main *
time-preqx-xx-ne10-nmax500.298603:"prim_main_loop" - 1 1 1.000000e+00 6.152305e+01 61.523 ( 0 0) 61.523 ( 0 0)
which means the main part of the simulation took 62 seconds.
FOR SUMMIT:
myfol=/gpfs/alpine/cli115/scratch/onguba/for-support-nov09.2021/summit
build script is now build-summit.sh , otherwise instructions are the same.
EXAMPLE OUTPUT SUMMIT:
time-preqx-xx-ne10-nmax500.1616177:"prim_main_loop" - 1 1 1.000000e+00 2.649569e+01 26.496 ( 0 0) 26.496 ( 0 0)
NOTE that for Spock build I have to use an external build outside of E3SM submodules, but it seems that permissions are ok.
UPDATE:
Someone pointed out to me that I was missing -DNDEBUG in the flags (other flags were suggested, too, but those did not make a difference). This flag shrank the timer 2x:
time-preqx-xx-ne10-nmax500.298603:"prim_main_loop" - 1 1 1.000000e+00 6.150645e+01 61.506 ( 0 0) 61.506 ( 0 0)
time-preqx-xx-ne10-nmax500.298621:"prim_main_loop" - 1 1 1.000000e+00 3.113658e+01 31.137 ( 0 0) 31.137 ( 0 0)
I pushed the change to the branch.
@twhite-cray -- if you have already read the original comment above, please, see the update at the end of it. It seems to be resolved. thanks!
Spock vs Summit, efficiency metric.
Calls to assert in Hip kernels can be very expensive, even if they are always true. They turn on printf support in the kernels, which uses up lots of registers and may cause register spilling. I think -DNDEBUG turns off all the assert calls.
@oksanaguba, I'm not in the project CLI115, so I can't read $myfol. Could you copy the directory into /gpfs/alpine/cli115/world-shared/onguba?
For the runs on your plot, does ne10 mean NP=10? And what was the number of vertical levels, nlev?
Thanks!
I copied the folder to the one you suggested. Builds are probably ruined, but I could re-do that. The HIP Kokkos build is in my home folder, and it seems that it is readable (unless I missed something).
ne10 means 10*10*6 finite elements per mesh (this is a cubed-sphere mesh with 6 sides and 10 elements per edge). NP=4 is the number of GLL points per element's edge (so, 16 GLL points per element). We usually do not change NP. GLL points are 'repeated', so neighbor elements have their own edge GLLs, which are then synchronized.
In these runs, number of vertical levels is 72.
thanks
@oksanaguba, I was able to reproduce the 31s run on Spock. Thanks!
I want to test some code optimizations. In preparation for testing, I tried inserting a bug (commented out compute_qtens(kv) in run_tracer_phase), but it didn't appear to change any of the output, neither standard out nor mass.out.
Is there a change that I can make to the test so that it prints out results that I can compare for correctness?
This will slow down the run but it will print state of the model much more often. Change
statefreq=999999
disable_diagnostics = .true.
in namelists to
statefreq=1
disable_diagnostics = .false.
You should be able to see much more output, let me know if not.
Regarding correctness in general, this may require more extensive testing, with netcdf output and eyeball norms. When you have suggestions, we can re-run and look.
That is, if you introduce non-BFB (not bit-for-bit) changes via optimization, there is no easy way to tell that the code remains correct (with some exceptions for the conservation quantities that the model prints out).
If you introduce BFB changes and the output remains identical, it is still not guaranteed that the code is identical. What we do then is run the 2 codes with netcdf output and compare.
^^ updated my comment
@oksanaguba I'd like to start investigating the performance differences between V100 and MI250x. It appears that the directories you previously made available (in /gpfs/alpine/cli115/world-shared/onguba/for-support-nov09.2021/) might have been scrubbed. I do not see any build-hipcc.sh, run.sh, etc. Would it be possible for you to re-populate this directory?
@mattdturner thanks, yes, the folder is gone. I reran everything in folder
/gpfs/alpine/cli115/world-shared/onguba/for-support-apr2022
Note that my Kokkos build is no longer available, but you could build using the Kokkos build script that is in the same folder. My Kokkos repo's commit is outdated; I do not think it matters much.
I just ran my new hommexx build; the timers are comparable except for the high-workload case, ne20 (which is 20*20*6 finite elements), where the performance degraded. I did it quickly, so I might have missed something, or my binding options are not valid anymore.
Also, use the most recent commit for the branch oksanaguba/homme/spock.
So with ne20 there's no performance degradation? Does the MI100 start to outperform the V100 if you try a larger problem?
There is degradation for the ne20 run today on Spock compared to a run from February on Spock. For hommexx so far, MI100 is slower than V100 in all results, both from today and from February. Btw, ne20 does not fit on a V100; it is too big.
In detail, the timers today are [5 7 8 14 32 71 139] for problem sizes [ne2 ne3 ne4 ne6 ne10 ne15 ne20], while before they were [5, 7, 8, 14, 31, 70, 121].
@twhite-cray if I remember correctly, you asked me to re-create the homme benchmark runs on Crusher? If so, updated instructions are below.
The folder is /gpfs/alpine/cli115/world-shared/onguba/for-support-june2022-crusher (FOLDER).
The branch is oksanaguba/homme/spock.
The commit there is
commit 41f0c42e4550e8d1782153a650659c38e21846f6 (HEAD -> oksanaguba/homme/spock, origin/oksanaguba/homme/spock)
Kokkos repo: branch develop, commit 24b3eb42ba72cc4ea4dae441916dd6db8dcb8613.
Modules used for both the kokkos build and the homme build are in FOLDER, file load-crusher-modules.
How to build kokkos: in FOLDER, look for the file build-kokkos/build*sh.
How then to build homme: in FOLDER, look for the file build-hipcc-preqx.sh; there, change the HOMME variable to point to your clone, and possibly point to your own kokkos install via the variable E3SM_KOKKOS_PATH.
After building homme: make a directory run2, mkdir run2/movies, copy the acme* files from FOLDER/run1 into run2, and copy the FOLDER/run.sh script too; if needed, point it to your build of homme and to the file with the proper modules, then submit run.sh.
run.sh uses 2 GCDs, 2 ranks. Please let me know if there are questions.
@oksanaguba, yes, thanks! I won't be able to get to it this week, and I'm out of the office all next week, so I copied your for-support-june2022-crusher to HPSS. I hope to take a deeper look at this early in July.
@twhite-cray noticed that two items were missing from the instructions: rename libkokkoscore.a to libkokkos.a, and copy run1/xxinput.nl to run2.
I made some progress looking into the performance of this case. A rocprof run of ne20 with a single MPI task gave the following top kernels.
Kernel | Calls | Total time (s) | % of all kernel time |
---|---|---|---|
AALTracerPhase | 1503 | 31.6 | 21.7 |
CaarFunctorImpl | 2505 | 17.5 | 12.0 |
BIHPreNup | 501 | 15.2 | 10.5 |
BIHPostConstHV | 501 | 14.5 | 10.0 |
compute_qmin_qmax | 1503 | 12.9 | 8.9 |
I performed experiments reimplementing the top kernel, AALTracerPhase. First, I tried to use a Kokkos::TeamPolicy of size NP*NP, with the hope that I could use set_chunk_size to block it into 4 or 8 vertical levels. The commit of Kokkos that this version of Hommexx uses doesn't appear to implement blocking, however. Despite setting the chunk size, I always get workgroups of size 1x16x1, not the desired 1x16x4 or 1x16x8.
So I tried a TeamPolicy of size warpSize, and I partitioned it into TEAM_LEV*NP*NP threads, where TEAM_LEV = warpSize/(NP*NP). For AMD GPUs like those on Crusher, TEAM_LEV = 4. I then launched the kernel with ne*nq*NUM_LEV/TEAM_LEV workgroups.
This configuration allows the following optimizations.
The following charts compare the performance of the original AALTracerPhase kernel with my implementation using the new optimizations, all using a single MI250X GCD on Crusher. "Level major" means the threads in the team have the level as the fastest dimension. "Point major" means the threads in the team have the level as the slowest dimension.
These optimizations show a potential for a 2x performance improvement in this kernel. You can find the code here: https://github.com/twhite-cray/E3SM/tree/twhite-cray/hommmexx-preqx/crusher
My implementation has some issues regarding production use. It needs a TeamPolicy with working chunking, which may be available in newer versions of Kokkos. Also, the faster "Level major" implementation performs its reduction with the chunk as stride, which I don't think is supported by Kokkos hierarchical parallelism.

I looked at the next kernel on the list, CaarFunctorImpl. It does multiple scan reductions in the vertical, along with horizontal gradient computations. Here are a couple of ideas for TeamPolicy optimizations. Since NP*NP*NUM_LEV = 1152 > 1024, use NP*NP*NUM_LEV/2 = 576; each thread is thus responsible for two points. Local shared memory can help with both vertical scans and horizontal gradients.

I'd like to discuss the performance results and code prototype from my previous comment before proceeding.
Good news: I was able to run this Hommexx test case with Kokkos 3.7. Bad news: performance was the same, and Kokkos::TeamPolicy::set_chunk_size still does nothing. I checked the Kokkos Slack, and it confirmed that this is the current expected behavior.
I got an error when I ran with limiter_option = 0, so I compiled and ran AALTracerPhase and the point-major version of my replacement with the limiter code eliminated at compile time using #if 0. Here are the resulting kernel times.
The limiter is not as significant as I thought, but it is a larger fraction of the optimized time. In one of my previous experiments, I did see a significantly higher runtime from a "bad" implementation of the limiter on my part. I guess that led to my prejudice against it.
I was able to reproduce the performance of the "point major" implementation using only Kokkos. The strategy is to use "team" parallelism across a block of vertical levels and "vector" parallelism across each element. The limiter uses parallel_reduce with ThreadVectorRange for the sums across element points. https://github.com/twhite-cray/E3SM/commit/ee749b3b6ec13bbb700d3d9f5aa6d9823845210c
I think these changes should be portable across native GPU implementations of Kokkos, but they may not work with OpenMP or OpenMP offload, because they rely on all "vector" threads of the team having independent values for all local variables.
@twhite-cray -- thanks! Could I please reiterate some things we have talked about:
-- Is it correct that the current impl of the limiter 8 is not good for HIP, and therefore your changes are needed?
-- Is it correct that the limiter 8 does not take too much time overall wrt the total timer? (I haven't tried that run myself yet.)
-- Is there an insight, from what you have done for the limiter, about why the rest of the code is not as performant on HIP?
We (or I) can look again at which parts of the code are particularly slow on Crusher compared to Summit. Unrelated to your improvements, I still need to try 2 more suggestions about homme HIP performance.
thanks again!
-- Is it correct that the current impl of the limiter 8 is not good for HIP, therefore, your changes are needed? -- Is it correct that the limiter 8 does not take too much time overall wrt the total timer? (i haven't tried that run myself yet).
The limiter code itself does not appear to be the cause of the relative slowdown between the existing code and mine. The plot above shows that little time was gained by turning it off.
The performance improvement likely comes from a combination of the following: using NUM_LEV/4 threads per element-tracer pair (for a total of elements x tracers x NUM_LEV/4 threads), and limiting what is copied to each workgroup from *this.

-- Is there an insight, from what you have done for the limiter, about why the rest of the code is not as performant on HIP?
I've been looking at CaarFunctorImpl, and I think it could benefit from a different TeamPolicy, with teams of 16 with warpSize vectors (32 in Cuda, 64 in Hip). The idea is to maximize the parallelism in the vertical dimension while still allowing fast Kokkos scan operations. Other kernels that rely on scans in the vertical may benefit from this as well.
AALTracerPhase has no vertical scans, so it didn't need such a change. The sums in the limiter favored Kokkos vectors in the horizontal instead.
Here's an update on optimizations for CaarFunctorImpl. The code changes start here:
https://github.com/twhite-cray/E3SM/blob/babeafcd22250d881cb015c71a3f5a5a06538165/components/homme/src/preqx_kokkos/cxx/CaarFunctorImpl.hpp#L217
First, I refactored the kernel into a single lambda, mostly so I could see what is going on in the code. This refactoring included the following optimization strategies. The refactored kernel assumes rsplit != 0 and n0_qdp < 0; if either of these conditions is false, the original kernel is used. The lambda also replaces operator() and its implicit copy of *this; the goal was to limit the data structures copied to each workgroup, instead of always copying the entire CaarFunctorImpl object. I don't know if this optimization was important for performance, but it did help me to see dependences within the code.

The above summarizes the initial optimizations implemented before I did any performance comparisons. Then I performed a series of refinements to reduce the use of buffers in global memory, by moving them to Kokkos scratch or local variables. The move of global variables to local is an important optimization for GPUs that is simplified by the monolithic-kernel approach. Attempts to make kernels more modular through function calls should carefully consider how to enable the use of scratch and local variables.
Here are timings comparing the original kernel functor for CaarFunctorImpl ("Original"), my initial refactored kernel ("Initial"), and my current refactored kernel ("Current"). The times are from the GPTL "caar compute" timer for a single-GCD ne20 run on Crusher.
One or more additional conversions of global buffers to scratch may be possible, but Crusher is undergoing maintenance today, and I wanted to go ahead and update my progress.
We noticed that on spock we cannot achieve expected on-node performance using our code, HOMME, compared with summit on-node performance.
We prepared a few simple examples which also do not show good performance on Spock. The examples use cuda/hip code or the kokkos library. Below are sample outputs from the examples, with the timers of interest in bold. The kokkos examples have a timer for the case without access to device memory ('f1nd') and with it ('f1'). Below we also describe where the examples are located on Summit and how they were built.
SUMMIT raw cuda:
N is 1048576 Size of Real: 8
Timer local : start 774704320, stop 774704800, diff 8.8184003829956054687500
Timer with comm : start 774528432, stop 774703840, diff 10.3657598495483398437500

SPOCK raw hip:
N is 1048576 Size of Real: 8
Timer local : start 7837664, stop 7837856, diff 11.8738842010498046875000
Timer with comm : start 7837280, stop 7837472, diff 15.8701896667480468750000

SUMMIT kokkos:
n 128000 m 256
f1 : 3.709793e-04 F1 check: 5.2252371674778481e+09
f1nd: 1.130819e-03 F1 check: 5.2252371674778481e+09

SPOCK kokkos:
n 128000 m 256
f1 : 8.769035e-04 F1 check: 5.2252371674778481e+09
f1nd: 1.976013e-03 F1 check: 5.2252371674778481e+09
Examples are located in my home folder:
Raw cuda/hip examples have output of the used modules, the env, and a build file (makeme).
Kokkos examples, using the same environment, contain a kokkos build folder with a script to build (the script is kokkosbld/makescript.sh, which should be followed by 'make install') and instructions on how to build in the readme file.