oksanaguba closed this issue 4 months ago.
I have copied the aforementioned examples to http://users.nccs.gov/~sarat/e3sm/hip-cuda-examples.tgz for easy retrieval.
The raw test is very sensitive to compiler optimizations. Starting with the original codes basic2-flexibleType.cu and basic2-flexibleType.hip, I see results similar to @oksanaguba's on Summit and Spock.
Summit:
Timer local : start 1170216432, stop 1170216912, diff 8.8431682586669921875000
Timer with comm : start 1169941024, stop 1170215952, diff 10.4978237152099609375000
Spock:
Timer local : start 7842016, stop 7842208, diff 11.3833341598510742187500
Timer with comm : start 7841632, stop 7841824, diff 16.1143627166748046875000
Now consider the following small code change to both versions, .cu and .hip: adding the __restrict__ qualifier to the y argument.
< void saxpy(int n, real a, real *x, real *y)
---
> void saxpy(int n, real a, real *x, real *__restrict__ y)
Summit performance improves by 1.3x, while Spock improves by almost 20x.
Summit:
Timer local : start 824690560, stop 824691040, diff 6.7112321853637695312500
Timer with comm : start 824435232, stop 824690080, diff 8.3913602828979492187500
Spock:
Timer local : start 7842016, stop 7842208, diff 0.5828769803047180175781
Timer with comm : start 7841632, stop 7841824, diff 4.1423788070678710937500
What's going on? Consider the kernel being timed.
__global__ void saxpy(int n, real a, real *x, real *__restrict__ y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  //if (i < n) y[i] = a*x[i] + y[i];
  if (i < n) {
    for (int ll = 0; ll < 1000; ll++) {
      y[i] = a*cos(x[i]) + y[i];
    }
  }
}
Note that the term a*cos(x[i]) is constant across the loop over ll. With the __restrict__ on y, the compiler knows that it doesn't overlap with x. And, since x is never written, the compiler is free to hoist a*cos(x[i]) out of the ll loop. I suspect that hipcc (clang) figured this out for the Hip case, but nvcc did not for Cuda. There may be compiler options or directives for nvcc that could help.
I saw a similar situation a generation ago on the Cray X1. In that case, however, the compiler replaced the whole loop with a single multiplication, like this (but in Fortran, of course):
y[i] += 1000*a*cos(x[i]);
Another optimization with a disproportionate effect is switching from malloc to cudaMallocHost/hipHostMalloc. Here are results with this added on top of the existing __restrict__ optimization: minor speedup for Summit, 2x speedup for Spock.
Summit:
Timer local : start 959976960, stop 959977440, diff 6.8231678009033203125000
Timer with comm : start 957407776, stop 959976480, diff 7.3896641731262207031250
Spock:
Timer local : start 7844880, stop 7845072, diff 0.2550370097160339355469
Timer with comm : start 7844496, stop 7844688, diff 2.8379058837890625000000
Here are a couple of possible takeaways for using Kokkos with a Hip backend on AMD GPUs: a Restrict memory trait? HIPHostPinnedSpace or a HostMirror?

Thanks @twhite-cray for the rapid turnaround! Given the myriad constraints compilers work with, restrict definitely helps inform the application of optimizations for non-overlapping memory regions.
The impact of hipHostMalloc over malloc is interesting (pinned memory). Does hipHostMalloc support aligned allocations as well?
We are not using unified memory, but as I recall, that style was (is?) very sub-optimal on AMD GPUs.
Thanks @twhite-cray !
Is there an easy way we could determine whether either of these two issues is impacting our Kokkos code? I don't know enough to say whether the restrict keyword would apply to Kokkos, since I assume Kokkos will be generating all the code related to these loops.
@twhite-cray: one of our Sandia developers answered my question above. The restrict keyword is enabling the AMD compiler to move the cos() out of the loop (and thus defeating the purpose of this simple test). He thought the malloc() was also a red herring: that's just a host-side array that is somehow messing with the timing of the kernel. So these two issues probably won't help our Kokkos code.
Would you be able to look at the kokkos example also included above? It will probably be more challenging...
I had some success improving the performance of the Kokkos examples. They both have significant re-use of values that we can take advantage of by using Kokkos::TeamPolicy and shared memory (AKA LDS) to manually cache values within a team (AKA Cuda threadblock or Hip workgroup).
Assume the following definitions.
using Team = ko::TeamPolicy<>;
using Player = Team::member_type;
We can tune the F1 functor by loading values into shared memory for re-use and by computing divisions outside of the loop. Here is the resulting F1::operator().
KOKKOS_INLINE_FUNCTION void operator() (const Player &me) const {
  const int iLo = me.league_rank()*me.team_size();
  const int jLo = iLo-m;
  const int dk = me.team_size();
  const int nk = dk+m+m+1;
  Real *const at = reinterpret_cast<Real*>(me.team_shmem().get_shmem(nk*sizeof(Real)));
  const int kLo = me.team_rank();
  const Real divM = Real(1)/Real(m);
  for (int k = kLo; k < nk; k += dk) {
    const int j = jLo+k;
    const int jp = (j + n) % n;
    const Real f = 1 - Real(std::abs(j))*divM;
    at[k] = a(jp)*f;
  }
  me.team_barrier();
  const int i = iLo+me.team_rank();
  if (i < n) {
    Real accum = 0;
    const int kHi = kLo+m+m+1;
    for (int k = kLo; k < kHi; k++) accum += at[k];
    b(i) = accum;
  }
}
To get the greatest re-use, we make the team size as large as possible. F1::run() becomes the following.
void run () {
  constexpr int teamSize = 1024;
  ko::parallel_for(Team((n+teamSize-1)/teamSize,teamSize).set_scratch_size(0, ko::PerTeam((teamSize+m+m+1)*sizeof(Real))), *this);
}
Here is a plot of times for the various implementations for different values of N, where M = 256. The tuned version is much faster on both V100 and MI100, and MI100 is faster than V100 for larger values of N.
We can apply similar techniques to F1nodata::operator() to re-use the sines and divisions.
KOKKOS_INLINE_FUNCTION void operator() (const Player &me, Real& acc) const {
  const int iLo = me.league_rank()*me.team_size();
  const int jLo = iLo-m;
  const int dk = me.team_size();
  const int nk = dk+m+m+1;
  const Real divM = Real(1)/Real(m);
  const Real pi2dn = Real(2)*Real(M_PI)/Real(n);
  Real *const a = reinterpret_cast<Real*>(me.team_shmem().get_shmem(nk*sizeof(Real)));
  const int kLo = me.team_rank();
  for (int k = kLo; k < nk; k += dk) {
    const int j = jLo+k;
    const Real f = Real(1)-Real(std::abs(j))*divM;
    a[k] = std::sin(Real(j)*pi2dn)*f;
  }
  me.team_barrier();
  const int i = iLo+me.team_rank();
  Real accum = 0;
  if (i < n) {
    const int kHi = kLo+m+m+1;
    for (int k = kLo; k < kHi; k++) accum += a[k];
    acc += accum;
  }
}
void run () {
  Real acc = 0;
  constexpr int teamSize = 1024;
  ko::parallel_reduce(Team((n+teamSize-1)/teamSize,teamSize).set_scratch_size(0, ko::PerTeam((teamSize+m+m+1)*sizeof(Real))), *this, acc);
  accum = acc;
}
Here is a plot of times for the various implementations for different values of N, where M = 256. Again, the tuned version is faster on both V100 and MI100, and MI100 is a hair faster than V100 for larger values of N. In addition, the tuned F1nodata is almost as fast as the tuned F1 even though the former performs a global reduction.
If you have similar re-use of computations in E3SM, Kokkos teams and shared memory could be an effective tuning strategy.
Since the original example was too easy to optimize by moving the cos out of the loop, I changed it to the following (using restrict for x):
__global__
void saxpy(int n, real a, real * __restrict__ x, real * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  //if (i < n) y[i] = a*x[i] + y[i];
  if (i < n) {
    for (int ll = 0; ll < 1000; ll++) {
      y[i] = a*cos(x[i] + y[i] + ll);
    }
  }
}
Summit timers
Timer local : start 749058336, stop 749058816, diff 8.8207683563232421875000
Timer with comm : start 744250800, stop 744666064, diff 10.3797435760498046875000
Spock timers
Timer local : start 7837088, stop 7837280, diff 11.3655920028686523437500
Timer with comm : start 7836704, stop 7836896, diff 15.3295888900756835937500
So, restrict does not resolve the issue that Spock is slower when cosine cannot be optimized out of the loop?
One thing that I'm seeing is that the best tuning strategies can be very sensitive to detailed specifics of the example. I wonder if "real" E3SM code has very different characteristics in terms of unrolling and data re-use than the thousand-cosine loop above. @oksanaguba, could you point me to the E3SM code that you are trying to represent?
@twhite-cray thanks -- I am cleaning my branch and preparing scripts for you to use, will let you know as soon as I am done.
CLONE info:
E3SM repo: git@github.com:E3SM-Project/E3SM.git
branch oksanaguba/homme/spock
COMMIT:
commit 2f173cf4e0adb7e3abf846b3804518bead7b2c00 (HEAD -> oksanaguba/homme/spock, origin/oksanaguba/homme/spock)
Author: Oksana Guba <oksanaguba@gmail.com>
Date: Wed Nov 10 17:31:54 2021 -0500
set the last pair for hip manuall
Update submodules, too:
git submodule sync; git submodule update --recursive --init
FOR SPOCK:
myfol=/gpfs/alpine/cli115/scratch/onguba/for-support-nov09.2021/spock
ufol=your-build-spock
mkdir $ufol
cd $ufol
cp $myfol/build-hipcc.sh .
# in the build script, modify path to your clone
# like this:
# HOMME=${YOUR_CLONE}/components/homme
# where YOUR_CLONE is the root folder of the E3SM clone
source $myfol/load-spock-modules
source build-hipcc.sh
mkdir run1
cd run1
mkdir movies
cp $myfol/run1/acme* .
cp $myfol/xx*nl .
cp $myfol/run.sh .
source run.sh
grep prim_main time*
EXAMPLE OUTPUT SPOCK:
grep prim_main *
time-preqx-xx-ne10-nmax500.298603:"prim_main_loop" - 1 1 1.000000e+00 6.152305e+01 61.523 ( 0 0) 61.523 ( 0 0)
which means the main part of the simulation took 62 seconds.
FOR SUMMIT:
myfol=/gpfs/alpine/cli115/scratch/onguba/for-support-nov09.2021/summit
build script is now build-summit.sh , otherwise instructions are the same.
EXAMPLE OUTPUT SUMMIT:
time-preqx-xx-ne10-nmax500.1616177:"prim_main_loop" - 1 1 1.000000e+00 2.649569e+01 26.496 ( 0 0) 26.496 ( 0 0)
NOTE that for Spock build I have to use an external build outside of E3SM submodules, but it seems that permissions are ok.
UPDATE:
Someone pointed out to me that I was missing -DNDEBUG in the flags (other flags were suggested, too, but those did not make a difference). This flag shrank the timer 2x:
time-preqx-xx-ne10-nmax500.298603:"prim_main_loop" - 1 1 1.000000e+00 6.150645e+01 61.506 ( 0 0) 61.506 ( 0 0)
time-preqx-xx-ne10-nmax500.298621:"prim_main_loop" - 1 1 1.000000e+00 3.113658e+01 31.137 ( 0 0) 31.137 ( 0 0)
I pushed the change to the branch.
@twhite-cray -- if you have already read the original comment above, please, see the update at the end of it. It seems to be resolved. thanks!
Spock vs Summit, efficiency metric.
Calls to assert in Hip kernels can be very expensive, even if they are always true. They turn on printf support in the kernels, which uses up lots of registers and may cause register spilling. I think -DNDEBUG turns off all the assert calls.
@oksanaguba, I'm not in the project CLI115, so I can't read $myfol. Could you copy the directory into /gpfs/alpine/cli115/world-shared/onguba?
For the runs on your plot, does ne10 mean NP=10? And what was the number of vertical levels, nlev?
Thanks!
I copied the folder to the one you suggested. Builds are probably ruined, but I could re-do that. The HIP Kokkos build is in my home folder, and it seems that it is readable (unless I missed something).
ne10 means 10*10*6 finite elements per mesh (this is a cubed-sphere mesh with 6 sides and 10 elements per edge). NP=4 is the number of GLL points per element's edge (so, 16 GLL points per element). We usually do not change NP. GLL points are 'repeated', so neighbor elements have their own edge GLLs, which are then synchronized.
In these runs, number of vertical levels is 72.
thanks
@oksanaguba, I was able to reproduce the 31s run on Spock. Thanks!
I want to test some code optimizations. In preparation for testing, I tried inserting a bug (commented out compute_qtens(kv) in run_tracer_phase), but it didn't appear to change any of the output, neither standard out nor mass.out.
Is there a change that I can make to the test so that it prints out results that I can compare for correctness?
This will slow down the run but it will print state of the model much more often. Change
statefreq=999999
disable_diagnostics = .true.
in namelists to
statefreq=1
disable_diagnostics = .false.
You should be able to see much more output, let me know if not.
Regarding correctness in general, this may require more extensive testing, with netcdf output and eyeball norms. When you have suggestions, we can re-run and look.
That is, if you introduce non-BFB (not bit-for-bit) changes via optimization, there is no easy way to tell that the code remains correct (with some exceptions for the conservation quantities that the model prints out).
If you introduce BFB changes and the output remains identical, it is still not guaranteed that the code is identical. What we do then is run the 2 codes with netcdf output and compare.
^^ updated my comment
@oksanaguba I'd like to start investigating the performance differences between V100 and MI250x. It appears that the directories you previously made available (in /gpfs/alpine/cli115/world-shared/onguba/for-support-nov09.2021/) might have been scrubbed. I do not see any build-hipcc.sh, run.sh, etc. Would it be possible for you to re-populate this directory?
@mattdturner thanks, yes, the folder is gone. I reran everything in folder
/gpfs/alpine/cli115/world-shared/onguba/for-support-apr2022
Note that my Kokkos build is no longer available, but you could build using the Kokkos build script that is in the same folder. My Kokkos repo's commit is outdated; I do not think it matters much.
I just ran my new hommexx build; the timers are comparable except for the high-workload case, ne20 (which is 20*20*6 finite elements), where the performance degraded. I did it quickly, so I might have missed something, or my binding options are not valid anymore.
Also, use the most recent commit for the branch oksanaguba/homme/spock.
So with ne20 there's no performance degradation? Does the MI100 start to outperform the V100 if you try a larger problem?
There is degradation for the ne20 run today on Spock compared to a run from February on Spock. For hommexx so far, MI100 is slower than V100 in all results, both from today and from February. Btw, ne20 does not fit on a V100; it is too big.
In detail, the timers today are [5 7 8 14 32 71 139] for problem sizes [ne2 ne3 ne4 ne6 ne10 ne15 ne20], while before they were [5, 7, 8, 14, 31, 70, 121].
@twhite-cray if I remember correctly, you asked me to re-create the homme benchmark runs on Crusher? If so, updated instructions are below.
The folder is /gpfs/alpine/cli115/world-shared/onguba/for-support-june2022-crusher (FOLDER).
The branch is oksanaguba/homme/spock.
The commit there is
commit 41f0c42e4550e8d1782153a650659c38e21846f6 (HEAD -> oksanaguba/homme/spock, origin/oksanaguba/homme/spock)
Kokkos repo: branch develop, commit 24b3eb42ba72cc4ea4dae441916dd6db8dcb8613.
Modules used for both the kokkos build and the homme build are in FOLDER, file load-crusher-modules.
How to build kokkos: in FOLDER, look for the file build-kokkos/build*sh.
How then to build homme: in FOLDER, look for the file build-hipcc-preqx.sh; there, change the HOMME variable to point to your clone, and possibly point to your own kokkos install via the variable E3SM_KOKKOS_PATH.
After building homme: make a directory run2, mkdir run2/movies, copy the acme* files from FOLDER/run1 into run2, and copy the FOLDER/run.sh script too; if needed, point it to your build of homme and to the file with the proper modules, then submit run.sh.
run.sh uses 2 GCDs, 2 ranks. Please let me know if there are questions.
@oksanaguba, yes, thanks! I won't be able to get to it this week, and I'm out of the office all next week, so I copied your for-support-june2022-crusher to HPSS. I hope to take a deeper look at this early in July.
@twhite-cray noticed that two items were missing from the instructions: rename libkokkoscore.a to libkokkos.a, and copy run1/xxinput.nl to run2.
I made some progress looking into the performance of this case. A rocprof run of ne20 with a single MPI task gave the following top kernels.
Kernel | Calls | Total time (s) | % of all kernel time |
---|---|---|---|
AALTracerPhase | 1503 | 31.6 | 21.7 |
CaarFunctorImpl | 2505 | 17.5 | 12.0 |
BIHPreNup | 501 | 15.2 | 10.5 |
BIHPostConstHV | 501 | 14.5 | 10.0 |
compute_qmin_qmax | 1503 | 12.9 | 8.9 |
I performed experiments reimplementing the top kernel, AALTracerPhase. First, I tried to use a Kokkos::TeamPolicy of size NP*NP, with the hope that I could use set_chunk_size to block it into 4 or 8 vertical levels. The commit of Kokkos that this version of Hommexx uses doesn't appear to implement blocking, however. Despite setting the chunk size, I always get workgroups of size 1x16x1, not the desired 1x16x4 or 1x16x8.
So I tried a TeamPolicy of size warpSize, and I partitioned it into TEAM_LEV*NP*NP threads, where TEAM_LEV = warpSize/(NP*NP). For AMD GPUs like those on Crusher, TEAM_LEV = 4. I then launched the kernel with ne*nq*NUM_LEV/TEAM_LEV workgroups.
This configuration allows the following optimizations.
The following charts compare the performance of the original AALTracerPhase kernel with my implementation using the new optimizations, all using a single MI250X GCD on Crusher. "Level major" means the threads in the team have the level as the fastest dimension. "Point major" means the threads in the team have the level as the slowest dimension.
These optimizations show a potential for a 2x performance improvement in this kernel. You can find the code here: https://github.com/twhite-cray/E3SM/tree/twhite-cray/hommmexx-preqx/crusher
My implementation has some issues regarding production use. It needs a TeamPolicy with working chunking, which may be available in newer versions of Kokkos. Also, the faster "Level major" implementation performs its reduction with the chunk as stride, which I don't think is supported by Kokkos hierarchical parallelism.

I looked at the next kernel on the list, CaarFunctorImpl. It does multiple scan reductions in the vertical, along with horizontal gradient computations. Here are a couple of ideas for TeamPolicy optimizations. Since NP*NP*NUM_LEV = 1152 > 1024, use NP*NP*NUM_LEV/2 = 576; each thread is thus responsible for two points. Local shared memory can help with both vertical scans and horizontal gradients.

I'd like to discuss the performance results and code prototype from my previous comment before proceeding.
Good news: I was able to run this Hommexx test case with Kokkos 3.7. Bad news: performance was the same, and Kokkos::TeamPolicy::set_chunk_size still does nothing. I checked the Kokkos Slack, and it confirmed that this is the current expected behavior.
I got an error when I ran with limiter_option = 0, so I compiled and ran AALTracerPhase and the point-major version of my replacement with the limiter code eliminated at compile time using #if 0. Here are the resulting kernel times.
The limiter is not as significant as I thought, but it is a larger fraction of the optimized time. In one of my previous experiments, I did see a significantly higher runtime from a "bad" implementation of the limiter on my part. I guess that led to my prejudice against it.
I was able to reproduce the performance of the "point major" implementation using only Kokkos. The strategy is to use "team" parallelism across a block of vertical levels and "vector" parallelism across each element. The limiter uses parallel_reduce with ThreadVectorRange for the sums across element points. https://github.com/twhite-cray/E3SM/commit/ee749b3b6ec13bbb700d3d9f5aa6d9823845210c
I think these changes should be portable across native GPU implementations of Kokkos, but they may not work with OpenMP or OpenMP offload, because they rely on all "vector" threads of the team having independent values for all local variables.
@twhite-cray -- thanks! Could I please reiterate some things we have talked about:
-- Is it correct that the current impl of the limiter 8 is not good for HIP, and therefore your changes are needed?
-- Is it correct that the limiter 8 does not take too much time overall wrt the total timer? (I haven't tried that run myself yet.)
-- Is there an insight, from what you have done for the limiter, about why the rest of the code is not as performant on HIP?
We (or I) can look again at which parts of the code are particularly slow on Crusher compared to Summit. Unrelated to your improvements, I still need to try 2 more suggestions about homme HIP performance.
thanks again!
-- Is it correct that the current impl of the limiter 8 is not good for HIP, therefore, your changes are needed? -- Is it correct that the limiter 8 does not take too much time overall wrt the total timer? (i haven't tried that run myself yet).
The limiter code itself does not appear to be the cause of the relative slowdown between the existing code and mine. The plot above shows that little time was gained by turning it off.
The performance improvement likely comes from a combination of the following: using NUM_LEV/4 threads per element-tracer pair (for a total of elements x tracers x NUM_LEV/4 threads), and limiting what is copied to each workgroup from *this.

-- Is there an insight, from what you have done for the limiter, about why the rest of the code is not as performant on HIP?
I've been looking at CaarFunctorImpl, and I think it could benefit from a different TeamPolicy, with teams of 16 with warpSize vectors (32 in Cuda, 64 in Hip). The idea is to maximize the parallelism in the vertical dimension while still allowing fast Kokkos scan operations. Other kernels that rely on scans in the vertical may benefit from this as well.
AALTracerPhase has no vertical scans, so it didn't need such a change. The sums in the limiter favored Kokkos vectors in the horizontal instead.
Here's an update on optimizations for CaarFunctorImpl. The code changes start here:
https://github.com/twhite-cray/E3SM/blob/babeafcd22250d881cb015c71a3f5a5a06538165/components/homme/src/preqx_kokkos/cxx/CaarFunctorImpl.hpp#L217
First, I refactored the kernel into a single lambda, mostly so I could see what is going on in the code. This refactoring included the following optimization strategies. The refactored kernel assumes rsplit != 0 and n0_qdp < 0; if either of these conditions is false, the original kernel is used. The lambda also replaces operator() and its implicit copy of *this; the goal was to limit the data structures copied to each workgroup, instead of always copying the entire CaarFunctorImpl object. I don't know if this optimization was important for performance, but it did help me to see dependences within the code.

The above summarizes the initial optimizations implemented before I did any performance comparisons. Then I performed a series of refinements to reduce the use of buffers in global memory, by moving them to Kokkos scratch or local variables. The move of global variables to local is an important optimization for GPUs that is simplified by the monolithic-kernel approach. Attempts to make kernels more modular through function calls should carefully consider how to enable the use of scratch and local variables.
Here are timings comparing the original kernel functor for CaarFunctorImpl ("Original"), my initial refactored kernel ("Initial"), and my current refactored kernel ("Current"). The times are from the GPTL "caar compute" timer for a single-GCD ne20 run on Crusher.
One or more additional conversions of global buffers to scratch may be possible, but Crusher is undergoing maintenance today, and I wanted to go ahead and update my progress.
We noticed that on spock we cannot achieve expected on-node performance using our code, HOMME, compared with summit on-node performance.
We prepared a few simple examples which also do not show good performance on Spock. The examples use cuda/hip code or the kokkos library. Below are sample outputs from the examples, with the timers of interest in bold. The kokkos examples have a timer for the case without access to device memory ('f1nd') and with it ('f1'). Below we also describe where the examples are located on Summit and how they were built.
SUMMIT raw cuda:
N is 1048576 Size of Real: 8
Timer local : start 774704320, stop 774704800, diff 8.8184003829956054687500
Timer with comm : start 774528432, stop 774703840, diff 10.3657598495483398437500

SPOCK raw hip:
N is 1048576 Size of Real: 8
Timer local : start 7837664, stop 7837856, diff 11.8738842010498046875000
Timer with comm : start 7837280, stop 7837472, diff 15.8701896667480468750000

SUMMIT kokkos:
n 128000 m 256
f1 : 3.709793e-04 F1 check: 5.2252371674778481e+09
f1nd: 1.130819e-03 F1 check: 5.2252371674778481e+09

SPOCK kokkos:
n 128000 m 256
f1 : 8.769035e-04 F1 check: 5.2252371674778481e+09
f1nd: 1.976013e-03 F1 check: 5.2252371674778481e+09
Examples are located in my home folder:
Raw cuda/hip examples have output of the used modules, the env, and a build file (makeme).
Kokkos examples, using the same environment, contain a kokkos build folder with a script to build (the script is kokkosbld/makescript.sh, which should be followed by 'make install') and instructions on how to build in the readme file.