@ambrad I can pick this up, but I do have some questions.
// 2D -> 1D thread index space. Need to make vector dim the faster for
// coalesced memory access on the GPU. Kokkos doesn't expose the number of
// threads in a team, so we have to go to the lower-level API here.
const int nthr_per_team =
#if defined __CUDA_ARCH__ || defined __HIP_DEVICE_COMPILE__
  blockDim.x,
#else
  1,
#endif
  team_niter = (niter + nthr_per_team - 1)/nthr_per_team;
assert(OnGpu<ExecSpace>::value || nthr_per_team == 1);
parallel_for(Kokkos::TeamThreadRange(team, team_niter), [&] (const int team_idx) {
  parallel_for(Kokkos::ThreadVectorRange(team, nthr_per_team), [&] (const int vec_idx) {
    const int k = team_idx*nthr_per_team + vec_idx;
    if (k >= niter) return;
    fn(k);
  }); });
}
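For context, a minimal sketch (not code from the repo) of what the flat loop above could look like once TeamVectorRange expresses the 2D -> 1D flattening directly, with fn, team, and niter as in the snippet:
// Let Kokkos distribute the flat iteration range over the team's threads
// and vector lanes instead of computing k from blockDim and thread indices.
Kokkos::parallel_for(Kokkos::TeamVectorRange(team, niter), [&] (const int k) {
  fn(k);
});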
bfb() calls. I see architecture-dependent code for getting the number of threads and so on, but they are only used in serial for() statements in cr() and thomas(). I need to understand better what is happening with these solves. Also, a bit of a tangent: why does homme reference scream_tridiag.hpp while shoc uses ekat_tridiag.hpp? These files seem to do the same thing.
parallel_* at all. I think it can use TeVR inside the while loops, but I'd want to study the performance effects before doing that. It's also possible the TeVR impl has a utility inside Kokkos::Impl (or something like that) that would give the thread ID. That would be the best thing; it would then replace the current get_thread_id_within_team_gpu impls in the tridiag solver.
Re: tridiag duplication. Homme needs the tridiag solver even outside of eamxx. Since e3sm does not have EKAT, we copied the file over to homme.
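(For illustration of the get_thread_id_within_team_gpu point above: a sketch, with a hypothetical name and body rather than the actual impl in the tridiag solver, of the kind of device-side helper that a Kokkos-provided utility would replace.)
// Sketch: return this thread's index within its team on the device,
// mirroring the blockDim.x convention used in the snippet above.
KOKKOS_INLINE_FUNCTION int get_thread_id_within_team () {
#if defined __CUDA_ARCH__ || defined __HIP_DEVICE_COMPILE__
  return threadIdx.x;
#else
  return 0; // host backends run these solves with one thread per team here
#endif
}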
Thanks @bartgol. I forgot to answer that question.
Ahh yes, I feel like I've asked that question before. E3SM doesn't know about ekat. Hopefully I'm done asking that one.
@ambrad Here is the update discussed in the scream meeting:
All the TeamThreadRange instances can be replaced with TeamVectorRange trivially, outside of one instance in scream_column_ops.hpp:790:
Kokkos::parallel_scan(Kokkos::TeamThreadRange(...),...);
It turns out that replacing it with TeamVectorRange in a parallel_scan compiles and runs on CPU, but on GPU it is not implemented. After a quick look, I can't figure out why it works even on CPU (I can't find the impl). Maybe it is implemented for GPU, but I am not calling a correct function signature.
I had put this on hold since, as part of my work with Kokkos, I am updating unit_tests and I'm about to start with parallel_scan with team policies, so I'll have to figure out whether (and why) this is or isn't implemented. For now I could submit a PR changing all other instances.
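For reference, the team-level scan idiom in question looks roughly like the following; this is a sketch with illustrative names, not the actual code at scream_column_ops.hpp:790:
// Illustrative inclusive scan over a team range. parallel_scan over a team
// range expects an (index, partial_sum, is_final) lambda; the result is
// written only on the final pass.
Kokkos::parallel_scan(Kokkos::TeamThreadRange(team, nlev),
                      [&] (const int k, Real& partial, const bool is_final) {
  partial += dz(k);
  if (is_final) z_int(k) = partial;
});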
Mystery solved: TeamVectorRange returns TeamThreadRangeBoundariesStruct on CPU, but not on CUDA. I'll create a PR with all other instances changed. @ambrad Should this have a performance impact?
What you write about the scan impl is quite plausible; often the team-level scan doesn't support everything that parallel-for and even reduce do. A PR for the physics replacing what can be replaced easily is a good idea. You can then put off the rest until later. I can handle the tridiag and Hommexx side at some point.
The key thing we're trying to accomplish with this task is to make clear in the code that we're not doing nested loops, and doing just the easy stuff strongly supports that goal.
Re: performance, when working on performance-critical code, you should always check performance with an ne30pg2 v1 case on Summit, Ascent, PM-GPU, or Crusher before and after the change, and report the results in the PR or the issue. See https://github.com/E3SM-Project/scream/issues/1026#issuecomment-1378108313 for an example.
@ambrad Should this have a performance impact?
To be clear: No. The performance of the kernels should be exactly the same.
There are a couple of places in physics where we do have nested loops. E.g.,
shoc_update_prognostics_implicit_impl.hpp:150:
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev), [&] (const Int& k) {
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, num_qtracers), [&] (const Int& q) {
qtracers_rhs_s(k, q) = qtracers_s(q, k);
});
});
Should these stay the same? I believe you can use TeVR - ThVR, but is that useful?
Leave those as is. I should try to clarify the context and purpose of this task a bit more. We started SCREAM before TeamVectorRange was available. But we wanted a flat loop. So we use(d) TeamThreadRange with no inner ThreadVectorRange. Technically, this is incorrect usage, because if the vector dimension is > 1, then the computation is meaningless. But we force the vector dimension to be 1, making it OK. Now that TeamVectorRange is available, we want to use it to codify clearly that we mean one flat loop, when we mean that. When we have nesting, we of course want to keep the nesting.
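As a concrete illustration of the convention change described above (a sketch with illustrative names):
// Old convention: a flat loop written as TeamThreadRange; correct only
// because the team policy forces the vector length to 1.
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev), [&] (const int k) {
  out(k) = a(k) + b(k);
});
// New convention: TeamVectorRange states explicitly that this is a single
// flat loop over all threads and vector lanes of the team.
Kokkos::parallel_for(Kokkos::TeamVectorRange(team, nlev), [&] (const int k) {
  out(k) = a(k) + b(k);
});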
You might check whether EKAT inherited any of SCREAM's flat-loop usage convention. If so, fix those, too, please.
@ambrad Looking at ekat, in the linear interp code there is a
parallel_for(TeamThreadRange, [&]() { vector_simd for ... });
Can TeamVectorRange be used instead, or will that interfere with vector_simd, since TeamVectorRange uses vector lanes? I am not familiar yet with the vector_simd tag.
Tagging @bartgol @jgfouca since they wrote the EKAT stuff.
@tcclevenger, I'm not seeing vector_simd anywhere in the lin_interp code?
ekat>% git grep -l vector_simd
src/ekat/ekat_macros.hpp
src/ekat/ekat_pack.hpp
src/ekat/ekat_pack_kokkos.hpp
src/ekat/ekat_pack_math.hpp
TeamVectorRange can be used if there is no internal ThreadVectorRange. vector_simd is just a pragma synonym for use in Intel.
@jgfouca lin_interp_impl() contains a call to ekat_masked_loop(), which is a directive for a vector_simd for loop.
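To make that pattern concrete, here is a schematic with illustrative names; the vector_simd macro stands in for the EKAT macro, which per the above expands to a compiler simd pragma on Intel rather than using Kokkos vector lanes:
// Schematic of the lin_interp-style loop: outer Kokkos range over packed
// levels, inner compiler-vectorized loop over the entries of one pack.
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, npacks), [&] (const int k) {
  vector_simd
  for (int p = 0; p < Pack::n; ++p)
    y2(k)[p] = y1(k)[p]; // placeholder body; the real code interpolates
});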
I believe @tcclevenger is done with most of this. I've assigned myself to some little remaining things in the tridiag solver and gllfvremap_mod.
Closing this. Will open another for tridiag solver.
Use TeamVectorRange, which was not available when the project first started. This will clarify the threading intent in column physics (which currently uses TeamThreadRange with the assertion that vector size is 1) and remove architecture-dependent workarounds in the tridiag solver and GLL-FV remap.