@ambrad I can pick this up, but I do have some questions.
// 2D -> 1D thread index space. Need to make vector dim the faster for
// coalesced memory access on the GPU. Kokkos doesn't expose the number of
// threads in a team, so we have to go to the lower-level API here.
const int nthr_per_team =
#if defined __CUDA_ARCH__ || defined __HIP_DEVICE_COMPILE__
  blockDim.x,
#else
  1,
#endif
  team_niter = (niter + nthr_per_team - 1)/nthr_per_team;
assert(OnGpu<ExecSpace>::value || nthr_per_team == 1);
parallel_for(Kokkos::TeamThreadRange(team, team_niter), [&] (const int team_idx) {
  parallel_for(Kokkos::ThreadVectorRange(team, nthr_per_team), [&] (const int vec_idx) {
    const int k = team_idx*nthr_per_team + vec_idx;
    if (k >= niter) return;
    fn(k);
  }); });
}
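For context, a minimal sketch (not code from the repo) of what the flat loop above could look like once TeamVectorRange expresses the 2D -> 1D flattening directly, with fn, team, and niter as in the snippet:
// Let Kokkos distribute the flat iteration range over the team's threads
// and vector lanes instead of computing k from blockDim and thread indices.
Kokkos::parallel_for(Kokkos::TeamVectorRange(team, niter), [&] (const int k) {
  fn(k);
});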
bfb() calls. I see architecture-dependent code for getting the number of threads and so on, but they are only used in serial for() statements in cr() and thomas(). I need to understand better what is happening with these solves. Also, a bit of a tangent: why does homme reference scream_tridiag.hpp while shoc uses ekat_tridiag.hpp? These files seem to do the same thing.
parallel_* at all. I think it can use TeVR inside the while loops, but I'd want to study the performance effects before doing that. It's also possible the TeVR impl has a utility inside Kokkos::Impl (or something like that) that would give the thread ID. That would be the best thing; it would then replace the current get_thread_id_within_team_gpu impls in the tridiag solver.
Re: tridiag duplication. Homme needs the tridiag solver even outside of eamxx. Since e3sm does not have EKAT, we copied the file over to homme.
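(For illustration of the get_thread_id_within_team_gpu point above: a sketch, with a hypothetical name and body rather than the actual impl in the tridiag solver, of the kind of device-side helper that a Kokkos-provided utility would replace.)
// Sketch: return this thread's index within its team on the device,
// mirroring the blockDim.x convention used in the snippet above.
KOKKOS_INLINE_FUNCTION int get_thread_id_within_team () {
#if defined __CUDA_ARCH__ || defined __HIP_DEVICE_COMPILE__
  return threadIdx.x;
#else
  return 0; // host backends run these solves with one thread per team here
#endif
}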
Thanks @bartgol. I forgot to answer that question.
Ahh yes, I feel like I've asked that question before. E3SM doesn't know about ekat. Hopefully I'm done asking that one.
@ambrad Here is the update discussed in the scream meeting:
All the TeamThreadRange instances can be replaced with TeamVectorRange trivially, outside of one instance in scream_column_ops.hpp:790:
Kokkos::parallel_scan(Kokkos::TeamThreadRange(...),...);
It turns out that replacing it with TeamVectorRange in a parallel_scan compiles and runs on CPU, but on GPU it is not implemented. After a quick look, I can't figure out why it works even on CPU (I can't find the impl). Maybe it is implemented for GPU, but I am not calling a correct function signature.
I had put this on hold since, as part of my work with Kokkos, I am updating unit_tests and I'm about to start with parallel_scan with team policies, so I'll have to figure out whether (and why) this is or isn't implemented. For now I could submit a PR changing all other instances.
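For reference, the team-level scan idiom in question looks roughly like the following; this is a sketch with illustrative names, not the actual code at scream_column_ops.hpp:790:
// Illustrative inclusive scan over a team range. parallel_scan over a team
// range expects an (index, partial_sum, is_final) lambda; the result is
// written only on the final pass.
Kokkos::parallel_scan(Kokkos::TeamThreadRange(team, nlev),
                      [&] (const int k, Real& partial, const bool is_final) {
  partial += dz(k);
  if (is_final) z_int(k) = partial;
});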
Mystery solved: TeamVectorRange returns TeamThreadRangeBoundariesStruct on CPU, but not on CUDA. I'll create a PR with all other instances changed. @ambrad Should this have a performance impact?
What you write about the scan impl is quite plausible; often the team-level scan doesn't support everything that parallel-for and even reduce do. A PR for the physics replacing what can be replaced easily is a good idea. You can then put off the rest until later. I can handle the tridiag and Hommexx side at some point.
The key thing we're trying to accomplish with this task is to make clear in the code that we're not doing nested loops, and doing just the easy stuff strongly supports that goal.
Re: performance, when working on performance-critical code, you should always check performance with an ne30pg2 v1 case on Summit, Ascent, PM-GPU, or Crusher before and after the change, and report the results in the PR or the issue. See https://github.com/E3SM-Project/scream/issues/1026#issuecomment-1378108313 for an example.
@ambrad Should this have a performance impact?
To be clear: No. The performance of the kernels should be exactly the same.
There are a couple of places in physics where we do have nested loops. E.g.,
shoc_update_prognostics_implicit_impl.hpp:150:
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev), [&] (const Int& k) {
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, num_qtracers), [&] (const Int& q) {
qtracers_rhs_s(k, q) = qtracers_s(q, k);
});
});
Should these stay the same? I believe you can use TeVR - ThVR, but is that useful?
Leave those as is. I should try to clarify the context and purpose of this task a bit more. We started SCREAM before TeamVectorRange was available. But we wanted a flat loop. So we use(d) TeamThreadRange with no inner ThreadVectorRange. Technically, this is incorrect usage, because if the vector dimension is > 1, then the computation is meaningless. But we force the vector dimension to be 1, making it OK. Now that TeamVectorRange is available, we want to use it to codify clearly that we mean one flat loop, when we mean that. When we have nesting, we of course want to keep the nesting.
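As a concrete illustration of the convention change described above (a sketch with illustrative names):
// Old convention: a flat loop written as TeamThreadRange; correct only
// because the team policy forces the vector length to 1.
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev), [&] (const int k) {
  out(k) = a(k) + b(k);
});
// New convention: TeamVectorRange states explicitly that this is a single
// flat loop over all threads and vector lanes of the team.
Kokkos::parallel_for(Kokkos::TeamVectorRange(team, nlev), [&] (const int k) {
  out(k) = a(k) + b(k);
});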
You might check whether EKAT inherited any of SCREAM's flat-loop usage convention. If so, fix those, too, please.
@ambrad Looking at ekat, in the linear interp code there is a
parallel_for(TeamThreadRange, [&]() { vector_simd for ... });
Can TeamVectorRange be used instead, or will that interfere with vector_simd, since TeamVectorRange uses vector lanes? I am not familiar yet with the vector_simd tag.
Tagging @bartgol @jgfouca since they wrote the EKAT stuff.
@tcclevenger, I'm not seeing vector_simd anywhere in the lin_interp code?
ekat>% git grep -l vector_simd
src/ekat/ekat_macros.hpp
src/ekat/ekat_pack.hpp
src/ekat/ekat_pack_kokkos.hpp
src/ekat/ekat_pack_math.hpp
TeamVectorRange can be used if there is no internal ThreadVectorRange. vector_simd is just a pragma synonym for use in Intel.
@jgfouca lin_interp_impl() contains a call to ekat_masked_loop(), which is a directive for a vector_simd for loop.
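To make that pattern concrete, here is a schematic with illustrative names; the vector_simd macro stands in for the EKAT macro, which per the above expands to a compiler simd pragma on Intel rather than using Kokkos vector lanes:
// Schematic of the lin_interp-style loop: outer Kokkos range over packed
// levels, inner compiler-vectorized loop over the entries of one pack.
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, npacks), [&] (const int k) {
  vector_simd
  for (int p = 0; p < Pack::n; ++p)
    y2(k)[p] = y1(k)[p]; // placeholder body; the real code interpolates
});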
I believe @tcclevenger is done with most of this. I've assigned myself to some little remaining things in the tridiag solver and gllfvremap_mod.
Closing this. Will open another for tridiag solver.
Use TeamVectorRange, which was not available when the project first started. This will clarify the threading intent in column physics (which currently uses TeamThreadRange with the assertion that vector size is 1) and remove architecture-dependent workarounds in the tridiag solver and GLL-FV remap.