Allow building with Cuda UVM and run select kernels on host.

bartgol commented 2 years ago

Logging ideas that sparked in a conversation with @ndkeen .

During debug/development on GPU, it might be helpful to run a handful of kernels on Host (rather than device), to allow bisecting where an error is generating. Currently, on GPU, we use the native Cuda memory space, which means data is not accessible on host. However, CudaUVM should allow that.

To allow running a particular kernel on host, we should allow to build scream with CudaUVM (if explicitly requested). CudaUVM might be slower than native Cuda space, but we would use this just to debug, and memory space should not have any impact on the actual numbers generated (that is, Cuda should be bfb with CudaUVM). Once we verify things work correctly, we can try to swap policies:

// This is the "normal" policy
// const auto policy = ekat::ExeSpaceUtils<ExeSpace>::get_default_team_policy(nj, nk_pack);
const auto policy = ekat::ExeSpaceUtils<HostDevice::ExeSpace>::get_default_team_policy(nj, nk_pack);

Kokkos::parallel_for (
  "my_kernel",
  policy,
  KOKKOS_LAMBDA(const MemberType& team) {
  ...
});

Ideally, that should be enough to force my_kernel to run on CPU. This might be a view a tad too naive, but I think it should work...more or less. It is basically what YAKL does for BFB runs.

ndkeen commented 2 years ago

Any more ideas about this? In this issue we are talking about allowing certain kernels to run on CPU instead of device. But what about allowing certain kernels to maybe still run on GPU, but in some super safe (and slow) manner? Perhaps serially?

bartgol commented 2 years ago

The concept of "serial" is a bit fuzzy. We do not have any issue with iterations of ||for loops be run in ||. The issue is more likely related to concurrent writes or operations that require threads to exchange information (such as in || reductions).

Right now, we have a very limited set of reductions/scans in scream. We already have the ability to perform reductions serially, provided that they are reduction over a single column, handled by a single team of threads. I am not sure if we have the impl for serializing || scans (I don't think so, but we could add it).

In all other cases, I don't see the point of making a simple || for (like a[i] = b[i] + c[i]) run serially. Besides, I don't even know if we can manage to force only one thread on the whole GPU to run... We can force 1 thread per team (aka block). Since each block is usually running on a single column, and no two blocks work on the same column, and (at least in physics) we don't have operations that couple columns (except for MPI, but that's outside kernels), I do think this should be enough.

That said, we could provide wrappers to lambdas (maybe in ekat), that accept an additional bool input, specifying whether the execution of the lambda has to be serialized (i.e., wrapped in a Kokkos::single).

bartgol commented 2 years ago

This is more complicated than I thought. Besides some typedefs that can be quickly fixed, the biggest hurdle comes from the following issue:

Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::Cuda::memory_space> a("",1);
Kokkos::View<int*, Kokkos::LayoutRight, Kokkos::HostSpace> b;
b = Kokkos::create_mirror_view(a); // ERROR!

The reason it errors out is that a host mirror of a view using CudaUVM is...a view using CudaUVM (since it allows to access from host). However, a CudaUVM view is not assignable to a host view, I think b/c the deallocation cannot be performed by the host memory space (only CudaUVM knows how to deallocate, via cudaFree).

In some places, the restriction is easy to overcome. However, some places do not define host data as the mirror view of a device counterpart, but rather as a view templated on some exec space. Something like this:

template<typename MemSpace>
struct MyData {
  Kokkos::View<int*,MemSpace> a;
};

MyData<ExecSpace::memory_space> my_data_dev;
MyData<HostSpace> my_data_host;

The two data struct are not compatible on host, in the sense that the host mirror of my_data_dev.a cannot be assigned to my_data_host.a.

I have to think about how to handle this. Currently, we have a few places in scream, as well as a few in Homme that follow the pattern above.

AaronDonahue commented 2 weeks ago

@bartgol , is this something we still want to do?

bartgol commented 2 weeks ago

@AaronDonahue ideally, yes. But it's not a small task, and I am not sure how big of a priority it is. I'll slap the wishilist label on it, and we'll revisit later.

E3SM-Project / scream

Allow building with Cuda UVM and run select kernels on host. #1572