MeasureTransport / MParT

Monotone Parameterization Toolkit (MParT): A core library for constructing and using transport maps.
https://measuretransport.github.io/MParT/
BSD 3-Clause "New" or "Revised" License

Create Preliminary GPU bindings #150

Open dannys4 opened 2 years ago

dannys4 commented 2 years ago

As discussed in #149, we need to add bindings so that we can use GPU-powered maps in the bound languages. I'm thinking the way to do this is to write a function like Wrap<MemorySpace>(arr) for each of the bindings that takes in a host-language array (e.g., a numpy, MATLAB, or Julia array) and returns a wrapped Kokkos::View. Then we can pass that into whatever functions we want (e.g., ConditionalMapBase) without needing to change the base library. We could then template the binding functions (e.g., ConditionalMapBaseWrapper() for Python) on the memory space, so we only have to write one set of bindings (hopefully!) for both memory spaces.

If someone puts in a raw host-language array, then I suppose we just assume that they mean to use Kokkos::HostSpace.
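
A minimal sketch of the Wrap<MemorySpace>(arr) idea, for illustration only. It does not use Kokkos: HostSpace, DeviceSpace, ArrayView, and DescribeBinding are hypothetical stand-ins for the memory-space tags, an unmanaged Kokkos::View, and a templated binding helper. A std::vector plays the role of the host-language array.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the memory-space tags; these are NOT the Kokkos types.
struct HostSpace   { static std::string name() { return "Host"; } };
struct DeviceSpace { static std::string name() { return "Device"; } };

// Hypothetical non-owning view, playing the role of an unmanaged Kokkos::View.
template <typename MemorySpace>
struct ArrayView {
    double*     data;
    std::size_t size;
};

// Wrap an existing buffer without copying it. The default template argument
// encodes the rule "a raw host-language array means host space".
// Note: in this sketch only the tag differs between spaces; a real device
// wrap would need a pointer that actually lives in device memory.
template <typename MemorySpace = HostSpace>
ArrayView<MemorySpace> Wrap(std::vector<double>& arr) {
    return ArrayView<MemorySpace>{arr.data(), arr.size()};
}

// One binding helper, written once and instantiated per memory space.
template <typename MemorySpace>
std::string DescribeBinding(const ArrayView<MemorySpace>& view) {
    assert(view.data != nullptr || view.size == 0);
    return MemorySpace::name() + " view of " + std::to_string(view.size) + " entries";
}
```

Calling Wrap(x) with no template argument yields a host-space view, while Wrap<DeviceSpace>(x) selects the other instantiation of the same helper, which is the "write the bindings once" property the proposal is after.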

mparno commented 2 years ago

Also check out the relatively new StridedMatrix and StridedVector aliases in ArrayConversions.h. You'd probably want to expose StridedMatrix<double, Kokkos::HostSpace> and StridedMatrix<double, Kokkos::DefaultExecutionSpace::memory_space>, as well as the StridedVector versions of these. Note that Kokkos::HostSpace and Kokkos::DefaultExecutionSpace::memory_space will be the same type if Kokkos was not compiled with device support (e.g., with neither CUDA nor SYCL enabled).

The ToDevice and ToHost functions in ArrayConversions.h might also be useful in the implementation of this.

mparno commented 2 years ago

@dannys4 Were you imagining exposing two versions of each class: one that evaluates on Host and one that evaluates on Device? That would give users control over where the evaluation occurs, which might be nice since the CPU evaluation will likely be faster for small batch sizes because it avoids the host->device->host copies.

dannys4 commented 2 years ago

> Also check out the relatively new StridedMatrix and StridedVector aliases in ArrayConversions.h

I saw that PR and it makes sense to use those. I'll have to dig deeper once I get the chance.

> The ToDevice and ToHost functions in ArrayConversions.h might also be useful in the implementation of this.

This was my thought as well!

> Were you imagining exposing two versions of each class?

I think we're on the same page here, but "each class" is a little ambiguous. I was thinking of wrapping everything that is templated on something like MemorySpace in the host space and, if the Kokkos_ENABLE_CUDA option is ON, in the device space as well (if it is not ON, then just throw an error when you call toDevice). I'm trying to minimize data movement so that we're not continually and pointlessly copying between CPU and GPU. We should only have to write those wrappers once, in some templated function like ConditionalMapBaseWrapper<MemorySpace>(), and then call that function twice, I would hope. Maybe this answers your comments/questions?
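
A rough sketch of the "write once, call twice" pattern, again without Kokkos. Everything here is an assumption made for illustration: the SKETCH_ENABLE_DEVICE macro stands in for a device-enabled build, the vector of names stands in for a binding module, and ToDeviceOrThrow is a hypothetical helper, not the ToDevice function in ArrayConversions.h. It also skips the second registration when both memory spaces are the same type, so nothing is registered twice in a host-only build.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the memory-space tags; these are NOT the Kokkos types.
struct HostSpace   { static constexpr const char* suffix = ""; };
struct DeviceSpace { static constexpr const char* suffix = "Device"; };

// Hypothetical build switch mirroring a device-enabled build.
#ifdef SKETCH_ENABLE_DEVICE
using DefaultMemorySpace = DeviceSpace;
#else
using DefaultMemorySpace = HostSpace;
#endif

// True only when the default memory space differs from the host space.
constexpr bool kDeviceEnabled =
    !std::is_same<HostSpace, DefaultMemorySpace>::value;

// Written once; records the name a real binding layer would register.
template <typename MemorySpace>
void ConditionalMapBaseWrapper(std::vector<std::string>& registered) {
    registered.push_back(std::string("ConditionalMapBase") + MemorySpace::suffix);
}

// Called for the host space always, and for the device space only when it
// is a distinct type.
void RegisterBindings(std::vector<std::string>& registered) {
    ConditionalMapBaseWrapper<HostSpace>(registered);
    if (kDeviceEnabled) {
        ConditionalMapBaseWrapper<DefaultMemorySpace>(registered);
    }
    assert(!registered.empty());
}

// Hypothetical helper: fails loudly in a host-only build instead of
// silently returning host data.
std::vector<double> ToDeviceOrThrow(const std::vector<double>& host) {
    if (!kDeviceEnabled) {
        throw std::runtime_error(
            "ToDeviceOrThrow: this build has no device memory space");
    }
    return host;  // placeholder for a real host-to-device copy
}
```

In a host-only build of this sketch, RegisterBindings records a single ConditionalMapBase entry and ToDeviceOrThrow raises std::runtime_error. Defining SKETCH_ENABLE_DEVICE adds a second, suffixed entry from the same template.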

mparno commented 2 years ago

#185 added GPU support to the Python bindings.

mparno commented 2 years ago

After some group discussions last week, we are going to move the Julia and MATLAB bindings to the post-JOSS milestone.