eyalroz / cuda-api-wrappers

Thin, unified, C++-flavored wrappers for the CUDA APIs
BSD 3-Clause "New" or "Revised" License
768 stars · 79 forks

Switch memory functions to only work with regions (and spans) #322

Open eyalroz opened 2 years ago

eyalroz commented 2 years ago

Look at our modified vectorAdd example. It's certainly nicer than the original, but it's just sad that we have to repeat ourselves again and again with respect to lengths and sizes:

int numElements = 50000;
size_t size = numElements * sizeof(float); 
// ... snip ...
std::generate(h_A.get(), h_A.get() + numElements, generator);
// ... snip ...
cuda::memory::copy(d_A.get(), h_A.get(), size);
// ... snip ...
cuda::memory::copy(h_C.get(), d_C.get(), size);

instead of being able to say:

auto h_A = make_unique_span<float>(numElements);
std::generate(std::begin(h_A), std::end(h_A), generator);
cuda::memory::copy(d_A, h_A);
// ... snip ...
cuda::memory::copy(h_C, d_C);

This is in no small part due to our use of standard-library-like unique pointers rather than unique regions (see issue #291 ); and that in turn is partially motivated by how our memory functions are mostly pointer-oriented, not region-oriented. Now that we have region classes - why should we not just turn away from the raw CUDA API style of passing pointers around, and instead only work with sized ranges? (And perhaps also spans, for C++20; or even bundle a span implementation for older distros?)
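For concreteness, a "unique region" in this sense is roughly a span that also owns its storage. Here is a minimal standalone sketch in plain C++ (hypothetical names and code, not the library's actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical sketch of a "unique region": a typed, contiguous buffer
// that owns its storage and always knows its own element count.
// Illustrative only - not the library's actual implementation.
template <typename T>
class unique_span {
public:
    explicit unique_span(std::size_t n) : storage_(new T[n]()), size_(n) {}

    T* data() { return storage_.get(); }
    std::size_t size() const { return size_; }
    T* begin() { return storage_.get(); }
    T* end() { return storage_.get() + size_; }

private:
    std::unique_ptr<T[]> storage_; // ownership, as in std::unique_ptr
    std::size_t size_;             // extent, as in std::span
};
```

A copy function taking two such objects can read both the addresses and the sizes off its arguments, so a length is stated once at allocation and never again at call sites.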

bgianfo commented 11 months ago

This would be a great feature for memory safety, one I actually had hoped already existed but I ended up here realizing that it's not yet available. :)

eyalroz commented 11 months ago

@bgianfo : So, in your opinion, is it worth moving somewhat away from the standard-library convention - std::unique_ptrs and support for raw pointers - in favor of "unique regions" (i.e. essentially spans with ownership)?

bgianfo commented 11 months ago

To give some context: I am fairly new to the library, but I'm working on updating an older project and have had good success so far. For the case where we copy a lot of arrays back and forth between device and host memory, I would like a way of using the type system so that we don't have to keep passing buffer sizes in bytes around. Having that information would also enable the library to potentially catch corruptions in copies before they happen.

It sounds like your proposed "unique region" solution would solve this problem nicely.
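As a concrete illustration of the safety argument: once both endpoints carry their sizes, a copy can refuse a mismatched pair before any memory is touched. A rough standalone sketch in plain C++ (hypothetical names; a real device-aware implementation would dispatch to cudaMemcpy rather than memcpy):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical sketch: a sized memory region, and a copy that validates
// extents up front instead of trusting a caller-supplied byte count.
// Illustrative only - not the library's actual API.
struct region {
    void* start;
    std::size_t bytes;
};

inline void checked_copy(region dst, region src) {
    if (dst.bytes < src.bytes) {
        throw std::invalid_argument("copy destination is too small");
    }
    // A real implementation would call cudaMemcpy here.
    std::memcpy(dst.start, src.start, src.bytes);
}
```

With pointer-oriented copies, an undersized destination is silently overrun; here the mismatch is caught before the first byte moves.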

I think it would also enable a nice ergonomic enhancement, where you could allocate host or device memory and do the copy back to host/device with one call (essentially a "region clone"), something like:

constexpr int mat_size = 100;
auto h_mat_1 = cuda::memory::host::make_unique<float[]>(mat_size);
auto h_mat_2 = cuda::memory::host::make_unique<float[]>(mat_size);

populate_matrices(h_mat_1.get(), h_mat_2.get(), mat_size);

auto d_mat_1 = cuda::memory::device::make_unique<float[]>(device, mat_size);
auto d_mat_2 = cuda::memory::device::make_unique<float[]>(device, mat_size);
cuda::memory::copy(d_mat_1.get(), h_mat_1.get(), mat_size * sizeof(float));
cuda::memory::copy(d_mat_2.get(), h_mat_2.get(), mat_size * sizeof(float));

With some helper method, this could be simplified to something like the following, if we were able to retain information about region sizes and types:

constexpr int mat_size = 100;
auto h_mat_1 = cuda::memory::host::make_unique_region<float[]>(mat_size);
auto h_mat_2 = cuda::memory::host::make_unique_region<float[]>(mat_size);

populate_matrices(h_mat_1.get(), h_mat_2.get(), mat_size);

auto d_mat_1 = h_mat_1.clone_to_device(device);
auto d_mat_2 = h_mat_2.clone_to_device(device);

The same thing could be applied to memory zeroing:

constexpr int mat_size = 100;
auto d_mat_1 = cuda::memory::device::make_unique_region<float[]>(device, mat_size);
d_mat_1.zero();

eyalroz commented 11 months ago

Well, I'll first say that I need feedback from at least a few users, including those with more experience than you describe, to make a decision...

Secondly, let me play the "devil's advocate" here. You write:

I would like a way of using the type system to not have to keep passing the size of buffers in bytes around.

but you already have that, with either cuda::span<T> or cuda::memory::region_t:

constexpr int mat_size = 100;
auto h_mat_1 = cuda::memory::host::make_unique<float[]>(mat_size);
auto h_mat_2 = cuda::memory::host::make_unique<float[]>(mat_size);
auto h_span_1 = cuda::span<float> { h_mat_1.get(), mat_size };
auto h_span_2 = cuda::span<float> { h_mat_2.get(), mat_size };

populate_matrices(h_span_1, h_span_2);

auto d_mat_1 = cuda::memory::device::make_unique<float[]>(device, mat_size);
auto d_mat_2 = cuda::memory::device::make_unique<float[]>(device, mat_size);
auto d_span_1 = cuda::span<float> { d_mat_1.get(), mat_size };
auto d_span_2 = cuda::span<float> { d_mat_2.get(), mat_size };

cuda::memory::copy(d_span_1, h_span_1);
cuda::memory::copy(d_span_2, h_span_2);

It's a bit verbose, but - is it not good enough? This is the dilemma.

Note that I could, in theory, define a "unique region" class without dropping all of the pointer-and-address API functions (for copying and allocation). That's a third option, which has the benefit of letting everyone do exactly what they want, but the detriment of not promoting safety and correctness, and of enlarging the already-huge set of variants of each of the copy, set, allocate, etc. functions.

eyalroz commented 6 months ago

@bgianfo : Please have a look at the unique-regions branch. It has a make_unique_span() for a typed owning contiguous container, and a make_unique_region() for an untyped owning version of cuda::memory::region_t.

bgianfo commented 6 months ago

I looked through the branch and it looks great IMHO. In retrospect, unique_span is exactly what I was hoping for, and it nicely solves the issues I was concerned about (ergonomics, verbosity, and the opportunity for bugs from pointer + size mismatches). I appreciate the time and effort to flesh this out. Thank you!

eyalroz commented 6 months ago

@bgianfo : I'm actually worried that it's a bit too much to have all of:

and that maybe I should drop at least one of those. Not to mention that we also have the cuda::memory::pointer class, which could in theory be merged with at least one of them.