alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

abstract the memory access inside kernels #38

Closed BenjaminW3 closed 1 year ago

BenjaminW3 commented 9 years ago

There should not be direct access to memory buffers. This always implies knowledge about the memory layout (row- or column-major) which is not necessarily correct on the underlying accelerator.

j-stephan commented 4 years ago

Reviving this ancient issue.

There should not be direct access to memory buffers.

I agree! SYCL introduced memory accessors which hide the pointers quite nicely and provide a clean API for memory access. Maybe we can adapt this concept. CC @bernhardmgruber, this might tie into his work.

j-stephan commented 3 years ago

I've been giving this some thought since this is (kind of...) a requirement for the SYCL backend. Here are some ideas I'd like your opinions on:

  1. Replace alpaka::allocBuf with alpaka::makeBuf. makeBuf will then have the following overloads (a declaration sketch follows after this list):
     1.1 makeBuf<TElem, TIdx>(TDev const& dev, TExtent const& extent) (same as current allocBuf)
     1.2 makeBuf<TElem, 1>(TDev const& dev, InputIterator first, InputIterator last) - create a buffer from a standard C++ container (or everything else that can provide an InputIterator).
     1.3 makeBuf<TElem, TIdx>(TDev const& dev, TElem* ptr, TExtent const& extent, bool useHostPtr) - create a buffer from a host pointer. Back-ends supporting direct usage of host memory may set useHostPtr to true in which case there will be no device-side allocation. Instead, the pointer is used directly.
     1.4 makeBuf<TElem, TIdx>(TBuf otherBuf, TOffset const& offset, TExtent const& extent) - create a sub-buffer from an existing buffer on the same device. The sub-buffer will refer to the same (parts of) memory as the original buffer, thus no additional allocation is taking place. This also means that the dimensionality of the sub-buffer must be less than or equal to the original dimensionality and the extent cannot exceed the original extent.

  2. Introduce alpaka::slice for creating a sub-buffer. This is essentially an alias for 1.4

  3. Introduce alpaka::getHostView for accessing buffers on the host. This will generate an alpaka::view which can be used on the host. This is only possible for devices that support host-side memory access.

  4. Replace alpaka::getPtrNative with alpaka::require for passing (sub-)buffers as kernel parameters. This will generate an alpaka::view to global or constant memory which can be used inside the kernel.

  5. If the kernel requires shared memory, the user must pass alpaka::declareSharedView<TElem, TDim> as a parameter during kernel creation: alpaka::createTaskKernel(..., alpaka::declareSharedView<TElem, TDim>(TExtent const& extent)). This will generate an alpaka::view to shared memory which can be used inside the kernel.

  6. The current contents of alpaka/include/mem/view will be removed.

  7. The current contents of alpaka/include/block/shared will be removed. This includes a complete removal of static shared memory.

  8. Introduce an alpaka::view datatype which acts as memory access abstraction inside the kernel (and possibly on the host). This would look somewhat like this:

    
enum class viewMode
{
    ReadOnly,
    WriteOnly,
    ReadWrite
};

enum class viewTarget
{
    HostMemory,
    GlobalMemory,
    ConstantMemory,
    SharedMemory
};

template <typename TElem, typename TDim, typename TIdx, alpaka::viewMode mode, alpaka::viewTarget target>
class view
{
    using value_type = /* TElem or TElem const */;
    using reference = /* TElem& or TElem const& */;
    using pointer = /* Backend-defined */;          // and const_pointer
    using iterator = /* Backend-defined */;         // and const_iterator
    using reverse_iterator = /* Backend-defined */; // and const_reverse_iterator

    /* Constructors, copy / move operators, destructor */

    reference operator[](/* ... */);
    Vec get_extent();
    pointer get_pointer();    // if you really need a raw pointer
    auto get_byte_distance(); // = pitch

    /* more utility functions */
};
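For reference, here is the proposed overload set from item 1 written out as plain declarations. These signatures are illustrative only and do not exist in alpaka; they merely summarize the list above.

template<typename TElem, typename TIdx, typename TDev, typename TExtent>
auto makeBuf(TDev const& dev, TExtent const& extent);                               // 1.1

template<typename TElem, typename TDev, typename InputIterator>
auto makeBuf(TDev const& dev, InputIterator first, InputIterator last);             // 1.2

template<typename TElem, typename TIdx, typename TDev, typename TExtent>
auto makeBuf(TDev const& dev, TElem* ptr, TExtent const& extent, bool useHostPtr);  // 1.3

template<typename TElem, typename TIdx, typename TBuf, typename TOffset, typename TExtent>
auto makeBuf(TBuf otherBuf, TOffset const& offset, TExtent const& extent);          // 1.4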



What do you think?
bussmann commented 3 years ago

@bernhardmgruber, this is probably a critical interface design that needs careful discussion. LLAMA should be able to hook into this seamlessly, while Alpaka should work comfortably without the need for LLAMA here.

bernhardmgruber commented 3 years ago
  1. Replace alpaka::allocBuf with alpaka::makeBuf. makeBuf will then have the following overloads:

What is the rationale for renaming the function? I think allocBuf is a good name.

1.2 makeBuf<TElem, 1>(TDev const& dev, InputIterator first, InputIterator last) - create a buffer from a standard C++ container (or everything else that can provide an InputIterator).

That overload should just take a range. And I think it is worthwhile to distinguish between the iterator concepts input, forward, random access and contiguous (new in C++20) iterators.

1.3 makeBuf<TElem, TIdx>(TDev const& dev, TElem* ptr, TExtent const& extent, bool useHostPtr) - create a buffer from a host pointer. Back-ends supporting direct usage of host memory may set useHostPtr to true in which case there will be no device-side allocation. Instead, the pointer is used directly.

This overload is a special case and might deserve its own, differently named function, e.g. adoptBuf, because it does not allocate. But a more profound question: is this even supported across all backends? Can CUDA reuse memory for buffers allocated using the CRT, i.e. malloc? I know OpenCL has such a feature, so I assume SYCL has it as well.

1.4 makeBuf<TElem, TIdx>(TBuf otherBuf, TOffset const& offset, TExtent const& extent) - create a sub-buffer from an existing buffer on the same device. The sub-buffer will refer to the same (parts of) memory as the original buffer, thus no additional allocation is taking place. This also means that the dimensionality of the sub-buffer must be less than or equal to the original dimensionality and the extent cannot exceed the original extent.

I think we need to be careful to not reinvent the wheel. Some thought has been poured into viewing memory or parts of it. That's why C++17 got string_view and C++20 got span<T>. There is also the proposal for mdspan<T, ...>, which evolved from Kokkos views. These are good facilities and I would try to use them before duplicating their features on alpaka buffers.

  2. Introduce alpaka::slice for creating a sub-buffer. This is essentially an alias for 1.4

I think we only need one API to create a sub buffer. So either 1.4 or 2.

  3. Introduce alpaka::getHostView for accessing buffers on the host. This will generate an alpaka::view which can be used on the host. This is only possible for devices that support host-side memory access.

What is the benefit of this new API? I can call auto p = getPtrNative(buffer); and use p[i] just nicely. If the buffer is multidimensional, then I can call: auto view = std::mdspan<T, std::dynamic_extent, std::dynamic_extent>(getPtrNative(buffer), width, height); and then call view(x, y);. I know mdspan is not standardized yet, but there is a production ready implementation in the Kokkos repository.
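For illustration, the same idea with the since-standardized C++23 std::mdspan (the thread predates standardization). bufHost, width and height are assumed names, the buffer must live in host-accessible memory, and a pitched allocation would additionally need std::layout_stride:

#include <cstddef>
#include <mdspan>

void fillOnHost(auto& bufHost, std::size_t height, std::size_t width)
{
    float* ptr = alpaka::getPtrNative(bufHost);   // raw pointer to the buffer's storage
    auto view = std::mdspan(ptr, height, width);  // 2D dynamic-extent view, layout_right
    for(std::size_t y = 0; y < height; ++y)
        for(std::size_t x = 0; x < width; ++x)
            view[y, x] = static_cast<float>(y * width + x); // no manual index arithmetic
}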

  4. Replace alpaka::getPtrNative with alpaka::require for passing (sub-)buffers as kernel parameters. This will generate an alpaka::view to global or constant memory which can be used inside the kernel.

I think require is a bad name, because it is used in several other contexts. I think there is a property system being designed for executors that uses std::require so that name might be misleading in the future. So essentially, this replaces the raw pointers passed to alpaka kernels by buffers on the host side and views inside the kernel? I think we do not even need a function for this, because you can just pass alpaka buffers directly to createTaskKernel and the function translates those to the appropriate views for the kernel function. I still think however, that there is nothing wrong with having a pointer argument at the kernel function.

I think a first step might be to implement the automatic conversion of alpaka buffers on the host side into pointers at the kernel function interface inside createTaskKernel. That should get rid of getPtrNative for device buffers.
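A minimal sketch of that automatic translation, assuming a buffer-detection trait exists; the names isBufferV and transformArg are made up for illustration and are not alpaka API:

#include <type_traits>
#include <utility>

// placeholder trait; a real implementation would specialize this for alpaka buffer types
template<typename T>
inline constexpr bool isBufferV = false;

template<typename TArg>
decltype(auto) transformArg(TArg&& arg)
{
    if constexpr(isBufferV<std::decay_t<TArg>>)
        return alpaka::getPtrNative(arg);  // buffer -> raw device pointer
    else
        return std::forward<TArg>(arg);    // everything else passes through unchanged
}

// createTaskKernel would then forward transformArg(args)... into the task it builds.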

  5. If the kernel requires shared memory, the user must pass alpaka::declareSharedView<TElem, TDim> as a parameter during kernel creation: alpaka::createTaskKernel(..., alpaka::declareSharedView<TElem, TDim>(TExtent const& extent)). This will generate an alpaka::view to shared memory which can be used inside the kernel.

Does that fully replace shared variables of statically known size? I like alpaka's auto& sharedInt = declareSharedVar<int>();. I think the compiler might be able to optimize better, if the amount of shared memory needed is known at compile time.
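For reference, this is roughly how statically sized shared memory looks in an alpaka kernel today (a sketch; the compile-time array size and the __COUNTER__ id are the relevant parts, the kernel itself is made up):

struct TileKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, float const* in, float* out) const
    {
        // one statically sized tile per block, size known to the compiler
        auto& tile = alpaka::declareSharedVar<float[256], __COUNTER__>(acc);
        // ... fill tile, call alpaka::syncBlockThreads(acc), then consume it ...
    }
};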

  6. The current contents of alpaka/include/mem/view will be removed.
  7. The current contents of alpaka/include/block/shared will be removed. This includes a complete removal of static shared memory.

Be careful, static shared memory (that is with compile time known size) might offer better optimization opportunities. I would not fully get rid of this feature.

  8. Introduce an alpaka::view datatype which acts as memory access abstraction inside the kernel (and possibly on the host). This would look somewhat like this:

enum class viewMode
{
    ReadOnly,
    WriteOnly,
    ReadWrite
};

I know these modes from OpenCL. I guess they are in SYCL as well? Because we can express ReadOnly and ReadWrite easily with const: view<float, ...> vs. view<const float, ...>. That's also how std::span<T> is designed.
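That design in a nutshell (a sketch; ptr and n stand for an existing allocation and its element count):

#include <cstddef>
#include <span>

void accessModes(float* ptr, std::size_t n)
{
    std::span<float>       rw{ptr, n}; // read-write view over n floats
    std::span<float const> ro{ptr, n}; // read-only view: ro[0] = 1.0f would not compile
    rw[0] = ro[1];                     // reading through ro and writing through rw is fine
}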

enum class viewTarget { HostMemory, GlobalMemory, ConstantMemory, SharedMemory };

Let's light up the bomb: how does unified memory fit into this picture? I think unified memory is getting increasingly relevant. Also because it is the default model for nvcpp. This also affects alpaka buffers, because then they are no longer bound to a device.

template <typename TElem, typename TDim, typename TIdx, alpaka::viewMode mode, alpaka::viewTarget target>
class view
{
    using value_type = /* TElem or TElem const */;
    using reference = /* TElem& or TElem const& */;
    using pointer = /* Backend-defined */;          // and const_pointer
    using iterator = /* Backend-defined */;         // and const_iterator
    using reverse_iterator = /* Backend-defined */; // and const_reverse_iterator

    /* Constructors, copy / move operators, destructor */

    reference operator[](/* ... */);
    Vec get_extent();
    pointer get_pointer(); // if you really need a raw pointer

Yes, we really need the pointer! More on that later.

auto get_byte_distance(); // = pitch

I do not like the name. This function also only makes sense for 2D buffers. So maybe conditionally provide it? What about 3D buffers?

/* more utility functions */

};

Some general thoughts which we also discussed offline already: I think there are two concerns involved:

  1. providing storage: this is what buffers should be for. They allocate storage of a given size and own it.
  2. structure of data: accomplished by views. And there are a variety of them: std::span<T>, the proposed std::mdspan<T, ...>. Also interpreting the storage pointed to by a buffer as e.g. a float* or as a MyStruct* is a way of forcing structure on a region of memory. MallocMC interprets a region of storage as a heap and builds a complicated data structure within that.

There are various ways to implement 1 and 2. For 1 we usually have to deal with API functions like cudaMalloc, cudaMallocManaged, malloc, new, ::operator new(..., std::align_val_t{...}). For 2 we have span, mdspan and surprisingly reinterpret_cast. Historically, there are also mixtures of 1 and 2 like std::vector or Kokkos Views. These provide storage and govern the interpretation of the data.

Why does this matter? Because there needs to be an interface between 1 and 2. There needs to be a way to communicate storage created by 1 to a facility for interpretation 2. Surprise, surprise, the easiest such interface is a pointer and a length. And this is such a universal protocol, because if I can extract a pointer out of a buffer, I can wrap a span<float> over it, or interpret it as a 2D structured array with mdspan<MyStruct, dynamic_extent, dynamic_extent>(ptr, width, height);.

LLAMA goes one little step further, because it allows creating data structures across multiple buffers. But fundamentally, a LLAMA view is built on top of a statically sized array of storage regions. These storage regions are untyped, i.e. spans of std::byte, and the LLAMA view fully governs how that storage is interpreted. All LLAMA needs is operator[size_t] to work on that region of std::bytes.

Example:

void kernelFunc(std::byte* data, int width, int height) {
    auto mapping = ...; // configure the data structure
    llama::View view(mapping, {data});

    // access
    float v = view(x, y)(Tag1{}, Tag2{});
}
j-stephan commented 3 years ago

What is the rationale for renaming the function? I think allocBuf is a good name.

Because there is not necessarily an allocation taking place: Both sub-buffer creation and taking ownership of host memory don't involve any allocation.

That overload should just take a range.

You mean like (InputIterator start, size_t n)?

And I think it is worthwhile to distinguish between the iterator concepts input, forward, random access and contiguous (new in C++20) iterators.

What would be the benefit here? We don't really care about the original host container, this is just for buffer initialization.

But a more profound question: is this even supported across all backends? Can CUDA reuse memory for buffers allocated using the CRT, i.e. malloc? I know OpenCL has such a feature, so I assume SYCL has it as well.

SYCL can do this, as do the CPU backends. CUDA seems to be the exception here, unless the host pointer was allocated with cudaMallocManaged AFAIK. But it has been a while since I used CUDA, maybe @sbastrakov or @psychocoderHPC can chime in here.

I think we need to be careful to not reinvent the wheel. Some thought has been poured into viewing memory or parts of it. That's why C++17 got string_view and C++20 got span<T>. There is also the proposal for mdspan<T, ...>, which evolved from Kokkos views. These are good facilities and I would try to use them before duplicating their features on alpaka buffers.

I wasn't aware of mdspan. If it is likely to be standardized I see no issue with adopting this.

I think we only need one API to create a sub buffer. So either 1.4 or 2.

I agree. I'm leaning towards slice because of clearer intent.

What is the benefit of this new API? I can call auto p = getPtrNative(buffer); and use p[i] just nicely. If the buffer is multidimensional, then I can call: auto view = std::mdspan<T, std::dynamic_extent, std::dynamic_extent>(getPtrNative(buffer), width, height); and then call view(x, y);. I know mdspan is not standardized yet, but there is a production ready implementation in the Kokkos repository.

Again, I wasn't aware of mdspan. Maybe it would be a good idea to base our views on mdspan (or just use mdspan directly, I'm going to read the proposal after finishing this answer) and make getHostView and friends a convenience function? The mdspan ctor looks a little convoluted.

I think require is a bad name, because it is used in several other contexts. I think there is a property system being designed for executors that uses std::require so that name might be misleading in the future.

I'm open for alternative names. I was mainly basing this on SYCL accessors where require is also in use.

So essentially, this replaces the raw pointers passed to alpaka kernels by buffers on the host side and views inside the kernel? I think we do not even need a function for this, because you can just pass alpaka buffers directly to createTaskKernel and the function translates those to the appropriate views for the kernel function. I still think however, that there is nothing wrong with having a pointer argument at the kernel function.

I think a first step might be to implement the automatic conversion of alpaka buffers on the host side into pointers at the kernel function interface inside createTaskKernel. That should get rid of getPtrNative for device buffers.

I believe the interface is easier to use if we use require, createView or whatever name we come up with during kernel creation. This way the user immediately understands that he passes a buffer to kernel creation but will receive a view as kernel parameter. Regarding pointers as parameters: Well, the whole point of this issue is to make pointers obsolete ;-) I'm even leaning towards forbidding them completely (as parameters). If you need a pointer inside the kernel, extract it from the view.

Does that fully replace shared variables of statically known size? I like alpaka's auto& sharedInt = declareSharedVar<int>();. I think the compiler might be able to optimize better, if the amount of shared memory needed is known at compile time. Be careful, static shared memory (that is with compile time known size) might offer better optimization opportunities. I would not fully get rid of this feature.

Yes. Reason: It is impossible to implement this (in reasonable time) for the SYCL backend.

I know these modes from OpenCL. I guess they are in SYCL as well? Because we can express ReadOnly and ReadWrite easily with const: view<float, ...> vs. view<const float, ...>. That's also how std::span<T> is designed.

I like this.

Let's light up the bomb: how does unified memory fit into this picture? I think unified memory is getting increasingly relevant. Also because it is the default model for nvcpp. This also affects alpaka buffers, because then they are no longer bound to a device.

I'm not certain that we support unified memory in alpaka or plan to do so as this goes against our "everything explicit" policy.

auto get_byte_distance(); // = pitch

I do not like the name. This function also only makes sense for 2D buffers. So maybe conditionally provide it? What about 3D buffers?

It is the same for 3D buffers (since 3D buffers are just a stack of 2D buffers). Maybe use get_row_distance or something to make this clearer? For 0D and 1D buffers this would return 0. You usually only need this value to calculate offsets in n-d space with n > 1.
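To make the role of that value concrete, here is a sketch of the offset computation it enables; the names are illustrative and rowPitchBytes means bytes per row including any padding:

#include <cstddef>

inline float* elemPtr(std::byte* base, std::size_t rowPitchBytes, std::size_t height,
                      std::size_t z, std::size_t y, std::size_t x)
{
    std::byte* row = base + (z * height + y) * rowPitchBytes; // skip whole, possibly padded rows
    return reinterpret_cast<float*>(row) + x;                 // index inside the contiguous row
}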

Some general thoughts which we also discussed offline already: I think there are two concerns involved:

1. providing storage: this is what buffers should be for. They allocate storage of a given size and own it.

2. structure of data: accomplished by views. And there are a variety of them: `std::span<T>`, the proposed `std::mdspan<T, ...>`. Also interpreting the storage pointed to by a buffer as e.g. a `float*` or as a `MyStruct*` is a way of forcing structure on a region of memory. MallocMC interprets a region of storage as a heap and builds a complicated data structure within that.

There are various ways to implement 1 and 2. For 1 we usually have to deal with API functions like cudaMalloc, cudaMallocManaged, malloc, new, ::operator new(..., std::align_val_t{...}). For 2 we have span, mdspan and surprisingly reinterpret_cast. Historically, there are also mixtures of 1 and 2 like std::vector or Kokkos Views. These provide storage and govern the interpretation of the data.

Why does this matter? Because there needs to be an interface between 1 and 2. There needs to be a way to communicate storage created by 1 to a facility for interpretation 2. Surprise, surprise, the easiest such interface is a pointer and a length. And this is such a universal protocol, because if I can extract a pointer out of a buffer, I can wrap a span<float> over it, or interpret it as a 2D structured array with mdspan<MyStruct, dynamic_extent, dynamic_extent>(ptr, width, height);.

As discussed offline: The pointer interface only works easily if the chunk of raw memory is actually contiguous. This assumption fails as soon as 2D/3D memory on GPUs is involved (which is why we need the row distance / pitch). Now you can also introduce FPGAs where you can reconfigure your elements (1,2,3,4,5,6,7,8) to live in four different memory blocks in the order of (1,3) (2,4) (5,7) (6,8). *(var + 1) will likely give you unexpected results.

LLAMA goes one little step further, because it allows creating data structures across multiple buffers. But fundamentally, a LLAMA view is built on top of a statically sized array of storage regions. These storage regions are untyped, i.e. spans of std::byte, and the LLAMA view fully governs how that storage is interpreted. All LLAMA needs is operator[size_t] to work on that region of std::bytes.

Example:

void kernelFunc(std::byte* data, int width, int height) {
    auto mapping = ...; // configure the data structure
    llama::View view(mapping, {data});

    // access
    float v = view(x, y)(Tag1{}, Tag2{});
}

This looks very nice and I definitely see a common meta-language here we need to flesh out.

bernhardmgruber commented 3 years ago

That overload should just take a range.

You mean like (InputIterator start, size_t n)?

No. I mean makeBuf<TElem, 1>(TDev const& dev, const Range& range). You can then call begin(range)/end(range) inside the function.
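A sketch of what such a range overload might do internally, assuming it falls back to the existing allocBuf plus a host-side element copy; makeBuf itself is only a proposal, and a device that is not host-accessible would need a staging host buffer plus alpaka::memcpy instead of the direct std::copy shown here:

#include <algorithm>
#include <cstddef>
#include <iterator>

template<typename TElem, typename TDev, typename TRange>
auto makeBuf(TDev const& dev, TRange const& range)
{
    using std::begin;
    using std::end;
    auto const n = static_cast<std::size_t>(std::distance(begin(range), end(range)));
    auto buf = alpaka::allocBuf<TElem, std::size_t>(dev, n);        // existing alpaka allocation
    std::copy(begin(range), end(range), alpaka::getPtrNative(buf)); // only valid for host-accessible memory
    return buf;
}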

And I think it is worthwhile do distinguish between the iterator concepts input, forward, random access and contiguous (new in C++20) iterators.

What would be the benefit here? We don't really care about the original host container, this is just for buffer initialization.

Well, you are partially right. What matters is whether you need to copy element-wise or whether you can just bulk copy the bits. I think we should just ignore the iterator concept and default to std::copy, which does the right thing.

But a more profound question: is this even supported across all backends? Can CUDA reuse memory for buffers allocated using the CRT, i.e. malloc? I know OpenCL has such a feature, so I assume SYCL has it as well.

SYCL can do this, as do the CPU backends. CUDA seems to be the exception here, unless the host pointer was allocated with cudaMallocManaged AFAIK. But it has been a while since I used CUDA, maybe @sbastrakov or @psychocoderHPC can chime in here.

So if I interpret this correctly, an alpaka program that uses host pointer adoption can either not be run using CUDA or needs to do an explicit copy of the host pointer's memory. The host pointer is even more complicated because you can create a buffer from the host pointer and then later write to the memory using the host pointer. So we cannot implement a clear ownership transfer. Furthermore, host pointer adoption only meaningfully works for buffers of the host device, doesn't it?

Honestly, I think we should skip the host pointer version for now. If I want to initialize my buffer from an existing memory region, I can just call overload 1.2 with the iterators/range.

I wasn't aware of mdspan. If it is likely to be standardized I see no issue with adopting this.

Have a look, it might influence your design. But it is probably not the full solution if your view still needs to govern address spaces.

I believe the interface is easier to use if we use require, createView or whatever name we come up with during kernel creation. This way the user immediately understands that he passes a buffer to kernel creation but will receive a view as kernel parameter.

The interface is definitely more bloated. This is what I am afraid of. Here is the vectorAdd alpaka example:

Now:

auto const taskKernel(alpaka::createTaskKernel<Acc>(
        workDiv,
        kernel,
        alpaka::getPtrNative(bufAccA),
        alpaka::getPtrNative(bufAccB),
        alpaka::getPtrNative(bufAccC),
        numElements));

With your require:

auto const taskKernel(alpaka::createTaskKernel<Acc>(
        workDiv,
        kernel,
        alpaka::require(bufAccA),
        alpaka::require(bufAccB),
        alpaka::require(bufAccC),
        numElements));

With my proposed implicit recognition of buffers:

auto const taskKernel(alpaka::createTaskKernel<Acc>(
        workDiv,
        kernel,
        bufAccA,
        bufAccB,
        bufAccC,
        numElements));

Regarding comprehensibility: OpenCL has cl::Buffers on the host side and pointers at the kernel interface. That usually does not confuse people ;)

Regarding pointers as parameters: Well, the whole point of this issue is to make pointers obsolete ;-) I'm even leaning towards forbidding them completely (as parameters). If you need a pointer inside the kernel, extract it from the view.

That is an opinion and I am of the opposite one, but not strongly. But I wonder, since some of the motivation is coming from the SYCL backend: would it not help to get rid of the pointers just on the host side? Just drop getPtrNative for device buffers. In the SYCL backend you would then get the pointers from the accessors right before calling the kernel entry function. Wouldn't that be enough?

Let's light up the bomb: how does unified memory fit into this picture? I think unified memory is getting increasingly relevant. Also because it is the default model for nvcpp. This also affects alpaka buffers, because then they are no longer bound to a device.

I'm not certain that we support unified memory in alpaka or plan to do so as this goes against our "everything explicit" policy.

Unified memory has different performance characteristics. It can be much slower or much faster than the traditional device side buffers, depending on how much of the memory is touched by a kernel. So it is not a question of everything explicit or implicit. It is a question of whether alpaka wants to support that. But if we are going to redesign how buffers work, we should at least think about this question and if and how we want to address unified memory.

auto get_byte_distance(); // = pitch

I do not like the name. This function also only makes sense for 2D buffers. So maybe conditionally provide it? What about 3D buffers? It is the same for 3D buffers (since 3D buffers are just a stack of 2D buffers). Maybe use get_row_distance or something to make this clearer? For 0D and 1D buffers this would return 0. You usually only need this value to calculate offsets in n-d space with n > 1.

But doesn't a 3D buffer have 2 pitches? I think I can live with a pitch of 0 for 1D buffers. 0D buffers probably do not occur that often ;)

As discussed offline: The pointer interface only works easily if the chunk of raw memory is actually contiguous. This assumption fails as soon as 2D/3D memory on GPUs is involved (which is why we need the row distance / pitch).

AFAIK 2D/3D GPU buffers are still contiguous. They can just contain additional padding. So a pointer is still fine ;)

Now you can also introduce FPGAs where you can reconfigure your elements (1,2,3,4,5,6,7,8) to live in four different memory blocks in the order of (1,3) (2,4) (5,7) (6,8). *(var + 1) will likely give you unexpected results.

If *(var + 1) does not work for a T* var into a buffer of Ts, then yes, we are in big trouble. In this case having your view with operator[] is probably the safer way to go. But FPGAs are a niche IMO, so I would not want to sacrifice an easy API for a niche use case. So maybe my wish is: make those views easy and feel like a T* :)

Example:

void kernelFunc(std::byte* data, int width, int height) {
    auto mapping = ...; // configure the data structure
    llama::View view(mapping, {data});

    // access
    float v = view(x, y)(Tag1{}, Tag2{});
}

This looks very nice and I definitely see a common meta-language here we need to flesh out.

Thinking about it, LLAMA could probably also just work with SYCL accessors:

void kernelFunc(sycl::accessor<std::byte, 1, sycl::access::mode::read_write, sycl::access::target::global> data, int width, int height) {
     auto mapping = ...; // configure the data structure
     llama::View view(mapping, {data});

     // access
     float v = view(x, y)(Tag1{}, Tag2{});
}

I see no reason, why that should not compile or at least be easy to get compiling.

bussmann commented 3 years ago

It's great to see this discussion. Be aware of concepts like non-contiguous representation of data in memory and implicit concepts like unified memory. It's a jungle out there. Keep going, you're doing great!

sbastrakov commented 3 years ago

My two cents on the matter.

  1. I think it is really important that we agree what we mean by "buffer" as an English word for this discussion, regardless of proposed or existing alpaka implementation. For me it is natural to think that buffer owns memory, and so it does currently in alpaka. However from the discussion above it seems there are different opinions on this: e.g. proposed implicit conversion from pointers makes no sense for owning buffers (as then it implicitly transfers or even multiplies ownership); nor does creating subbuffers from existing buffers, as those subbuffers could not own memory (this was discussed on mattermost, perhaps this operation can simply return a view and there is no fundamental issue there).

  2. I believe everyone agrees that "view" / "span" is not owning. I would also suggest that alpaka API keeps raw pointers non-owning as is currently and typical in modern C++.

  3. I also like the design of std::span as a simple view on 1D data giving a pointer plus a compile- or run-time size. Existing alpaka kernels for 1D data actually mostly operate on this level, just passing the two separately. And something like mdspan as a multidimensional version of it.

  4. I do not understand the part about shared memory. Is the idea that a kernel implementor creates a buffer on the device side and via some API says it's for shared memory?

j-stephan commented 3 years ago

@bernhardmgruber and I just had a VC where we also addressed this issue. A short summary:

Regarding @sbastrakov's points:

I think it is really important that we agree what we mean by "buffer" as an English word for this discussion

We also talked about that. I agree that sub-buffers are a confusing term in this sense. @bernhardmgruber proposed that we do slicing, subviews and so on exclusively on views and not on buffers to make this distinction clearer.

I do not understand the part about shared memory. Is the idea that a kernel implementor creates a buffer on the device side and via some API says it's for shared memory?

My idea was that we remove static shared memory completely and just rely on dynamic shared memory. But I agree with @bernhardmgruber's objection that this would remove a lot of convenience from alpaka.

bernhardmgruber commented 3 years ago

Thank you @j-stephan for the good summary.

  1. I think it is really important that we agree what we mean by "buffer" as an English word for this discussion, regardless of proposed or existing alpaka implementation. For me it is natural to think that buffer owns memory ...

I fully agree. A buffer owns a region of memory with a given size. I wanted to go even further and require it to be contiguous, but Jan told me that for FPGAs this might not be the case.

However from the discussion above it seems there are different opinions on this: e.g. proposed implicit conversion from pointers makes no sense for owning buffers (as then it implicitly transfers or even multiplies ownership);

You are right. The use of existing storage to create a buffer does violate the above meaning of a buffer. However, there are APIs that allow that. OpenCL has clCreateBuffer with CL_MEM_USE_HOST_PTR, which allows creating a buffer over an existing memory region (https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html). It appears SYCL supports this as well. I think this is mostly because these APIs want to allow the usage of existing "buffers" allocated outside their APIs under the same interface as API allocated buffers. That is, an OpenCL buffer object should be fully usable independently of whether it was allocated by the OpenCL runtime or wraps a pointer into a std::vector that already resides somewhere in RAM.

I suggested to Jan to name this functionality differently, e.g. alpaka::adoptBuf(T*). And backends not supporting using existing memory directly would do the copy (e.g. CUDA). We could also skip the feature for now because it is more exotic.

... nor does creating subbuffers from existing buffers, as those subbuffers could not own memory

I agree as well. And as Jan said, I think we should allow slicing only on views into buffers, so they always stay non-owning.

BenjaminW3 commented 3 years ago

I only skimmed through the thread, so I am not sure if I got everything, but here are my notes on this topic which explain the current state.

alpaka has two concepts for memory: Views and Buffers. As always in alpaka, there is no single implementation for those concepts, but trait specializations which make some specific implementations adapt to those concepts.

A View is the basic concept which supports DimType, ElemType, getPtrNative, getExtentVec, getPitchBytes, memcpy, memset and getDev. The View concept is an abstract mdspan. It has a compile-time dimensionality and element type and has getters for pointer, extent and pitches. The only thing that mdspan does not have is a way to get where the memory lives, which alpaka currently handles via getDev. alpaka already has such adaptions for std::vector and std::array so you can use them directly anywhere where a View is expected in alpaka. It should be really easy to add trait specializations for std::span and mdspan to implement the View concept. For the combination of existing plain pointers + size there is the ViewPlainPtr object. There is a ViewSubView object which can slice views arbitrarily (also slices of slices of slices, etc).
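As an illustration of that trait-based adaption, here is roughly what teaching std::span the View concept might look like. The exact trait names and namespaces differ between alpaka versions, so the details below are assumptions rather than working code against a specific release:

#include <cstddef>
#include <span>

namespace alpaka::traits // assumed namespace; newer versions use alpaka::trait
{
    template<typename T, std::size_t N>
    struct GetPtrNative<std::span<T, N>> // assumed primary template GetPtrNative<TView>
    {
        static auto getPtrNative(std::span<T, N> const& view) { return view.data(); }
    };
    // DimType, ElemType, GetExtent, GetPitchBytes and GetDev (returning the host device)
    // would need analogous specializations.
}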

The Buffer concept extends the View concept so all Buffers are also always Views. The main additional feature of a buffer is that it allocates the memory. Furthermore the buffer can be mapped, unmapped, pinned and unpinned. There are specific buffers for specific accelerators, allocating memory in a specific way for a specific device. What is still missing is support for managed memory which is automatically mapped into the memory space of a device when it is accessed. I am not sure yet how to express this with the current getDev mechanism. Maybe we would need super-devices which represent multiple devices at the same time.

All in all my opinion is that most of the things requested in the numbered list at the top are already there (except renaming) but in a more generic/abstract way. Enforcing a specific implementation of a View like alpaka::view/alpaka::slice, mdspan or similar is against the concept of alpaka where we define concepts and make implementations adapt to them via trait specializations.

What I originally wanted to document in this ticket is the need I saw for some better memory abstraction in the kernel where it is accessed:

  • I do not want to pass raw pointers into the kernel
  • The kernel code should look identical irrespective of if the memory is accessed column-major or row-major or other similar performance relevant specifics

sbastrakov commented 3 years ago

@bernhardmgruber to clarify my point about conversion of pointers to buffers and vice versa. I am not against that in principle, and this operation sometimes makes sense indeed. I was merely against doing so implicitly and thus causing uncertainty and errors regarding who owns the data. Having an explicit constructor or API function to do so is no problem with me as long as it's consistent with the meaning we (will) put on buffers, pointers, views.

bernhardmgruber commented 3 years ago

Thank you for the explanation @BenjaminW3!

The Buffer concept extends the View concept so all Buffers are also always Views.

That is a design decision which I might not have made. So this means a concrete buffer implementation is also a concrete view?

So if we change the requirement so that kernel arguments must be views now instead of plain pointers, this means we need to pass the alpaka buffers directly into the kernel function? This sounds pretty mad to me:

auto buffer = alpaka::allocBuf<float>(dev, count); // I forgot the correct args, sorry
auto taskKernel = alpaka::createTaskKernel<Acc>(workDiv, kernel, buffer, count); // pass buffer
...
void kernelFunc(alpaka::buffer<float>& data, int count) { // receive view (which is buffer)
    ...
    float v = data(i)(Tag1{}, Tag2{});
}

I think we had this case in some unit test at some point and that caused issues with VS 2019 and @psychocoderHPC and me decided to not allow alpaka buffers inside kernels. The type passed into the kernel function needs to be a more "lightweight" type.

I think we might talk about a different type of view here. One of the motivations for this different view type stems from the need for address space qualifiers in SYCL. So we will have an alpaka::view<float, constant> and an alpaka::view<float, global>, which are different things and only have meaning inside the kernel. Thus std::vector or std::array can never be such views, because they cannot carry this additional qualification on their storage.

Furthermore, I think this view type could be a single type provided by alpaka. Because now we also use the same type for all backends at the kernel interface, which is a T*. That would change to a universal alpaka::view<T>.

Maybe we should change our naming and call this type of view really just accessor, the same as in SYCL?

  • I do not want to pass raw pointers into the kernel

I think @j-stephan has the same goal here. But also potentially adding the semantic of address space qualifiers.

  • The kernel code should look identical irrespective of if the memory is accessed column-major or row-major or other similar performance relevant specifics

This is solved by LLAMA already, although I did not yet promote this library for this use case.

bernhardmgruber commented 3 years ago

to clarify my point about conversion of pointers to buffers and vice versa. I am not against that in principle, and this operation sometimes makes sense indeed. I was merely against doing so implicitly and thus causing uncertainty and errors regarding who owns the data. Having an explicit constructor or API function to do so is no problem with me as long as it's consistent with the meaning we (will) put on buffers, pointers, views.

Let me clarify as well: I wanted to implicitly convert an alpaka buffer passed to alpaka::createTaskKernel into a T* at the kernel entry function. So I would like this to work:

auto buffer = alpaka::allocBuf<float>(dev, count);
auto taskKernel = alpaka::createTaskKernel<Acc>(workDiv, kernel, buffer, count); // pass buffer
...
void kernelFunc(T* data, int count) { // receive ptr
    ...
}

Nothing else will work implicitly. This should NOT work:

auto buffer = alpaka::allocBuf<float>(dev, count);
T* data = buffer; // madness

The other way around, we have the new functionality proposed by @j-stephan:

T* p = ...; // existing data
auto buffer = alpaka::allocBuf<float>(dev, count, p); // 1. copies data from p
auto buffer = alpaka::adoptBuf<float>(dev, count, p); // 2. uses p's existing storage

Feature 1 is reasonable I think. Feature 2 is inspired by SYCL's ability to reuse host memory. I would skip this feature for now.

BenjaminW3 commented 3 years ago

@psychocoderHPC and me decided to not allow alpaka buffers inside kernels. The type passed into the kernel function needs to be a more "lightweight" type.

Yes, we may not want to copy alpaka Buffers into a kernel. The buffer owns the memory and we can not transfer or share ownership into/with a kernel. Currently we call getPtrNative to convert the Buffer/View into something that can be passed into a kernel. Alpaka does everything explicitly, but doing such a conversion implicitly should also be possible. However, the type that is used to make memory accessible in a kernel should not be a plain pointer.

The question is what the type and name of a lightweight view that is used to access memory within a kernel is. Accessor sounds good to me. It should be easy to write a trait which converts an arbitrary View into an accessor.

psychocoderHPC commented 3 years ago

I was not able to follow the full discussion but I will try to read it all soon.

I would like to point everyone to the Mephisto buffers. The device-side buffer representation stores the metadata in a metadata section right before the main data in main memory. This allows having all expensive data available on device (if needed) but reduces the object size to a minimum.

sbastrakov commented 3 years ago

As far as I understood it, that linked Mephisto "device buffer" looks more or less like a std::span, adapted for that use case.

SimeonEhrig commented 3 years ago

Found an example that surprised me and is related to the issue:

// hostMem is 3-dimensional with the sizes (1, 1, n)
TRed* hostNative = alpaka::mem::view::getPtrNative(hostMem);
for(Idx i = 0; i < n; ++i)
{
    // std::cout << i << "\n";
    hostNative[i] = static_cast<TRed>(i + 1);
}

We have a 1-dimensional access to a 3D memory. The official example is similar:

// hostBuffer is 3 dimensional
Data* const pHostBuffer = alpaka::getPtrNative(hostBuffer);

// This pointer can be used to directly write
// some values into the buffer memory.
// Mind, that only a host can write on host memory.
// The same holds true for device memory.
for(Idx i(0); i < extents.prod(); ++i)
{
    pHostBuffer[i] = static_cast<Data>(i);
}

This only works because we expect a certain memory layout. But it could also be possible that the data contains a pitch. Then we don't have a memory violation, but we have data in the wrong place and the result will be wrong. I know there is a function to get the pitch, but as a user I don't want to handle that. I would rather use an access operator like view(z, y, x) and not worry about padding or anything like that. I would like to go the STL way: use the normal functionality of the container and only use the get() function when needed, accepting that it gets ugly.

sbastrakov commented 3 years ago

I agree, such code samples assume linearized storage without pitches. This is currently true, but maybe we don't want to rely on it.

psychocoderHPC commented 3 years ago

Thanks for starting the discussion about the memory topic! :+1:

1.3 `makeBuf<TElem, TIdx>(TDev const& dev, TElem* ptr, TExtent const& extent, bool useHostPtr)` - create a buffer from a host pointer. Back-ends supporting direct usage of host memory may set `useHostPtr` to `true` in which case there will be no device-side allocation. Instead, the pointer is used directly.

A true or false in the factory points to missing policies. IMO the memory space, i.e. the location of the memory, is missing; we need a way to describe it. The device alone mostly does not say enough about it. The extent can be multidimensional but the pointer is sequential memory, so IMO we need additional attributes to define how we can iterate over the memory, or the pitch.

1.4 makeBuf<TElem, TIdx>(TBuf otherBuf, TOffset const& offset, TExtent const& extent) - create a sub-buffer from an existing buffer on the same device. The sub-buffer will refer to the same (parts of) memory as the original buffer, thus no additional allocation is taking place. This also means that the dimensionality of the sub-buffer must be less than or equal to the original dimensionality and the extent cannot exceed the original extent.

This is a window and should be clearly identifiable as such. Suggestion: makeWindow<>. The reason why I do not call it a view is that it guarantees that each row is contiguous memory.

  2. Introduce alpaka::slice for creating a sub-buffer. This is essentially an alias for 1.4

A slice is more complex than 1.4: you could describe, for example, that only every second element is selected.

  3. Introduce alpaka::getHostView for accessing buffers on the host. This will generate an alpaka::view which can be used on the host. This is only possible for devices that support host-side memory access.

Currently, we have a very relaxed concept of "host" in alpaka: "everything which is not bound to a device or accelerator". To me it does not feel correct to have something like that. IMO, if we define devices and pin down how memory is connected to devices, platforms, ..., you would always have a device you operate on, even if it is given implicitly (what is currently our "HOST" device). I think that allowing memory to be used without first setting the properties of where it lives and how to use it is not the best way, and it leads to the requirement of "HOST" interfaces just to somehow name this inconsistent interface.

  4. Replace alpaka::getPtrNative with alpaka::require for passing (sub-)buffers as kernel parameters. This will generate an alpaka::view to global or constant memory which can be used inside the kernel.

For me the question is: do we want to hand some kind of view/buffer to the device and create the "iterator" later on the device, or do we always want to pass an iterator to the device? Even though we take the second approach in all our projects, I think passing a lightweight buffer/view to the device and creating the iterator on the device can have a lot of benefits. Creating a device object on the device can handle device specifics much better, e.g. by using macros like __CUDA_ARCH__ and other feature macros during creation.

  7. The current contents of alpaka/include/block/shared will be removed. This includes a complete removal of static shared memory.

What do you have in mind for how shared memory can be created on the device if it is removed? Please keep in mind that CUDA dynamic shared memory is not equivalent to "static" shared memory. For "static" shared memory the compiler can, during compilation, use knowledge about the occupancy of the target device based on the shared memory usage. This will affect the register usage. I am not saying that creating shared memory everywhere in the program flow must be allowed, but static shared memory should not be removed.

psychocoderHPC commented 3 years ago

enum class viewTarget { HostMemory, GlobalMemory, ConstantMemory, SharedMemory };

Let's light up the bomb: how does unified memory fit into this picture? I think unified memory is getting increasingly relevant. Also because it is the default model for nvcpp. This also affects alpaka buffers, because then they are no longer bound to a device.

Even unified memory belongs to a device, or has a location, but can be accessed from multiple devices. So we still need a way to describe the ownership, and that is, I think, what you want to point out: the visibility, location, ...

psychocoderHPC commented 3 years ago

Let's light up the bomb: how does unified memory fit into this picture? I think unified memory is getting increasingly relevant. Also because it is the default model for nvcpp. This also affects alpaka buffers, because then they are no longer bound to a device.

I'm not certain that we support unified memory in alpaka or plan to do so as this goes against our "everything explicit" policy.

IMO this should get a high priority. Unified memory simplifies programming a lot and gives you benefits such as oversubscribing memory, zero memory copies, ...

psychocoderHPC commented 3 years ago
auto get_byte_distance(); // = pitch

I do not like the name. This function also only makes sense for 2D buffers. So maybe conditionally provide it? What about 3D buffers?

I think pitch is a very common name, but maybe I am CUDA-branded. Suggestion: row_stride.

psychocoderHPC commented 3 years ago
auto const taskKernel(alpaka::createTaskKernel<Acc>(
        workDiv,
        kernel,
        alpaka::require(bufAccA),
        alpaka::require(bufAccB),
        alpaka::require(bufAccC),
        numElements));

With my proposed implicit recognition of buffers:

auto const taskKernel(alpaka::createTaskKernel<Acc>(
        workDiv,
        kernel,
        bufAccA,
        bufAccB,
        bufAccC,
        numElements));


alpaka::require(bufAccA) has the advantage that we can explicitly create a lightweight object for the device. Passing only the buffer bufAccA would require that we check each object passed to a kernel during kernel start, identify it as a buffer, and transform it into a lightweight buffer representation or iterator. I would point here to alpaka's concept of always being explicit. If you have alpaka::require(bufAccA) you can always write a way to start your kernel without any explicit calls. The other way around is not possible. The disadvantage of being explicit is that the code is much longer and requires developers to build the implicit interface on top of the explicit interfaces.

psychocoderHPC commented 3 years ago

Let me clarify as well: I wanted to implicitly convert an alpaka buffer passed to alpaka::createTaskKernel into a T* at the kernel entry function. So I would like this to work:

auto buffer = alpaka::allocBuf<float>(dev, count);
auto taskKernel = alpaka::createTaskKernel<Acc>(workDiv, kernel, buffer, count); // pass buffer
...
void kernelFunc(T* data, int count) { // receive ptr
    ...
}

The disadvantage of pointers is that you lose meta/policy information about the memory. This means that writing a fast deep-copy method or using CUDA features, e.g. async memcpy to shared memory, will be hard to integrate, because alpaka cannot be sure that the pointer refers to global memory. If we always use some kind of iterator/wrapper object, we can better optimize memory copy operations in the future based on iterator and device knowledge. As soon as we use pointers it is very hard to write performance-portable code.

j-stephan commented 3 years ago

Yesterday we had a longer video conference about this issue. A few points from my notes:

Current shortcomings

Ideas

Concerning llama

Next steps

Feel free to expand on this if I forgot something.

bernhardmgruber commented 3 years ago

Regarding LLAMA interop, I formulated that into a concept: https://github.com/alpaka-group/llama/blob/develop/include/llama/Concepts.hpp#L24

For now I just require the type used for storage by a LLAMA view to be bytewise addressable. This is fulfilled by an alpaka buffer containing std::bytes or unsigned chars or an array of these types allocated via alpaka::declareSharedVar.
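The shape of that requirement, written out as a C++20 concept for illustration; this is not the actual definition from llama/Concepts.hpp, which may constrain the byte type more strictly:

#include <cstddef>

template<typename Blob>
concept BytewiseAddressable = requires(Blob b, std::size_t i)
{
    b[i]; // operator[](std::size_t) yielding a byte-like reference (std::byte or unsigned char)
};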

bussmann commented 3 years ago

We need to discuss this further! I think you are making a big domain error here! The hardware has an inherent concept of what memory looks like (usually like a continuous 1D array). But this concept might vary. Alpaka has an N-D index domain for trends. It is appealing to bring both tightly together. But this is not a good idea for the future. Don't confuse the memory concept of a hardware with the memory representation of a data type. Likewise, don't confuse the memory layout of a datatype with its user side layout. Finally, always remember that algorithms provide (and somewhat represent) access patterns to data types!

bernhardmgruber commented 3 years ago

@bussmann from your answer I can separate these concerns:

The hardware has an inherent concept of what memory looks like (usually like a continuous 1D array). But this concept might vary.

So we want to represent a piece of hardware memory. I think this is what an alpaka buffer should be. We could debate, if a buffer already has a dimensionality. And I would argue that it should not. IMO a buffer is a 1D byte storage, independent if you want to treat it as a 2D float array or put a complex heap into it like mallocMC. So maybe we should actually get rid of the dimension of a buffer. The problem is however that some vendor APIs mix this up. There is e.g. cudaMallocPitch for allocating a 2D buffer. So it mixes storage with interpretation.
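For reference, the CUDA call in question (real API); it allocates height rows of at least width * sizeof(float) bytes each, may pad every row, and reports the actual row size through the pitch out-parameter. width and height are assumed element counts:

#include <cstddef>
#include <cuda_runtime.h>

void allocPitched2D(std::size_t width, std::size_t height)
{
    float* devPtr = nullptr;
    std::size_t pitchBytes = 0;
    cudaMallocPitch(reinterpret_cast<void**>(&devPtr), &pitchBytes,
                    width * sizeof(float), height); // rows spaced pitchBytes >= width*sizeof(float) apart
    cudaFree(devPtr);
}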

I know you also mentioned to me more extraordinary types of computers which may have completely different memory architectures. But then C++ might no longer be the right language for these architectures. The C++ memory model specifies, that: "The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address" (http://eel.is/c++draft/intro.memory#1). So there will always be a sequence of bytes to which we can point, because it is addressable. We could debate whether an alpaka buffer can consist of multiple of these sequences. But since I need a pointer/length pair for each of the sequences, I think each sequence should just be a separate buffer. So an alpaka buffer can assume 1D continuous storage of bytes.

Alpaka has an N-D index domain for trends. It is appealing to bring both tightly together. But this is not a good idea for the future. Don't confuse the memory concept of a hardware with the memory representation of a data type. Likewise, don't confuse the memory layout of a datatype with its user side layout.

Assuming "trends"=threads: Indeed, this is appealing. If I have a 2D index space of threads because I want to describe an image algorithm, I would of course want my image stored in memory to be accessible via 2D indices (user side layout). So I want to bring my thread index domain and my data structure index domain together. This does not mean that my data structure needs to be a 2D array in memory (memory representation). But it should be accessible like one. I think this is what we are trying to get with views/accessors. They map my logical indexing to the bytes of a buffer.

What I do not understand yet is what you mean by the "memory concept of a hardware". Do you mean the physical organization into memory banks or different memory types (texture, shared, constant, global, plain old RAM, etc.)?

Finally, always remember that algorithms provide (and somewhat represent) access patterns to data types!

And that is something interesting and we did not cover that at all yet! Fortunately, we have a PhD student looking into this ;)

j-stephan commented 3 years ago

We will implement this for alpaka 0.7.0.

Assigned to @j-stephan.

j-stephan commented 3 years ago

Assigning to @bernhardmgruber as discussed in today's VC.

j-stephan commented 2 years ago

We now have accessors inside alpaka::experimental. The PR moving accessors into the regular namespace should link to and close this issue.

j-stephan commented 2 years ago

One thing that accessors are unsuitable for are data structures like linked lists or trees (thanks to @fwyzard for mentioning this). For these we should probably keep pointers unless we want to point everyone to LLAMA. @bernhardmgruber What are your thoughts on this matter?

bernhardmgruber commented 2 years ago

Keep pointers, they are a powerful escape hatch when a view to an array does not cut it. I have seen a bit of device pointer arithmetic on the host before passing the pointer to a kernel in a different project. That would not at all be possible with accessors.

j-stephan commented 1 year ago

We are going to keep pointers for non-contiguous data and we have std::mdspan support for people who don't want to deal directly with pointers to buffers. I think we can close this issue. Feel free to reopen if you believe otherwise.

bernhardmgruber commented 1 year ago

Just for completeness: the accessors implemented based on the discussion in this thread have been removed again by #2054. std::mdspan should cover all these use cases.