alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Memory abstraction tags #1362

Open SimeonEhrig opened 3 years ago

SimeonEhrig commented 3 years ago

At the moment I am developing a prototype for a lazily evaluated linear algebra library based on alpaka and vikunja: https://github.com/SimeonEhrig/lazyVikunjaVector

One of the design decisions is that I use mathematical objects like vectors and matrices with overloaded operators to express the mathematics.

Vector<int, 5> v1;
Vector<int, 5> v2;

Vector result = eval(v1 + v2);

The vector also contains the data. Therefore, the vector needs to know on which device it is located, e.g. CPU 0 or CUDA GPU 1. This information is used for memory allocation and for preventing operations between vectors on different devices, e.g. adding a vector located on CPU 0 to a vector located on CUDA GPU 1 (it's just a design decision to keep the library "simple").

At the moment I use alpaka::AccType<TDim, TIdx> and the device id to decide the memory owner of the mathematical object (e.g. a vector), but @psychocoderHPC pointed out that the memory is not bound to the parallelization strategy. For example, memory allocated with alpaka::AccCpuSerial<Dim, std::size_t> can also be used in a kernel that is executed with the parallelization strategy alpaka::AccCpuOmp2Blocks<Dim, std::size_t>, because both use the same allocator.
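
A minimal sketch of that point (assuming the alpaka API of that time, e.g. getDevByIdx and allocBuf; this is not code from the thread): both CPU accelerators resolve to the same device type, so a buffer allocated via the "serial" accelerator's device is equally usable by a kernel run with the OpenMP 2 blocks accelerator.

// Sketch only - assumed alpaka API, requires the corresponding CPU backends to be enabled.
#include <alpaka/alpaka.hpp>
#include <cstddef>
#include <type_traits>

using Dim = alpaka::DimInt<1u>;
using Idx = std::size_t;
using AccSerial = alpaka::AccCpuSerial<Dim, Idx>;
using AccOmp = alpaka::AccCpuOmp2Blocks<Dim, Idx>;

static_assert(
    std::is_same_v<alpaka::Dev<AccSerial>, alpaka::Dev<AccOmp>>,
    "both parallelization strategies live on the same device type (DevCpu)");

int main()
{
    auto const dev = alpaka::getDevByIdx<AccSerial>(0u);                   // a DevCpu
    auto buf = alpaka::allocBuf<int, Idx>(dev, alpaka::Vec<Dim, Idx>{Idx{5}});
    // buf can now also be passed to a kernel enqueued for AccOmp without a copy.
}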

My question is: are there template tags for the memory, analogous to alpaka::AccCpuSerial and alpaka::AccCpuOmp2Blocks for the accelerators, e.g. something like alpaka::BufCPU or alpaka::BufCUDAGPU, and if yes, how can I use them for memory allocation?

Such tags would also be pretty helpful for specializing certain functions, e.g.:


// generic implementation with alpaka 
template<typename TAcc>
Vector operator+(Vector a, Vector b){
  Vector res;
  AddKernel kernel;
  alpaka::exec<TAcc>(...);
  // ...
  return res;
}

template<>
Vector operator+<alpaka::BufCPU>(Vector a, Vector b){
   // use eigen3 for highly optimized vector addition on CPU 
}

template<>
Vector operator+<alpaka::BufCUDAGPU>(Vector a, Vector b){
   // use cuBlas for highly optimized vector addition on CUDA GPU 
}
j-stephan commented 3 years ago

Would this tie in with our efforts on accessors? #1249

I think those have much of the functionality you ask for.

SimeonEhrig commented 3 years ago

Would this tie in with our efforts on accessors? #1249

I think those have much of the functionality you ask for.

I'm not sure if this is planned/implemented in accessors. We have to ask @bernhardmgruber. Nevertheless, accessors are a big subject and it will take some time to finalize them. I think a tag list is much easier and faster to implement and solves my problem.

BenjaminW3 commented 3 years ago

What you probably want to have is a hierarchy of memory spaces. However, it is not easy to tell from where a specific memory allocation will be accessible. You could bind it to the acceleration strategy, but as you found out, that might be too strict. You could split it into a CPU and multiple GPU memory spaces (alpaka::DevCpu and alpaka::DevGpuCuda). This may be better, but it is still not enough. If you have mapped/managed/whatever memory, you may still be able to access the GPU's memory from the CPU, or memory across GPUs. Sometimes the CPU may access the GPU memory but not the other way around. In my view, it is not possible to define such a simple tag-based hierarchy of memory spaces for overlapping memory spaces with non-bidirectional access.
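
To make the non-bidirectional access point concrete, here is a purely illustrative sketch (every name is hypothetical, none of this is alpaka API): an accessibility relation between memory spaces and devices stops being a clean hierarchy as soon as mapped memory enters, because the relation is one-way for some pairs.

// Illustration only - all names hypothetical, not alpaka API.
#include <type_traits>

struct DevCpu {};           // a host device
struct DevCudaGpu {};       // a CUDA device
struct MemHostPageable {};  // plain pageable host memory
struct MemGpuGlobal {};     // GPU global memory
struct MemHostMapped {};    // host memory mapped into the GPU address space

// "can TDev dereference memory of space TMem without a copy?"
template<typename TMem, typename TDev>
struct is_accessible_from : std::false_type {};

template<> struct is_accessible_from<MemHostPageable, DevCpu> : std::true_type {};
template<> struct is_accessible_from<MemGpuGlobal, DevCudaGpu> : std::true_type {};
// mapped memory is reachable from both sides ...
template<> struct is_accessible_from<MemHostMapped, DevCpu> : std::true_type {};
template<> struct is_accessible_from<MemHostMapped, DevCudaGpu> : std::true_type {};
// ... but pageable host memory stays CPU-only: the relation is not symmetric,
// so it does not collapse into a simple tag hierarchy.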

bussmann commented 3 years ago

@bernhardmgruber any ideas on this?

bernhardmgruber commented 3 years ago

Would this tie in with our efforts on accessors? #1249

I think those have much of the functionality you ask for.

IMO an accessor specifies what kind of operations are allowed on a memory resource and how loads and stores are handled. Accessors can be produced on top of buffers. Whether they carry the information about where the backing memory resource resides is an open question, but I would prefer if they did not. So you can have e.g. a read-only accessor independently of whether you access a GPU or CPU buffer.

Alpaka buffers are tied to something like a memory space (Kokkos terminology) inside which they can be validly accessed. These memory spaces are related to acceleration technologies, but they are not the same thing. Having such memory spaces is complicated, however, as @BenjaminW3 said, because there are increasingly wild ways in which these span devices. Some time ago we had GPU-only buffers; now we can share buffers between CPU and GPU and have memory pages migrate on the fly. I believe there might be more such changes in the future, so I would not like to see alpaka paint itself into a corner by choosing a memory space system.

I also think you might be approaching the problem from the wrong side. It is more important to know on what kind of accelerator a computation runs than where a buffer resides. So taking your example:

Vector<int, 5> v1;
Vector<int, 5> v2;

auto unevaluatedResult = v1 + v2; // build expression tree
Vector result = eval(unevaluatedResult, acc); // evaluate expression tree on a specific accelerator
...
template<>
Vector operator+<alpaka::AccCPU>(Vector a, Vector b){
   // use eigen3 for highly optimized vector addition on CPU 
}
template<>
Vector operator+<alpaka::AccOmp>(Vector a, Vector b){
   // use eigen3 for highly optimized vector addition on CPU 
}

template<>
Vector operator+<alpaka::CudaRt>(Vector a, Vector b){
   // use cuBlas for highly optimized vector addition on CUDA GPU 
}
template<>
Vector operator+<alpaka::HipRt>(Vector a, Vector b){
   // use hip BLAS if such a thing exists
}

You need to bring in the acc at some point. Whether you do that when evaluating, or whether you already pass it into your Vector ctor, is up to you. Furthermore, I wonder: how does the Vector ctor know how to allocate memory?
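
On the ctor question, a minimal sketch of one possible answer (assuming alpaka's Buf/allocBuf API; the Vector layout is made up for illustration, not the library's actual design): the Vector is constructed from a concrete device and allocates its buffer there, so the device information travels with the data.

// Sketch only: a Vector owning an alpaka buffer on a given device
// (assumed alpaka API: alpaka::Buf, alpaka::allocBuf, alpaka::Vec).
#include <alpaka/alpaka.hpp>
#include <cstddef>

template<typename TElem, std::size_t TSize, typename TDev>
struct Vector
{
    using Dim = alpaka::DimInt<1u>;
    using Idx = std::size_t;
    using Buf = alpaka::Buf<TDev, TElem, Dim, Idx>;

    TDev dev; // the device owning the memory
    Buf buf;  // storage allocated on dev

    explicit Vector(TDev const& device)
        : dev(device)
        , buf(alpaka::allocBuf<TElem, Idx>(device, alpaka::Vec<Dim, Idx>{Idx{TSize}}))
    {
    }
};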

bernhardmgruber commented 1 year ago

@SimeonEhrig what needs to be done here? Do you want to have tags similar to the recently added tags for accelerators from #1804?

SimeonEhrig commented 1 year ago

Yes. But I think the memory tags are more dedicated to checking your code. I want to have a trait something like this: alpaka::trait::can_use<AlpakaAccTag, AlpakaMemTag> (the naming is not final - only for explanation). It should return true if an acc can access the memory without a memory copy.

For CUDA without managed memory, this is a 1:1 relation (CudaRT to CudaMem). For the CPU, it is many-to-many. For example, if you create memory with the serial backend, you can also access it with the OpenMP 2 blocks backend without a memory copy.
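
A rough sketch of what such a trait could look like (the accelerator tags are the ones added in #1804; the memory tags and the can_use name are placeholders, exactly in the "naming is not final" sense above):

// Sketch only: acc tags from #1804 mapped to hypothetical memory tags.
#include <alpaka/alpaka.hpp>
#include <type_traits>

struct TagMemCpu {};    // hypothetical: memory from the shared host allocator
struct TagMemCudaRt {}; // hypothetical: CUDA device memory

// true if an accelerator can access the memory without a copy
template<typename TAccTag, typename TMemTag>
struct can_use : std::false_type {};

// all CPU backends share the host allocator -> many-to-many on the CPU side
template<> struct can_use<alpaka::TagCpuSerial, TagMemCpu> : std::true_type {};
template<> struct can_use<alpaka::TagCpuOmp2Blocks, TagMemCpu> : std::true_type {};

// without managed memory, CUDA is a 1:1 relation
template<> struct can_use<alpaka::TagGpuCudaRt, TagMemCudaRt> : std::true_type {};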

fwyzard commented 1 year ago

What about a CUDA GPU vs. a pinned host memory buffer? The memory can be accessed by the GPU (read, write, atomic operations, etc.), but it may not be as fast as the GPU's global memory.
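
The case described above in plain CUDA, for illustration only (not alpaka code): pinned, mapped host memory is readable and writable from the GPU, but accesses cross PCIe/NVLink instead of hitting GPU global memory, so any acc-to-memory relation would also have to express "accessible, but not local".

// Illustration only, plain CUDA runtime: a pinned, mapped host buffer.
#include <cuda_runtime.h>

int main()
{
    int* hostPtr = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&hostPtr), 5 * sizeof(int), cudaHostAllocMapped);

    int* devView = nullptr;
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&devView), hostPtr, 0);
    // A kernel may dereference devView directly (reads, writes, atomics),
    // but each access goes over the interconnect rather than global memory.

    cudaFreeHost(hostPtr);
}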