Open fwyzard opened 1 year ago
I fully support this idea (mentally) and I partially discussed it in #1824, where I proposed to replace the element layer by a compile-time size. A similar precedent for mixing compile-time with runtime extents is std::mdspan's std::extents.
```cpp
namespace alpaka {
    template<
        typename TApi,
        typename TDim,
        typename TIdx,
        typename TBlocks = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        typename TThreads = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        typename TElems = Vec<TDim, TIdx>::all(dynamic<TIdx>)>
    class AccGpuUniformCudaHipRt final;
}
```
I assume this is straw man syntax, because it assigns a value Vec<TDim, TIdx>::all(dynamic<TIdx>) to a type parameter TBlocks. In my experience with LLAMA, there are two ways to do this. One is to use an actual value as template parameter:
```cpp
namespace alpaka {
    template<
        typename TApi,
        typename TDim,
        typename TIdx,
        auto TBlocks = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        auto TThreads = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        auto TElems = Vec<TDim, TIdx>::all(dynamic<TIdx>)>
    class AccGpuUniformCudaHipRt final;
}
```
This would work if alpaka::Vec were a "structural type" (see here). However, it makes the template parameter list of the accelerators less suitable for type-list-based metaprogramming and would require at least C++20.
The alternative is using types, and I think I would really go for the standard's design:
```cpp
namespace alpaka {
    template<
        typename TApi,
        typename TDim,
        typename TIdx,
        typename TBlocks = std::extents<TIdx, std::dynamic_extent, std::dynamic_extent, std::dynamic_extent>,
        typename TThreads = std::extents<TIdx, 8, 8, 4>,
        typename TElems = std::extents<TIdx, 1, 1, 1>>
    class AccGpuUniformCudaHipRt final;
}
```
And replace alpaka::Vec in many places. Alternatively, we could extend alpaka::Vec to support such compile-time values.
One big problem you would have then, though, is that you could no longer run two different kernels with different work divisions on the same accelerator. Which is why I would perform all this customization on the workdiv instead. This would however cause the auto& acc inside the kernel to have a different type than the Acc type outside of kernels.
Thanks for your comments! I personally like the non-type template parameters best, but thanks for pointing out that it would require C++20, which at the moment would not be viable for CMS either.
> One big problem you would have then, though, is that you could no longer run two different kernels with different work divisions on the same accelerator.
Mhm, why not? The way I was thinking of it was along the lines of (using some straw man syntax):
```cpp
// fully dynamic work division
alpaka::exec<alpaka::AccGpuCudaRt<Dim2D, uint32_t>>(queue, workdiv, kernel{}, ...);

// fix the block size and the number of elements per thread
alpaka::exec<alpaka::AccGpuCudaRt<Dim2D, uint32_t, alpaka::dynamic<uint32_t>, 32u, 1u>>(queue, workdiv, kernel{}, ...);
```
so each kernel launch could use its own configuration. I wouldn't see it as much of a problem, given that the accelerator object is only ever available inside a kernel, not outside.
> Which is why I would perform all this customization on the workdiv instead.
Sure, that would be fine as well.
> This would however cause the auto& acc inside the kernel to have a different type than the Acc type outside of kernels.
I don't see a problem here, because as far as I know we never instantiate accelerator objects on the host.
We discussed this in the VC today and concluded that we would like to have this. We will need to alter the template parameter list of the accelerators in any case, since this type is passed into the device code and needs to hold the compile-time information. We prefer attaching the static extents to the workdiv, but the extra template parameters would propagate to the accelerator nonetheless. We would like to see a small prototype of the feature to make further decisions.
One more idea could be to change the signature of the alpaka kernels from
```cpp
template <typename TAcc, typename... TArgs>
void operator()(TAcc const& acc, TArgs... args)
```
to
```cpp
template <typename TAcc, typename TGrid, typename... TArgs>
void operator()(TAcc const& acc, TGrid grid, TArgs... args)
```
The TGrid type could have some of the sizes as constexpr (i.e. part of the type) and leave the others dynamic (known only at runtime). The kernel would get the grid dimension and sizes from the grid object instead of the accelerator. This way the accelerator object would not need to change depending on the details of the kernel launch parameters.
I'm not 100% convinced myself that this is a good approach, but it seems better than embedding the grid details in the accelerator type TAcc.
While writing a generic N-dimensional loop that works both on GPUs and CPUs, I realised that - from the point of view of the kernel - the grid dimensions are almost always dynamic.
To improve the efficiency of the loops, I think it could be useful to be able to specify some of the dimensions at compile time, and propagate that to the kernel.
Of course we also want to keep the possibility of specifying some of the dimensions at runtime. Eigen has been doing this since forever, using the Eigen::Dynamic constant to identify runtime-sized dimensions; C++20 introduces something similar with std::dynamic_extent for std::span. The Alpaka equivalent could be something along those lines. Given the Alpaka API, the best place to introduce this would be to extend the accelerator type.
The work division parameters for the kernel launch should take the compile-time values into account, and the getWorkDiv functions used on the device could then take advantage of the compile-time values to return them as constants.
Comments?