
[RFC] extend the Accelerator interface with compile-time grid dimensions #1930

Open fwyzard opened 1 year ago

fwyzard commented 1 year ago

While writing a generic N-dimensional loop that works both on GPUs and CPUs, I realised that - from the point of view of the kernel - the grid dimensions are almost always dynamic.

To improve the efficiency of the loops, I think it could be useful to be able to specify some of the dimensions at compile time, and propagate that to the kernel.

Of course we also want to keep the possibility of specifying some of the dimensions at runtime. Eigen has been doing this since forever, using the Eigen::Dynamic constant to identify runtime-sized dimensions; C++20 introduces something similar with std::dynamic_extent for std::span. The Alpaka equivalent could be something like

namespace alpaka {
    template <typename TIdx>
    inline constexpr TIdx dynamic = std::numeric_limits<TIdx>::max();
}

Given the Alpaka API, the best place to introduce this would be to extend the Accelerator type:

namespace alpaka {
    template<
        typename TApi,
        typename TDim,
        typename TIdx,
        typename TBlocks = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        typename TThreads = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        typename TElems = Vec<TDim, TIdx>::all(dynamic<TIdx>)>
    class AccGpuUniformCudaHipRt final;
}

The work division parameters for the kernel launch should take into account the compile-time values.

And then the getWorkDiv functions used on the device can take advantage of the compile-time values to return them as constants.
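To make the intent concrete, here is a minimal sketch (hypothetical names, not the current alpaka API) of how a kernel could exploit a block extent that is part of the accelerator type:

#include <cstdint>

// Sketch only: a hypothetical accelerator that carries its block extent as part of
// the type. None of these names are part of the current alpaka API.
template<std::uint32_t TBlockThreads>
struct AccStaticBlock
{
    static constexpr std::uint32_t blockThreadExtent() { return TBlockThreads; }
};

// With the extent available as a constant expression, the per-block loop has a
// compile-time trip count that the compiler can fully unroll or vectorize.
template<typename TAcc>
void kernelBody(TAcc const& acc, float* data)
{
    for(std::uint32_t i = 0; i < acc.blockThreadExtent(); ++i)
        data[i] *= 2.0f;
}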

Comments ?

bernhardmgruber commented 1 year ago

I fully support this idea (mentally) and I partially discussed it in #1824 where I proposed to replace the element layer by a compile-time size. A similar precedent for mixing compile-time with runtime extents is std::mdspan's std::extents.

namespace alpaka {
    template<
        typename TApi,
        typename TDim,
        typename TIdx,
        typename TBlocks = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        typename TThreads = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        typename TElems = Vec<TDim, TIdx>::all(dynamic<TIdx>)>
    class AccGpuUniformCudaHipRt final;
}

I assume this is straw man syntax, because we assign a value Vec<TDim, TIdx>::all(dynamic<TIdx>) to a type parameter TBlocks here. In my experience with LLAMA, there are two ways to do this. One is to use an actual value as a non-type template parameter:

namespace alpaka {
    template<
        typename TApi,
        typename TDim,
        typename TIdx,
        auto TBlocks = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        auto TThreads = Vec<TDim, TIdx>::all(dynamic<TIdx>),
        auto TElems = Vec<TDim, TIdx>::all(dynamic<TIdx>)>
    class AccGpuUniformCudaHipRt final;
}

This would work if alpaka::Vec were a "structural type" (see here). However, it makes the template parameter list of the accelerators less suitable for type-list-based metaprogramming and would require at least C++20.
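For reference, a minimal example of what "structural" means here, assuming a simplified Vec-like aggregate (not alpaka's actual Vec): in C++20 a class type can be used as a non-type template parameter if all its non-static data members are public and themselves of structural type.

#include <cstddef>
#include <cstdint>

// A Vec-like aggregate whose only member is a public array of integers is a
// "structural type" and may appear as a C++20 non-type template parameter.
template<typename TIdx, std::size_t N>
struct StaticVec
{
    TIdx values[N];
};

// C++20 only: a class-type value as a template parameter.
template<StaticVec<std::uint32_t, 2> TBlocks>
struct AccWithStaticBlocks
{
    static constexpr auto blocks = TBlocks;
};

// usage: AccWithStaticBlocks<StaticVec<std::uint32_t, 2>{{8, 8}}> acc{};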

The alternative is using types, and I think I would really go for the standard's design:

namespace alpaka {
    template<
        typename TApi,
        typename TDim,
        typename TIdx,
        typename TBlocks = std::extents<TIdx, std::dynamic_extent, std::dynamic_extent, std::dynamic_extent>,
        typename TThreads = std::extents<TIdx, 8, 8, 4>,
        typename TElems = std::extents<TIdx, 1, 1, 1>>
    class AccGpuUniformCudaHipRt final;
}

This would replace alpaka::Vec in many places. Alternatively, we could extend alpaka::Vec to support such compile-time values.
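For what it's worth, a small sketch of how std::extents mixes static and dynamic extents and exposes the static ones at compile time (std::extents needs the C++23 <mdspan> header, or the Kokkos reference mdspan implementation):

#include <cstdint>
#include <mdspan>  // std::extents (C++23)
#include <span>    // std::dynamic_extent

// One extent fixed at compile time (8), one only known at runtime.
using BlockExtents = std::extents<std::uint32_t, 8, std::dynamic_extent>;

int main()
{
    BlockExtents ext{128};  // only the dynamic extent is passed at run time

    static_assert(BlockExtents::static_extent(0) == 8);                    // compile-time value
    static_assert(BlockExtents::static_extent(1) == std::dynamic_extent);  // marked dynamic

    return ext.extent(1) == 128 ? 0 : 1;  // runtime query of the dynamic extent
}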

One big problem you have now, though, is that you could no longer run two different kernels with different work divisions on the same accelerator. Which is why I would perform all this customization on the workdiv instead. This would however cause the auto& acc inside the kernel to have a different type than the Acc type outside of kernels.
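As a loose illustration of that alternative, a work division type carrying some extents statically could look roughly like this (hypothetical names, not existing alpaka API):

#include <cstdint>

// Hypothetical work division that fixes the block and element extents at compile
// time while the grid extent stays a runtime value chosen per launch.
template<typename TIdx, TIdx TBlockThreads, TIdx TElemsPerThread>
struct StaticWorkDiv
{
    TIdx gridBlockExtent;  // runtime, decided per kernel launch

    static constexpr TIdx blockThreadExtent = TBlockThreads;
    static constexpr TIdx threadElemExtent  = TElemsPerThread;
};

// usage: StaticWorkDiv<std::uint32_t, 256, 1> workDiv{numBlocks};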

fwyzard commented 1 year ago

Thanks for your comments !

I personally like the non-type template parameters best, but thanks for pointing out that it would require C++20 - which at the moment would not be viable for CMS, either.

One big problem you have now, though, is that you could no longer run two different kernels with different work divisions on the same accelerator.

Mhm, why not ? The way I was thinking of it was along the lines of (using some strawman syntax):

// fully dynamic work division
alpaka::exec<alpaka::AccGpuCudaRt<Dim2D, uint32_t>>(queue, workdiv, kernel{}, ...);

// fix the block size and the number of elements per thread
alpaka::exec<alpaka::AccGpuCudaRt<Dim2D, uint32_t, alpaka::dynamic<uint32_t>, 32u, 1u>>(queue, workdiv, kernel{}, ...);

so each kernel launch could use its own configuration.

I wouldn't see that as much of a problem, given that the accelerator object is only ever available inside a kernel, not outside.

Which is why I would perform all this customization on the workdiv instead.

Sure, that would be fine as well.

This would however cause the auto& acc inside the kernel to have a different type than the Acc type outside of kernels.

I don't see a problem here, because as far as I know we never instantiate accelerator objects on the host.

bernhardmgruber commented 1 year ago

We discussed this in the VC today and concluded that we would like to have this. We will need to alter the template parameter list of the accelerators in any case, since this type is passed into the device code and needs to hold the compile-time information. We prefer attaching the static extents to the workdiv, but the extra template parameters would propagate to the accelerator nonetheless. We would like to see a small prototype of the feature to make further decisions.

fwyzard commented 1 year ago

One more idea could be to change the signature of the alpaka kernels from

template <typename TAcc, typename... TArgs>
void operator()(TAcc const& acc, TArgs... args)

to

template <typename TAcc, typename TGrid, typename... TArgs>
void operator()(TAcc const& acc, TGrid grid, TArgs... args)

The TGrid type could have some of the sizes as constexpr (e.g. part of the type), and leave others as dynamic (known only at runtime).

The kernel would get the grid dimension and sizes from the grid object instead of the accelerator.

This way the accelerator object would not need to change depending on the details of the kernel launch parameters.
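A rough sketch of what such a grid type and kernel could look like (illustrative names only):

#include <cstdint>

// Hypothetical 2D grid: the y extent is part of the type, the x extent is only
// known at runtime. Illustrative only, not alpaka API.
template<typename TIdx, TIdx TExtentY>
struct Grid2D
{
    TIdx extentX;  // dynamic, set at kernel launch

    static constexpr TIdx extentY = TExtentY;  // compile-time constant
};

// The kernel reads the grid sizes from the grid object instead of the accelerator;
// the loop over the static dimension has a constant trip count.
template<typename TAcc, typename TGrid>
void kernel(TAcc const& /* acc */, TGrid grid, float* data)
{
    for(std::uint32_t y = 0; y < TGrid::extentY; ++y)    // unrollable
        for(std::uint32_t x = 0; x < grid.extentX; ++x)  // runtime bound
            data[y * grid.extentX + x] = 0.f;
}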

I'm not 100% convinced myself that this is a good approach, but it seems better than embedding the grid details in the accelerator type TAcc.