NVIDIA / cccl

CUDA Core Compute Libraries

[EPIC] std::simd support in libcu++ #30

Open jrhemstad opened 1 year ago

jrhemstad commented 1 year ago

We should add a heterogeneous implementation of std::simd to libcu++.

High-level goals:

### Tasks
- [ ] Review/discuss CUTLASS implementation of similar types
- [ ] Participate in LEWG discussion on incorporating `<simd>` (see [p1928](https://wg21.link/p1928))
nouiz commented 1 year ago

I clearly see the value versus the status quo, but the advantage over CUDA's floatN types isn't clear to me. Can you explain it?

jrhemstad commented 1 year ago

> I clearly see the value versus the status quo, but the advantage over CUDA's floatN types isn't clear to me. Can you explain it?

I'm not sure I follow. Isn't floatN the status quo?

nouiz commented 1 year ago

I see two ways to trigger vectorized loads:

```cpp
__global__ void f(float2* in) {
    // use `in` directly
}
```

and what I consider the status quo:

```cpp
__global__ void f(float* in) {
    // ...reinterpret_cast...
}
```

The first case removes the reinterpret_cast, but it limits the API to multiples of 2 elements. The second doesn't limit the API, but it requires ugly code.

Does std::simd allow keeping a clean API without requiring ugly code?

jrhemstad commented 1 year ago

> Does std::simd allow keeping a clean API without requiring ugly code?

Indeed.

Instead of

```cpp
__global__ void f(float* in) {
    float4 vec = *reinterpret_cast<float4*>(in);
}
```

we have

```cpp
__global__ void f(float* in) {
    std::fixed_size_simd<float, 4> vec{in, std::vector_aligned};
}
```

One of the other advantages of std::simd over float4 is that simd types come with well-defined binary operators like `operator+`, whereas float4 defines none. There is a whole host of machinery you get for free with a std::simd type that you would otherwise need to implement yourself on top of float4.

nouiz commented 1 year ago

Great. Does it help with the last few elements of a row whose length isn't a multiple of N?

bernhardmgruber commented 3 months ago

I think the real game changer of std::simd in libcu++ is that it allows generic, single-source, portable SIMD programming. I can write a kernel and it will explicitly vectorize (guaranteed, or a compilation error) for a CPU target, and collapse to scalar code on a GPU target. This is a huge improvement over relying on auto-vectorization of scalar code, which compiles for CUDA and CPU targets but is brittle. Also, barely any SIMD library supports CUDA (Kokkos SIMD is a notable exception), so explicit SIMD code is often locked onto CPU targets. The result is that you again have to maintain two code paths when targeting both CPU and GPU, or write your own SIMD abstraction library, probably worse than what we could provide here. I have written one myself:

Here is a small portable kernel, using alpaka (I was collaborator) for kernel abstraction and LLAMA (author is me) for data layout abstraction, of an n-body simulation, updating particle positions based on their velocities:

```cpp
template<int ElementsPerThread>
struct MoveKernel
{
    template<typename Acc, typename View>
    ALPAKA_FN_HOST_ACC void operator()(const Acc& acc, View particles) const
    {
        const auto ti = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        const auto i = ti * ElementsPerThread;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> pos;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> vel;
        llama::loadSimd(particles(i)(tag::Pos{}), pos);
        llama::loadSimd(particles(i)(tag::Vel{}), vel);
        llama::storeSimd(pos + vel * +timestep, particles(i)(tag::Pos{}));
    }
};
```

Source: https://github.com/alpaka-group/llama/blob/develop/examples/alpaka/nbody/nbody.cpp#L221-L230

The ElementsPerThread parameter chooses the behavior of llama::SimdN: if 1, the kernel collapses into scalar code; if >1, SIMD types are used and, with the right compiler flags, AVX2, AVX-512, NEON, etc. instructions are produced. MakeSizedBatch is essentially a wrapper around `xsimd::make_sized_batch_t<T, N>`; xsimd is the SIMD library I used. std::simd in libcu++ could entirely cover and standardize this use case.

My example above does more than is in scope for std::simd (yet), like creating SIMD-fied structs (Vec3 is a struct of 3 floats here) and abstracting loads/stores from data layouts (particles can be a struct-of-arrays container here as well).