alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Add tensor core abstractions #1346

Open j-stephan opened 3 years ago

j-stephan commented 3 years ago

In the meeting on 25 May 2021 we discussed having an alpaka abstraction for the various tensor core APIs found in recent versions of CUDA and ROCm. Opening this issue for broader discussion (and so we don't forget this wish).

bernhardmgruber commented 3 years ago

A quick google search revealed this to me: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/

So it looks like programmatic access to tensor cores is given via special API calls, and they can essentially only do FMA on 4x4 matrices in single and half precision. That sounds very limited to me. But hey, that's what special-purpose hardware is all about! I see potential use for linear algebra. 4x4 matrices are also heavily used in 3D graphics and computational geometry. Still, although these fields were the prime target of GPUs, the need for tensor cores only appeared much, much later, with deep learning.

I have not found the corresponding APIs in HIP, nor in OpenCL or SYCL. So I don't know how AMD exposes them. For CPU targets I guess you would have to model these 4x4 matrix FMAs with plain floats. There is also a new BF16 float type, but that is very new: https://stackoverflow.com/a/49997863/2406044
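A CPU fallback along the lines suggested above could look something like this. This is a hypothetical sketch, not part of any alpaka API; the names `Mat4` and `mat4Fma` are made up for illustration, and it simply models the tensor-core operation D = A * B + C on 4x4 tiles with ordinary floats:

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4x4 row-major matrix type; plain floats stand in for the
// half-precision inputs a tensor core would consume.
using Mat4 = std::array<std::array<float, 4>, 4>;

// Model the tensor-core fused multiply-add: D = A * B + C.
Mat4 mat4Fma(Mat4 const& a, Mat4 const& b, Mat4 const& c)
{
    Mat4 d{};
    for(std::size_t i = 0; i < 4; ++i)
        for(std::size_t j = 0; j < 4; ++j)
        {
            float acc = c[i][j]; // start from the accumulator matrix
            for(std::size_t k = 0; k < 4; ++k)
                acc += a[i][k] * b[k][j];
            d[i][j] = acc;
        }
    return d;
}
```

A real CPU backend would presumably vectorize this, but semantically that is all the operation does.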

I think access to tensor cores and reduced-precision FP operations are too vendor-specific for the moment to design a meaningful API. But please prove me wrong! :)

j-stephan commented 3 years ago

AMD calls them Matrix Cores; at least one GPU (the MI100) already has them. I haven't found the accompanying API in HIP yet, though.

bernhardmgruber commented 4 months ago

So, I found out today that the public-facing API for accessing tensor cores from CUDA is via cutlass. Specifically mma.h, which essentially sets up some blocks of floats and calls a PTX mnemonic.

fwyzard commented 4 months ago

Isn't it also documented in the CUDA Programming Guide, under 7.24. Warp Matrix Functions?
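For reference, a minimal sketch of what those warp matrix functions from `mma.h` look like (CUDA 9+, compute capability 7.0+). Note that while the hardware tile mentioned earlier in the thread is 4x4, the CUDA-level WMMA API exposes warp-cooperative tiles such as 16x16x16 with half-precision inputs and float accumulation. Host-side allocation and the kernel launch are omitted; this is illustrative, not an alpaka proposal:

```cuda
#include <mma.h>

using namespace nvcuda;

// One warp cooperatively computes C = A * B on a 16x16x16 tile
// (half-precision inputs, single-precision accumulator).
__global__ void wmmaTile16x16x16(half const* a, half const* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(aFrag, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag); // cFrag = aFrag * bFrag + cFrag
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}
```

An alpaka abstraction would presumably have to wrap this fragment/load/mma/store pattern on CUDA and map it to AMD's Matrix Cores and to a plain-float loop on CPU backends.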