TravisWhitaker opened this issue 5 years ago
It would also be valuable to include types to represent the additional reduced-precision quantities that WMMA supports, such as short floats and four-bit ints. It may be difficult to do this in a way that's compatible with existing backends, however.
Draft proposal:
Accelerate is a DSL for parallel array-based computations. A unique feature of the language is runtime compilation, allowing Haskell to be used as a meta-language for dynamically assembling a massively-parallel array program in a retargetable fashion. The language is currently capable of targeting Nvidia GPUs with CUDA cores via LLVM's NVPTX backend. The current LLVM code generation strategy is based on the paper Type-safe Runtime Code Generation: Accelerate to LLVM by Trevor L. McDonell, Manuel M. T. Chakravarty, Vinod Grover, and Ryan R. Newton.
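For context, a minimal Accelerate program looks something like the following dot product, which is assembled at runtime and executed on the GPU through accelerate-llvm-ptx:

```haskell
-- Minimal Accelerate example: a dot product, compiled at runtime and run on
-- the GPU via the accelerate-llvm-ptx backend.
import Data.Array.Accelerate          as A
import Data.Array.Accelerate.LLVM.PTX (run)

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = fromList (Z :. 10) [0..]    :: Vector Float
      ys = fromList (Z :. 10) [1,1..]  :: Vector Float
  print $ run (dotp (use xs) (use ys))
```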
Nvidia has recently released hardware with Tensor Cores. These are dedicated execution resources for matrix fused-multiply-accumulate operations. Nvidia calls these WMMA operations, or warp-level matrix multiply-accumulate. LLVM gained support for them here and Nvidia's ISA documentation is here.
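For reference, each WMMA operation computes a small, fixed-shape matrix fused-multiply-accumulate cooperatively across the threads of a warp; per my reading of the PTX documentation, the f16 variants roughly compute

```latex
D_{m \times n} = A_{m \times k} \, B_{k \times n} + C_{m \times n},
\qquad (m, n, k) \in \{ (16,16,16),\ (32,8,16),\ (8,32,16) \}
```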
Accelerate programs may be well-positioned to take advantage of these new intrinsics. The project involves adding support for WMMA intrinsics to the Accelerate code generator, as well as evaluating their performance impact with benchmarks. It will involve working with a large embedded DSL that leverages advanced type system features, code generation with LLVM, and massively parallel computing resources.
I think it might be possible to implement at least some of this today, which is a good start for a project like this.
We have support for Float16, but not Int4, so that would need to be added to support every instruction, but Float16 will be enough to demonstrate whether it works. We also support short vectors (e.g. `8xFloat16`), which also seem necessary.
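For concreteness, the kind of expression these instructions target can already be written in plain Accelerate over Half (Float16) values. The sketch below is just the standard replicate/fold matrix-multiply idiom adapted to Half, with nothing WMMA-aware about it; it assumes the Elt/Num instances for Half that our Float16 support provides.

```haskell
-- Plain Accelerate matrix multiply-accumulate over Half (Float16) values.
-- Nothing here is WMMA-aware; it only illustrates the shape of computation
-- the wmma intrinsics accelerate. Assumes the Elt/Num instances for Half
-- provided by accelerate's Float16 support.
import Data.Array.Accelerate as A
import Numeric.Half          (Half)

type Tile = Array DIM2 Half

mmaHalf :: Acc Tile -> Acc Tile -> Acc Tile -> Acc Tile
mmaHalf a b c = A.zipWith (+) (mmult a b) c
  where
    -- the usual replicate/zipWith/fold matrix multiply
    mmult x y =
      let Z :. m :. _ = unlift (shape x) :: Z :. Exp Int :. Exp Int
          Z :. _ :. n = unlift (shape y) :: Z :. Exp Int :. Exp Int
      in  A.fold (+) 0
            $ A.zipWith (*)
                (A.replicate (lift (Z :. All :. n :. All)) x)
                (A.replicate (lift (Z :. m :. All :. All)) (A.transpose y))
```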
There is a foreign function interface which you can use to emit arbitrary LLVM instructions. The current design is based on the idea of calling some function of type `a -> b`, which, if I read these `wmma.{load, store, mma}` functions correctly, doesn't quite fit, but I think we can make it work. Currently these foreign functions don't get access to anything other than the input value `a`; we might need to relax that restriction to make these functions work, or do something else, but maybe not (I only read the documents you linked briefly, so I am not 100% sure how the instructions should be called).
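To make the `a -> b` shape concrete, this is roughly how a scalar foreign function is exposed today: the first argument is the backend-specific implementation (for the PTX backend, the bit that would splice the desired LLVM into the kernel), and the second is a pure fallback for backends that don't recognise it. The `fmaWith` name and the idea of passing the backend implementation in as a parameter are only for illustration, and I'm writing the `Foreign` class name from memory.

```haskell
-- Sketch of the existing scalar FFI shape: a foreign function of type a -> b
-- plus a pure fallback. A warp-level wmma operation does not fit this
-- one-thread, one-value model directly, which is the mismatch discussed above.
import Data.Array.Accelerate as A

fmaWith :: A.Foreign asm
        => asm ((Float, Float, Float) -> Float)  -- backend-specific implementation
        -> Exp Float -> Exp Float -> Exp Float
        -> Exp Float
fmaWith asm x y z = A.foreignExp asm fallback (A.lift (x, y, z))
  where
    -- pure fallback used by backends without a native implementation
    fallback v =
      let (a, b, c) = unlift v :: (Exp Float, Exp Float, Exp Float)
      in  a * b + c
```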
Getting the code generator to automatically emit these instructions is probably more difficult. I think this project should just focus on exposing them to the accelerate user in a nice way (again, I'm not sure what this would look like yet). Once we have more experience using them, it may become more obvious where/how they should be integrated into the standard code generator.
I don't have access to an RTX card right now, so my ability to help mentor may be limited.
Hardware access is going to be tricky; I don't even think there are AWS instances with access to this hardware available yet.
I think as long as a foreign function is allowed to allocate storage for its result on its own, it should be possible to make little test kernels in raw PTX and call them with the FFI.
I think this kind of thing might fall outside what the scalar foreign function interface was meant to provide... it seems that you need all threads in the warp to participate, but the result of an individual thread is undefined? I think that will kind of break the model of accelerate, where we don't talk about individual warps/blocks etc. (and what should we do for the other backends that try to execute this instruction?).
This would be better suited for use within a foreign kernel declaration, but then there isn't much need for substantial integration with accelerate at that point.
It's true that Accelerate would need to treat kernels that use WMMA specially, since they operate at the warp-level. Foreign functions using WMMA wouldn't be executable on other backends, although the FFI can still be used to experiment with kernels that use these functions. The most ergonomic way for an Accelerate user to take advantage of these instructions would be for the PTX backend to decide where it's appropriate to use them; the PTX runtime should already know about warps in some capacity, since it needs to choose a reasonable thread block size for the computation.
It's that time of year again, and having done some work with Nvidia's new hardware sporting Tensor Cores, I thought it'd make a nice project to evaluate whether or not Accelerate programs would benefit from these new execution resources and, if they do, to start emitting code with accelerate-llvm-ptx that takes advantage of them.
The shiny new Tensor Cores are nothing more than dedicated execution resources for doing fused-multiply-accumulate on matrix tiles of a fixed shape and type. Nvidia calls its intrinsics for these operations WMMA, or warp-level matrix multiply-accumulate. LLVM gained support for them here and Nvidia's documentation for the ISA is here. I believe CUDA version 10 is required to launch kernels that use these intrinsics.
On paper (or rather, in my head) this should be fairly straightforward to implement. The PTX CodeGen needs to look for matrix operations with types and array shapes that are compatible with the WMMA execution resources available on the targeted compute capability. If compatible expressions are found, WMMA intrinsics may be emitted for them. It's not obvious (at least to me) that typical (i.e. not contrived-for-WMMA) accelerate programs are well-positioned to take advantage of these intrinsics, so benchmarking is in order as well.
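To illustrate the sort of decision I have in mind, the eligibility check might look something like the sketch below. The types here are entirely hypothetical (nothing like them exists in accelerate today), and the list of supported f16 fragment shapes is per my reading of the PTX docs.

```haskell
-- Purely hypothetical sketch of the kind of eligibility check the PTX code
-- generator might perform before selecting WMMA intrinsics. None of these
-- types exist in accelerate; they only illustrate the decision described above.
data ScalarTy  = F16 | F32 | S8 | S4 | U8
  deriving (Eq, Show)

data TileShape = TileShape { tileM, tileN, tileK :: Int }
  deriving (Eq, Show)

data MatMulInfo = MatMulInfo
  { mmElem  :: ScalarTy     -- element type of the multiplicands
  , mmAcc   :: ScalarTy     -- element type of the accumulator
  , mmShape :: TileShape    -- static tile shape, if known at compile time
  }

-- Fragment shapes supported for f16 inputs on sm_70+, per the PTX docs.
wmmaShapesF16 :: [TileShape]
wmmaShapesF16 = [TileShape 16 16 16, TileShape 32 8 16, TileShape 8 32 16]

-- Decide whether a matrix multiply-accumulate node could be lowered to WMMA
-- on the targeted compute capability.
wmmaEligible :: Int -> MatMulInfo -> Bool
wmmaEligible computeCapability op =
     computeCapability >= 70
  && mmElem op == F16
  && mmAcc  op `elem` [F16, F32]
  && mmShape op `elem` wmmaShapesF16
```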
Although I've had the privilege to use both accelerate and LLVM quite extensively in my work, my model of how accelerate codegen works is limited to what I've been able to glean from Trevor's papers, so I might be barking up the wrong tree entirely. If others think this is a good idea I'd be happy to offer to mentor a student.