philipportner opened 1 year ago
Hey @corepointer, I think this could be interesting for you :) In case you take a look, please provide feedback for improvements. There's a lot of GPU knowledge I'm missing (e.g., see the TODO for the target triple), so I'd appreciate your help!
Hi Philipp! Yes, this definitely looks interesting to me :) I'll have a closer look after the GA next week. As for the hardcoded target triple: there is either a way to query the available hardware at run-time or some way to create a "fat binary" that includes all the targets you want to support. That much I can tell off the top of my head :-P Best, Mark
The runtime query works via cudaGetDeviceProperties(). The SM version, I think, is just the major and minor compute capability concatenated. I haven't found a direct way to query the PTX version, and apparently there will likely never be one, but PTX->SASS seems to be backwards compatible, so choosing a specific, reasonably old PTX version for all targets is probably a reasonable solution, unless you/LLVM need something specific from a newer PTX version.
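For reference, a minimal sketch of such a runtime query (plain CUDA runtime API, independent of this PR), deriving the `sm_XY` string by concatenating the major and minor compute capability as described above:

```cpp
// Minimal sketch (not part of this PR): query the compute capability of
// device 0 at runtime and derive the "sm_XY" string from it.
#include <cuda_runtime.h>

#include <cstdio>
#include <string>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    // The SM version is the major and minor compute capability concatenated,
    // e.g. major 8 / minor 6 -> "sm_86".
    std::string chip = "sm_" + std::to_string(prop.major) + std::to_string(prop.minor);
    std::printf("triple: nvptx64-nvidia-cuda, chip: %s\n", chip.c_str());
    return 0;
}
```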
This is a draft PR, not intended to be merged atm.
The PR description is also WIP and will be updated with more information.
Description
This PR introduces an experimental code generation pipeline to target the GPU. Instead of lowering DaphneOps to calls to the precompiled CUDA/cuDNN kernels, this work aims to generate code that can be lowered to target the GPU directly.
Currently only the `daphne.matMul` operation is supported. In the `GPULoweringPass`, the `daphne.matMul` operation is rewritten to a set of operations from the `Linalg`, `MemRef`, and `GPU` dialects.

Input to `GPULoweringPass`

Output of `GPULoweringPass`

The `gpu.host_register` op registers the memref for access from the device. The inputs to these `gpu.host_register` ops have to be cast to an unranked memref. We create `linalg.fill` and `linalg.matmul` ops which will be lowered by the pipeline. Additionally, we insert input and output conversions to and from our C++ runtime.

After this initial rewrite, the IR is passed to a lowering pipeline to produce code that can be run directly on the GPU. This includes lowering to the GPU dialect, producing NVVM IR, and generating a CUBIN which is embedded in the IR as an attribute.
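To illustrate the shape of this rewrite, here is a minimal C++ sketch. It uses hypothetical helper names and upstream MLIR builders, not the actual `GPULoweringPass` code; `lhs`, `rhs`, and `out` are assumed to be ranked f32 memrefs, and builder signatures vary across MLIR versions:

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Hypothetical helper (not the actual pass code): cast a ranked memref to an
// unranked one so it can be passed to gpu.host_register.
static Value castToUnranked(PatternRewriter &rewriter, Location loc, Value buf) {
    auto elemTy = buf.getType().cast<MemRefType>().getElementType();
    // Null memory-space attribute = default memory space.
    auto unrankedTy = UnrankedMemRefType::get(elemTy, /*memorySpace=*/Attribute());
    return rewriter.create<memref::CastOp>(loc, unrankedTy, buf);
}

// Sketch of the rewrite body: register the buffers for device access, then
// emit linalg.fill / linalg.matmul for the subsequent lowering pipeline.
static void emitGpuMatmul(PatternRewriter &rewriter, Location loc,
                          Value lhs, Value rhs, Value out) {
    rewriter.create<gpu::HostRegisterOp>(loc, castToUnranked(rewriter, loc, lhs));
    rewriter.create<gpu::HostRegisterOp>(loc, castToUnranked(rewriter, loc, rhs));
    rewriter.create<gpu::HostRegisterOp>(loc, castToUnranked(rewriter, loc, out));

    // Zero-initialize the output buffer before accumulating into it.
    Value zero = rewriter.create<arith::ConstantOp>(loc, rewriter.getF32FloatAttr(0.0f));
    rewriter.create<linalg::FillOp>(loc, ValueRange{zero}, ValueRange{out});
    rewriter.create<linalg::MatmulOp>(loc, ValueRange{lhs, rhs}, ValueRange{out});
}
```

The input/output conversions to and from the C++ runtime mentioned above are omitted here; they wrap the memrefs before and after this snippet.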
Pipeline visualization
```mermaid
graph TD;
    Input-->LowerToGPU;
    LowerToGPU-->LinalgToParallelLoops;
    LinalgToParallelLoops-->RewriteToCallKernel;
    RewriteToCallKernel-->GPUMapParallelLoops;
    GPUMapParallelLoops-->ParallelLoopsToGPU;
    ParallelLoopsToGPU-->GPUKernelOutlining;
    GPUKernelOutlining-->LowerAffine;
    LowerAffine-->ArithToLLVM;
    ArithToLLVM-->CanonicalizeAndCSE;
    CanonicalizeAndCSE-->SCFToCF;
    SCFToCF-->CFToLLVM;
    CFToLLVM-->LowerGPUOpsToLLVMOps;
    LowerGPUOpsToLLVMOps-->ArithToLLVM;
    ArithToLLVM-->GpuSerializeToCubinPass;
    GpuSerializeToCubinPass-->LowerToLLVM;
    LowerToLLVM-->ReconcileUnrealizedCasts;
```

At the moment, no optimizations are applied to the matmul operator, resulting in less-than-optimal performance. I'll add optimization passes to improve performance.
The pipeline is currently not very stable: changes in the order of passes can introduce problems. I hope to either simplify the pipeline or make it more obvious in which order the passes have to run.
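For orientation, here is a rough sketch of how such a pipeline can be assembled with `mlir::PassManager`. The pass names follow upstream MLIR and may not match this PR exactly; the DAPHNE-specific passes (e.g., `RewriteToCallKernel`) and the final lowering to LLVM are omitted:

```cpp
#include "mlir/Conversion/AffineToStandard/AffineToStandard.h"
#include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"
#include "mlir/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.h"
#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
#include "mlir/Conversion/SCFToGPU/SCFToGPUPass.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/Dialect/Linalg/Passes.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// Rough sketch of the lowering pipeline using upstream MLIR passes only
// (names and header locations may differ between LLVM versions).
void buildGpuPipeline(mlir::PassManager &pm) {
    pm.addNestedPass<mlir::func::FuncOp>(mlir::createConvertLinalgToParallelLoopsPass());
    pm.addNestedPass<mlir::func::FuncOp>(mlir::createGpuMapParallelLoopsPass());
    pm.addNestedPass<mlir::func::FuncOp>(mlir::createParallelLoopToGpuPass());
    pm.addPass(mlir::createGpuKernelOutliningPass());
    pm.addPass(mlir::createLowerAffinePass());
    pm.addPass(mlir::createCanonicalizerPass());
    pm.addPass(mlir::createCSEPass());
    pm.addPass(mlir::createConvertSCFToCFPass());
    pm.addNestedPass<mlir::gpu::GPUModuleOp>(mlir::createLowerGpuOpsToNVVMOpsPass());
    // Serialize each gpu.module to a CUBIN that is embedded in the IR as an
    // attribute; target triple / sm / ptx versions are hardcoded (see TODO).
    pm.addNestedPass<mlir::gpu::GPUModuleOp>(
        mlir::createGpuSerializeToCubinPass("nvptx64-nvidia-cuda", "sm_86", "+ptx76"));
    pm.addPass(mlir::createReconcileUnrealizedCastsPass());
}
```

In the PR, the DAPHNE-specific rewrites and the remaining LLVM lowering are interleaved with these passes, which is where the ordering sensitivity mentioned above comes from.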
How to run
Requires an LLVM/MLIR build with the `NVPTX` target enabled. `-DCMAKE_CUDA_COMPILER` needs to be set for that (these are currently hardcoded in the PR).

Minimal working example
Note: currently, only single-precision matrices and simple scripts (like the provided example) are tested/supported.
Run by executing `bin/daphne --cuda gpu.daphne`.
TODO
- `build.sh`: make the CUDA-related build settings (e.g., `-DCMAKE_CUDA_COMPILER`) configurable instead of hardcoded
- remove the hardcoded target triple, SM version, and PTX version in `mlir::createGpuSerializeToCubinPass("nvptx64-nvidia-cuda", "sm_86", "+ptx76")`
Tested with Driver Version 545.23.06, CUDA Version 12.3.