philipportner opened 1 year ago
Hey @corepointer, I think this could be interesting for you :) In case you take a look, please provide feedback for improvements. There's a lot of GPU knowledge I'm missing (e.g., see the TODO for the target triple), so I'd appreciate your help!
Hi Philipp! Yes, this definitely looks interesting to me :) I'll have a closer look after the GA next week. As for the hardcoded target triple: there is either a way to query the available hardware at run-time or some way to create a "fat binary" that includes all the targets you want to support. That much I can tell off the top of my head :-P Best, Mark
The runtime query works via cudaGetDeviceProperties(). The SM version, I think, is just the major and minor compute capability concatenated. I haven't found a direct way to query the PTX version, and apparently there will likely never be one, but PTX->SASS seems to be backwards compatible, so choosing a specific, reasonably old PTX version for all targets is probably a reasonable solution, unless you/LLVM need something specific from a newer PTX version.
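For reference, a minimal sketch of such a runtime query (plain CUDA runtime API, independent of this PR), deriving the `sm_XY` string by concatenating the major and minor compute capability as described above:

```cpp
// Minimal sketch (not part of this PR): query the compute capability of
// device 0 at runtime and derive the "sm_XY" string from it.
#include <cuda_runtime.h>

#include <cstdio>
#include <string>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    // The SM version is the major and minor compute capability concatenated,
    // e.g. major 8 / minor 6 -> "sm_86".
    std::string chip = "sm_" + std::to_string(prop.major) + std::to_string(prop.minor);
    std::printf("triple: nvptx64-nvidia-cuda, chip: %s\n", chip.c_str());
    return 0;
}
```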
This is a draft PR, not intended to be merged atm.
The PR description is also WIP and will be updated with more information.
Description
This PR introduces an experimental code generation pipeline to target the GPU. Instead of lowering DaphneOps to calls to the precompiled CUDA/cuDNN kernels, this work aims to generate code that can be lowered to target the GPU directly.
Currently only the `daphne.matMul` operation is supported. In the `GPULoweringPass`, the `daphne.matMul` operation is rewritten to a set of operations from the `Linalg`, `MemRef`, and `GPU` dialects.

Input to `GPULoweringPass`

Output of `GPULoweringPass`

The `gpu.host_register` op registers the memref for access from the device. The inputs to these `gpu.host_register` ops have to be cast to an unranked memref. We create `linalg.fill` and `linalg.matmul` ops which will be lowered by the pipeline. Additionally, we insert input and output conversions to and from our C++ runtime.

After this initial rewrite, the IR is passed to a lowering pipeline to produce code that can be run directly on the GPU. This includes lowering to the GPU dialect, producing NVVM IR, and generating a CUBIN which is embedded in the IR as an attribute.
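To illustrate the shape of this rewrite, here is a minimal C++ sketch. It uses hypothetical helper names and upstream MLIR builders, not the actual `GPULoweringPass` code; `lhs`, `rhs`, and `out` are assumed to be ranked f32 memrefs, and builder signatures vary across MLIR versions:

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Hypothetical helper (not the actual pass code): cast a ranked memref to an
// unranked one so it can be passed to gpu.host_register.
static Value castToUnranked(PatternRewriter &rewriter, Location loc, Value buf) {
    auto elemTy = buf.getType().cast<MemRefType>().getElementType();
    // Null memory-space attribute = default memory space.
    auto unrankedTy = UnrankedMemRefType::get(elemTy, /*memorySpace=*/Attribute());
    return rewriter.create<memref::CastOp>(loc, unrankedTy, buf);
}

// Sketch of the rewrite body: register the buffers for device access, then
// emit linalg.fill / linalg.matmul for the subsequent lowering pipeline.
static void emitGpuMatmul(PatternRewriter &rewriter, Location loc,
                          Value lhs, Value rhs, Value out) {
    rewriter.create<gpu::HostRegisterOp>(loc, castToUnranked(rewriter, loc, lhs));
    rewriter.create<gpu::HostRegisterOp>(loc, castToUnranked(rewriter, loc, rhs));
    rewriter.create<gpu::HostRegisterOp>(loc, castToUnranked(rewriter, loc, out));

    // Zero-initialize the output buffer before accumulating into it.
    Value zero = rewriter.create<arith::ConstantOp>(loc, rewriter.getF32FloatAttr(0.0f));
    rewriter.create<linalg::FillOp>(loc, ValueRange{zero}, ValueRange{out});
    rewriter.create<linalg::MatmulOp>(loc, ValueRange{lhs, rhs}, ValueRange{out});
}
```

The input/output conversions to and from the C++ runtime mentioned above are omitted here; they wrap the memrefs before and after this snippet.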
Pipeline visualization
```mermaid
graph TD;
    Input-->LowerToGPU;
    LowerToGPU-->LinalgToParallelLoops;
    LinalgToParallelLoops-->RewriteToCallKernel;
    RewriteToCallKernel-->GPUMapParallelLoops;
    GPUMapParallelLoops-->ParallelLoopsToGPU;
    ParallelLoopsToGPU-->GPUKernelOutlining;
    GPUKernelOutlining-->LowerAffine;
    LowerAffine-->ArithToLLVM;
    ArithToLLVM-->CanonicalizeAndCSE;
    CanonicalizeAndCSE-->SCFToCF;
    SCFToCF-->CFToLLVM;
    CFToLLVM-->LowerGPUOpsToLLVMOps;
    LowerGPUOpsToLLVMOps-->ArithToLLVM;
    ArithToLLVM-->GpuSerializeToCubinPass;
    GpuSerializeToCubinPass-->LowerToLLVM;
    LowerToLLVM-->ReconcileUnrealizedCasts;
```

At the moment, no optimizations are applied to the matmul operator, resulting in less-than-optimal performance. I'll add optimization passes to improve performance.
The pipeline is currently not very stable: changes in the order of passes can introduce problems. I hope to either simplify the pipeline or make it more obvious in which order the passes have to run.
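For orientation, here is a rough sketch of how such a pipeline can be assembled with `mlir::PassManager`. The pass names follow upstream MLIR and may not match this PR exactly; the DAPHNE-specific passes (e.g., `RewriteToCallKernel`) and the final lowering to LLVM are omitted:

```cpp
#include "mlir/Conversion/AffineToStandard/AffineToStandard.h"
#include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"
#include "mlir/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.h"
#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
#include "mlir/Conversion/SCFToGPU/SCFToGPUPass.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/Dialect/Linalg/Passes.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// Rough sketch of the lowering pipeline using upstream MLIR passes only
// (names and header locations may differ between LLVM versions).
void buildGpuPipeline(mlir::PassManager &pm) {
    pm.addNestedPass<mlir::func::FuncOp>(mlir::createConvertLinalgToParallelLoopsPass());
    pm.addNestedPass<mlir::func::FuncOp>(mlir::createGpuMapParallelLoopsPass());
    pm.addNestedPass<mlir::func::FuncOp>(mlir::createParallelLoopToGpuPass());
    pm.addPass(mlir::createGpuKernelOutliningPass());
    pm.addPass(mlir::createLowerAffinePass());
    pm.addPass(mlir::createCanonicalizerPass());
    pm.addPass(mlir::createCSEPass());
    pm.addPass(mlir::createConvertSCFToCFPass());
    pm.addNestedPass<mlir::gpu::GPUModuleOp>(mlir::createLowerGpuOpsToNVVMOpsPass());
    // Serialize each gpu.module to a CUBIN that is embedded in the IR as an
    // attribute; target triple / sm / ptx versions are hardcoded (see TODO).
    pm.addNestedPass<mlir::gpu::GPUModuleOp>(
        mlir::createGpuSerializeToCubinPass("nvptx64-nvidia-cuda", "sm_86", "+ptx76"));
    pm.addPass(mlir::createReconcileUnrealizedCastsPass());
}
```

In the PR, the DAPHNE-specific rewrites and the remaining LLVM lowering are interleaved with these passes, which is where the ordering sensitivity mentioned above comes from.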
How to run
Requires an LLVM/MLIR build with the `NVPTX` target enabled. `-DCMAKE_CUDA_COMPILER` needs to be set for that (these are currently hardcoded in the PR).

Minimal working example
Note: currently, only single-precision matrices and simple scripts (like the provided example) are tested/supported.
Run by executing `bin/daphne --cuda gpu.daphne`.
TODO
- `build.sh`: make the CUDA-related build settings (e.g., `-DCMAKE_CUDA_COMPILER`) configurable instead of hardcoded
- remove the hardcoded target triple, SM version, and PTX version in `mlir::createGpuSerializeToCubinPass("nvptx64-nvidia-cuda", "sm_86", "+ptx76")`
Tested with Driver Version 545.23.06, CUDA Version 12.3.