tzanio closed this issue 2 years ago
With the announcement that OLCF Frontier will be AMD CPU/GPU, we should try to get it into our workflow. We can use HIP (an open-source, CUDA-like model that can compile to both CUDA and ROCm), which can be produced almost automatically from CUDA (using hipify-clang), or OpenMP 5 offload as on-node programming models. Note that HIP does not currently support run-time compilation.
HIP nominally compiles to CUDA with negligible overhead, but the toolchain needs to be installed to do so.
OCCA:HIP supports run-time compilation.
Our OCCA backend is in serious need of a performance overhaul, so it would be great if we can also include OCCA:HIP.
Yes, I don't think anything special needs to be done for /gpu/occa/hip versus /gpu/occa/cuda, though the OCCA backend needs attention. My comment on run-time compilation was with regard to @YohannDudouit's native CUDA implementation.
I'm also curious about observed differences in performance characteristics between the Radeon Instinct and V100.
You should follow up with Noel Chalmers. I believe he has run libP experiments with the Radeon Instinct.
Thanks. @noelchalmers, can you share any experiments?
Hi everyone. I'll try to chip in what I know on some of the points in this thread:
In addition to hipify-clang, which ports existing CUDA code by actually analyzing the code's semantics, there is also hipify-perl, a simple script that can convert CUDA code to HIP and at least warn about sections it is unable to translate.
HIP does indeed support runtime compilation in the same way CUDA does. OCCA uses analogous API calls for its runtime compilation of CUDA and HIP. I know the documentation of what is and is not currently in the HIP API is a bit sparse; for now, the HIP Porting Guide is a good resource.
As for V100 vs. Radeon Instinct performance, in micro-benchmarking we've been seeing bandwidth numbers in the 800-900 GB/s range for the MI-60s, and GFLOP/s numbers similar to the PCIe V100s.
I don't have any readily available performance numbers for CEED-relevant benchmarking. My plan is to resurrect the bake-off problems in libP and do some performance analysis to get a better sense of what the Radeons can do compared to the V100s. libP's kernels rely heavily on things like shared memory bandwidth and cache performance, so it will be a good exercise in finding out how portable they are to Radeon.
Thanks, @noelchalmers. On run-time compilation, I don't see anything about porting NVRTC to HIP.
Are there any public clouds with Radeon Instinct (for continuous integration, etc.)?
I just realized that you were referring to NVRTC when you mentioned runtime compilation.
No, HIP currently doesn't support any nvrtc* API calls. I'm not aware of any plans to add these features, but I will ask around. What HIP does support is loading compiled binaries using hipModuleLoad, which is analogous to cuModuleLoad, and finding/launching kernels from that binary.
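A minimal sketch of that module-load path, assuming the ROCm toolchain is installed (this will not compile without it); the code-object filename `kernels.hsaco` and kernel name `axpy` are hypothetical placeholders:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  hipModule_t module;
  hipFunction_t kernel;

  // Load a pre-compiled code object, analogous to cuModuleLoad.
  if (hipModuleLoad(&module, "kernels.hsaco") != hipSuccess) {
    std::fprintf(stderr, "failed to load code object\n");
    return 1;
  }

  // Look up a kernel by name, analogous to cuModuleGetFunction.
  if (hipModuleGetFunction(&kernel, module, "axpy") != hipSuccess) {
    std::fprintf(stderr, "kernel not found in module\n");
    return 1;
  }

  // Launch via the explicit argument list, analogous to cuLaunchKernel.
  int n = 0;                      // placeholder kernel argument
  void *args[] = {&n};
  hipModuleLaunchKernel(kernel,
                        /*grid*/ 1, 1, 1,
                        /*block*/ 64, 1, 1,
                        /*sharedMemBytes*/ 0, /*stream*/ 0,
                        args, nullptr);

  hipModuleUnload(module);
  return 0;
}
```

Without an nvrtc analogue, the missing piece is producing `kernels.hsaco` at run time; it has to be compiled ahead of time (or by invoking the compiler as a subprocess).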
I don't know of any public clouds I can point to using MI-25s or MI-60s yet. Maybe for some CI tests you could try compiling on some Vegas in a GPU Eater session? Not ideal, certainly.
Thanks. It looks like GPU Eater doesn't support docker-machine or Kubernetes so CI integration would be custom and/or not autoscaling, but it's something, so thanks.
Yet another C++ layer, this one providing single source for CPU, OpenCL, and HIP/CUDA. https://github.com/illuhad/hipSYCL
While I still don't see it on the docs website, hiprtc was apparently merged a few months ago: https://github.com/ROCm-Developer-Tools/HIP/pull/1097
I thought we discussed this specifically at CEED3AM and @noelchalmers and Damon were not aware that it existed. Is it something we should be trying now, or is the lack of documentation indication that it's still in easter-egg mode?
I'll close this open-ended issue. There is an improved occa backend coming in #1043. I think at this point we can make new issues for specific backend requests.