SimFlowCFD / RapidCFD-dev

RapidCFD is an OpenFOAM fork running fully on the CUDA platform. Brought to you by
https://sim-flow.com

OpenCL porting? #46

Open Foadsf opened 6 years ago

Foadsf commented 6 years ago

I was wondering if there is any chance that we might get an OpenCL port? It is cross-platform and vendor-neutral, so it respects the values of the Free and Open Source world.

daniel-jasinski commented 6 years ago

Porting to OpenCL is extremely difficult, and the only way I can see it being even close to doable is by using some high-level API like Boost.Compute.

Something much more realistic would be porting it to the Clang GPU compiler.

Foadsf commented 6 years ago

@daniel-jasinski It is difficult, true, but the sooner we invest in it the better. Using CUDA kind of defeats the whole purpose of Free Software. As for the Clang GPU compiler, I actually don't have that much hope, if that's what you are referring to. I have a desperate hope that people will start developing open-source implementations of OpenCL.

There are also higher-level wrappers/libraries based on OpenCL: CLBlast, clBLAS, clMAGMA, EasyOpenCL, ArrayFire and ViennaCL, to mention some.

daniel-jasinski commented 6 years ago

None of the OpenCL libraries offer the same programming capabilities that CUDA + Thrust have. Those libraries are fine when you intend to port only a small part of the application, like the linear solver, but for an entire large application like OpenFOAM it is really not feasible. It would be extremely difficult to develop, and the code maintenance would be a real horror.
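For readers unfamiliar with the gap being described here, this is roughly what CUDA + Thrust gives essentially for free (a generic sketch, not RapidCFD code):

```cpp
// Generic illustration: a GPU fill-and-reduce in a few lines of Thrust.
// No hand-written kernel and no explicit memory transfers in user code.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    thrust::device_vector<double> x(1 << 20);             // storage allocated on the GPU
    thrust::sequence(x.begin(), x.end());                  // fill 0, 1, 2, ... on the device
    double sum = thrust::reduce(x.begin(), x.end(), 0.0);  // parallel reduction on the GPU
    std::printf("sum = %f\n", sum);
    return 0;
}
```

Reproducing this level of convenience on top of the OpenCL libraries listed above typically means gluing several of them together, which is part of the maintenance concern raised here.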

For me, much more important would be getting the implementation up to speed with the latest OpenFOAM and maintaining compatibility from then on. I do not think that would be possible with OpenCL; even with a much better programming model like CUDA it is going to be hard.

Either way, if someone is interested in this endeavour, I would be happy to help with my knowledge of the domain.

wyldckat commented 6 years ago

@Foadsf Since you commented about this report on the announcement thread at CFD-Online, please allow me to ask you: do you know of any OpenCL development software which allows using pragmas to tell the compiler which parts of the code should run on the CPU and which on the GPU?

Foadsf commented 6 years ago

@wyldckat I'm not sure what you mean by OpenCL development software, but one can use OpenCL kernels to do specific calculations on the GPU. Why are you asking? Could you be more specific?

wyldckat commented 6 years ago

What makes RapidCFD very different from all the others is that instead of implementing dedicated libraries that only work with the GPU (most of the ones listed here), it uses pragmas... er, wait, sorry, it uses macro names such as __HOST____DEVICE__ to tell the compiler which parts of the code are CPU-bound, which are GPU-bound, and which are both.

This reduces the number of changes needed to the original OpenFOAM source code, while also giving finer control over how the code is split between CPU and GPU. See for example this file: https://github.com/Atizar/RapidCFD-dev/blob/f3775ac96129bfee68655e11e63ff5d62bccb4b9/src/OpenFOAM/primitives/Tensor/Tensor.H - if I'm not mistaken, this way the same class structure can be used on the CPU and on the GPU, without having to create dedicated classes for each side and another one to transfer data between them.

So in other words, CUDA's nvcc compiler (if I remember correctly) does a lot of the work of deciding how code and memory are shared between the CPU and the GPU, which also improves performance in comparison to other GPU-related implementations.
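To make the macro mechanism concrete, here is a minimal sketch of the idea (illustrative only; the actual macro definition and the Tensor class live in the RapidCFD sources linked above):

```cpp
// Sketch of the __HOST____DEVICE__ idea. Under nvcc the annotated functions are
// compiled for both host and device; a plain C++ compiler sees ordinary functions.
#ifdef __CUDACC__
    #define __HOST____DEVICE__ __host__ __device__
#else
    #define __HOST____DEVICE__
#endif

// A simplified value type usable in both CPU and GPU code without duplication.
template<class T>
class Vector3
{
    T v_[3];
public:
    __HOST____DEVICE__ Vector3(T x, T y, T z) { v_[0] = x; v_[1] = y; v_[2] = z; }
    __HOST____DEVICE__ T operator[](int i) const { return v_[i]; }
    __HOST____DEVICE__ T magSqr() const { return v_[0]*v_[0] + v_[1]*v_[1] + v_[2]*v_[2]; }
};
```

The same header can then be included by host-side OpenFOAM code and by CUDA kernels, which is what keeps the diff against upstream OpenFOAM comparatively small.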

Hence my question: Do you know of any OpenCL development software that also provides this capability? More specifically, a compiler stack for OpenCL?

Because the remaining tricks are the same as for all others: Use optimized open source library toolkits that provide matrix solving on GPUs.

If you really are looking for an OpenCL implementation that works with OpenFOAM, there is already one called "PARALUTION", listed on the GPGPU page at openfoamwiki.net. The problem is that there is no clear feedback on how efficient it really is in comparison to all other available implementations. The only detail provided by them is that it "works for nearly everything" (e.g. GPUs and Xeon Phi as well); that is, it's not a dedicated solution for OpenFOAM, it simply provides plug-in matrix solvers for OpenFOAM.

Foadsf commented 6 years ago

As I'm no expert, I will try to share what I know and maybe invite others who know more. I added you on Twitter for this reason.

So if I understand correctly, you are using some form of directives to tell the nvcc compiler to run a specific part of the code on the GPU, kind of similar to OpenMP, OpenACC, and Numba. I don't think that there are any OpenCL implementations with such a feature. But:

daniel-jasinski commented 6 years ago

The main idea behind RapidCFD was to fight off Amdahl's law by running all the simulation stages on the GPU. This is an inherently invasive change, and without a high-level programming API it is almost impossible for a large codebase like OpenFOAM.
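To put rough numbers on the Amdahl's law point (illustrative figures, not measurements from RapidCFD): if the linear solver accounts for, say, 70% of the runtime and only that part is moved to the GPU with a 10x speedup, the overall speedup is capped at 1 / (0.3 + 0.7/10) ≈ 2.7x, no matter how fast the GPU is. Keeping every simulation stage on the device is the only way past that ceiling, which is exactly the invasive change described above.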

Although I have not looked very deeply into it, AMD's ROCm looks very promising as a development platform. We could probably abstract away whether the code is compiled for CUDA or ROCm and have a multi-platform project.
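A minimal sketch of what that abstraction layer could look like, assuming a build-time switch (the RAPIDCFD_USE_HIP macro and the gpu* aliases here are hypothetical names, not existing RapidCFD code):

```cpp
// Hypothetical portability shim: route a small set of runtime calls to either
// the CUDA or the HIP runtime, which deliberately mirror each other's APIs.
#ifdef RAPIDCFD_USE_HIP
    #include <hip/hip_runtime.h>
    #define gpuMalloc             hipMalloc
    #define gpuFree               hipFree
    #define gpuMemcpy             hipMemcpy
    #define gpuMemcpyHostToDevice hipMemcpyHostToDevice
    #define gpuMemcpyDeviceToHost hipMemcpyDeviceToHost
    #define gpuDeviceSynchronize  hipDeviceSynchronize
#else
    #include <cuda_runtime.h>
    #define gpuMalloc             cudaMalloc
    #define gpuFree               cudaFree
    #define gpuMemcpy             cudaMemcpy
    #define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
    #define gpuMemcpyDeviceToHost cudaMemcpyDeviceToHost
    #define gpuDeviceSynchronize  cudaDeviceSynchronize
#endif
```

Kernel-side syntax (__global__, __host__, __device__, the triple-chevron launch) is largely shared between CUDA and HIP already, so most of the remaining work would be in the build system and in the Thrust/rocThrust layer.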

gstoner commented 6 years ago

@daniel-jasinski Hey, I am the CTO at AMD for ROCm, and also founded the project. Looking at your code, it would be painful to port this to OpenCL. I agree HIP would be a better programming-language foundation for porting from CUDA to run on AMD GPUs. Note we use it to port all the deep learning frameworks as well.

One thing we worked on with the LLVM community is moving HIP toward being formally supported by the Clang/LLVM community. We started upstreaming the work back in March. With this update, we are also moving longer term from HCC to a standard upstream Clang front end for HIP. These front-end changes support all the CUDA front-end compiler conventions, thanks to this work: https://llvm.org/docs/CompileCudaWithLLVM.html. Google did the development work to bring CUDA kernel compatibility to Clang/LLVM; what was missing was a runtime to support other devices, plus some front-end and base compiler work, which we are doing.

You will see we have already upstreamed our base compiler: https://llvm.org/docs/AMDGPUUsage.html. One thing we did is add a native assembler and disassembler to our compiler, so we support inline ASM in the compilers we develop. As you can imagine, that helped us a lot with the library work. There are other things you might find interesting, like Tensile, which we use to generate tensor and GEMM kernels for rocBLAS. You will also see we are filling in our library base as well. More to come soon.

We have a new server GPU and server CPU coming out with a solid focus on HPC needs. That would be a good fit for OpenFOAM and RapidCFD. I took a quick glance at the code. How tightly coupled are the main CPU program and the offloaded portions of the code? There are some new I/O enhancements for host-to-device and device-to-host transfers that could help if you're more tightly coupled, like GROMACS.

I'll look at the code more in the morning after I get some sleep.

daniel-jasinski commented 6 years ago

@gstoner Thank you for your comment.

It is great that HIP is moving towards becoming part of standard Clang. Taking advantage of the latest language features really improves productivity. I also noticed that you are working on a HIP backend for Thrust, which is awesome because launching kernels manually is still a pain.

In RapidCFD, GPU offloading is intertwined with the main program execution, because heavy-duty computation is done all throughout the simulation. Therefore the goal is to make switching from CPU to GPU execution very low-cost in terms of code. Ideally it would require no code change at many call sites, if the underlying code infrastructure is done right. For that, C++20 concepts would be of great help, but they are still not available in Clang.
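A sketch of the call-site idea, using Thrust's host/device containers as a stand-in for the real field types (the Field alias, the USE_GPU switch and addFields are illustrative, not RapidCFD's actual classes):

```cpp
// Illustrative only: one alias decides where the data lives; the algorithm code
// at the call site does not change between CPU and GPU builds.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

#ifdef USE_GPU
template<class T> using Field = thrust::device_vector<T>;  // data lives on the GPU
#else
template<class T> using Field = thrust::host_vector<T>;    // same code path on the CPU
#endif

// Written once; thrust::transform dispatches to the appropriate backend.
void addFields(const Field<double>& a, const Field<double>& b, Field<double>& result)
{
    thrust::transform(a.begin(), a.end(), b.begin(), result.begin(),
                      thrust::plus<double>());
}
```

Concepts (or, before C++20, heavier template machinery) mostly matter for constraining such templates and keeping the compiler errors readable.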

gstoner commented 6 years ago

Just so you see we made good progress, here is some of the documentation for HIP in Clang 8 (https://clang.llvm.org/docs/genindex.html):

* `--hip-device-lib-path=` (clang command line option)
* `--hip-device-lib=` (clang command line option)
* `--hip-link` (clang command line option)

You can see the header and the call tree for HIP:

* https://clang.llvm.org/doxygen/HIP_8h_source.html
* https://clang.llvm.org/doxygen/HIP_8cpp.html

Source files:

* https://clang.llvm.org/doxygen/HIP_8cpp_source.html
* https://clang.llvm.org/doxygen/HIP_8h_source.html
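For anyone who wants to try that toolchain, a minimal HIP translation unit looks essentially like CUDA. The compile line in the comment below is a hedged example only; exact flags and offload-arch values vary between Clang/ROCm releases (the `--hip-*` options above belong to the same driver support):

```cpp
// vecadd_hip.cpp -- minimal HIP example (illustrative, not RapidCFD code).
// Possible compile line (flags vary by Clang/ROCm version):
//   clang++ -x hip --offload-arch=gfx906 vecadd_hip.cpp -o vecadd_hip
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    hipMalloc(&da, n * sizeof(float));
    hipMalloc(&db, n * sizeof(float));
    hipMalloc(&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Launch one thread per element.
    hipLaunchKernelGGL(vecAdd, dim3((n + 255) / 256), dim3(256), 0, 0, da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("c[0] = %f (expected 3.0)\n", hc[0]);

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```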

gstoner commented 6 years ago

@daniel-jasinski @wyldckat So I had to be hush-hush until now, but we picked up all the PARALUTION IP and have now ported it over to ROCm as rocALUTION (https://github.com/ROCmSoftwarePlatform/rocALUTION). We also released the rocSPARSE library: https://github.com/ROCmSoftwarePlatform/rocSPARSE

Also new in ROCm 1.9: new profiler and trace foundation libraries. rocProfiler is the ROC profiler library, for profiling with perf counters and derived metrics.

PAPI support for rocProfiler

rocTracer is the ROC tracer library, with generic callback/activity APIs for the runtimes. The goal of the implementation is to provide a generic profiler, independent of any specific runtime, for tracing API calls and asynchronous activity.

New in ROCm 1.9: the ROCr Debug Agent foundation, for the debugger.

The ROCr Debug Agent is a library that can be loaded by the ROCm Platform Runtime to provide the following functionality:

* Print the state of wavefronts that report a memory violation or upon executing an "s_trap 2" instruction.
* Allow SIGINT (Ctrl-C) or SIGTERM (kill -15) to print the wavefront state of aborted GPU dispatches.
* It is enabled on Vega 10 GPUs in ROCm 1.9.

FCLC commented 3 years ago

> @daniel-jasinski @wyldckat So I had to be hush-hush until now, but we picked up all the PARALUTION IP and have now ported it over to ROCm as rocALUTION (https://github.com/ROCmSoftwarePlatform/rocALUTION). We also released the rocSPARSE library: https://github.com/ROCmSoftwarePlatform/rocSPARSE
>
> Also new in ROCm 1.9: new profiler and trace foundation libraries. rocProfiler is the ROC profiler library, for profiling with perf counters and derived metrics.
>
> * https://github.com/ROCmSoftwarePlatform/rocprofiler
>
> PAPI support for rocProfiler:
>
> * https://github.com/ROCmSoftwarePlatform/rocm-papi-component
>
> rocTracer is the ROC tracer library, with generic callback/activity APIs for the runtimes. The goal of the implementation is to provide a generic profiler, independent of any specific runtime, for tracing API calls and asynchronous activity.
>
> * https://github.com/ROCmSoftwarePlatform/roctracer
>
> New in ROCm 1.9: the ROCr Debug Agent foundation, for the debugger.
>
> * https://github.com/ROCm-Developer-Tools/rocr_debug_agent
> * https://github.com/ROCm-Developer-Tools/rocr_debug_agent/releases/tag/roc-1.9.0
>
> The ROCr Debug Agent is a library that can be loaded by the ROCm Platform Runtime to provide the following functionality:
>
> * Print the state of wavefronts that report a memory violation or upon executing an "s_trap 2" instruction.
> * Allow SIGINT (Ctrl-C) or SIGTERM (kill -15) to print the wavefront state of aborted GPU dispatches.
> * It is enabled on Vega 10 GPUs in ROCm 1.9.

Apologies for reviving such an old topic, but I can't seem to find a definitive answer about using ROCm-enabled Navi cards with OpenFOAM for a GPGPU compute performance uplift.

Do you know of any documented instances of an OpenFOAM pipeline working completely on current stable releases (presumably kernel 5.4, OpenFOAM v2012, ROCm 4.x mainline, etc.)?

If this isn't the best place to discuss, I'm more than happy to chat via email/messages/Twitter, etc.!