doyubkim / fluid-engine-dev

Fluid simulation engine for computer graphics applications
https://fluidenginedevelopment.org/
MIT License

GPGPU-based High Performance Computation #19

Open · doyubkim opened 8 years ago

doyubkim commented 8 years ago

Support GPGPU backend for faster computation. Ideally Nvidia CUDA, but OpenCL is another option.

doyubkim commented 6 years ago

This has been the most wanted feature for a long time. I think it is the right time to push on this issue, together with Issue #18.

This issue is not very different from Issue #18. In fact, the new architecture and API should be carefully designed so that both goals can be achieved within a unified framework.

This requires some internal re-architecturing:

  1. The computation code should be kernel-based rather than the current inline-based implementation. This means a strict separation between the high-level API set (such as solvers with an OOP + functional interface) and the low-level APIs (such as BLAS operations with a C-style interface); see the sketch after this list.
  2. Existing data structures should be more abstract and agnostic to internal buffer types. Direct access and manipulation should be discouraged; callers must go through the kernel-centric APIs.
  3. Items 1 and 2 above should be achieved with minimal changes (ideally no breaking changes, just a few additions) to the existing API.
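
To make item 1 concrete, here is a minimal sketch of what that separation could look like. All names here (`VectorBuffer`, `Backend`, `kernels::axpy*`) are hypothetical, not the engine's actual API; the point is only the shape: C-style kernels underneath, a buffer-agnostic high-level call on top.

```cpp
#include <cstddef>

// Low-level, C-style kernels: one implementation per backend.
__global__ void axpyKernel(size_t n, float a, const float* x, float* y) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

namespace kernels {
void axpyCpu(size_t n, float a, const float* x, float* y) {
    for (size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}
void axpyCuda(size_t n, float a, const float* x, float* y) {
    unsigned threads = 256;
    unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);
    axpyKernel<<<blocks, threads>>>(n, a, x, y);
}
}  // namespace kernels

enum class Backend { kCpu, kCuda };

// Buffer-agnostic container: solvers hold these, never raw pointers.
struct VectorBuffer {
    float* data = nullptr;   // may live in host or device memory
    size_t size = 0;
    Backend backend = Backend::kCpu;
};

// High-level API: dispatches to the right kernel based on data residency.
void axpy(float a, const VectorBuffer& x, VectorBuffer& y) {
    if (x.backend == Backend::kCuda) kernels::axpyCuda(x.size, a, x.data, y.data);
    else                             kernels::axpyCpu(x.size, a, x.data, y.data);
}
```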

My current (high-level) plan for the development is:

  1. Prototype with a simple solver (such as a basic grid-based or particle solver) + CUDA/ISPC.
  2. Expand the prototype to more complicated solvers (such as FLIP or PCISPH).
  3. Settle down both the high- and low-level APIs.
  4. Prototype with another compute backend such as OpenCL, Metal, or Vulkan (just one of them) and test the extensibility of the new API.
  5. Finalize the API and all the implementation.

As you can see, the primary target backends are CUDA (for GPU) and ISPC (for CPU), hopefully followed by one of OpenCL/Metal/Vulkan.

This will be a months-long effort, more like a set of multiple milestones, especially given that I'm doing this as a hobby with limited time, hardware, etc. But it will be a fun project, and I really look forward to seeing a meaningful outcome in the end.

vtgstudio commented 6 years ago

... looking forward to this feature.

danielwinkler commented 6 years ago

Hi Doyub, will this development process be available in the repository (e.g. dev branch or public fork)?

doyubkim commented 6 years ago

@danielwinkler, I will push the intermediate work to a dev branch. The current dev branch has some GL work, which is the groundwork for this GPGPU effort.

doyubkim commented 6 years ago

This issue is now under development in the gpu branch (together with Issue #31). Nothing really useful there at the moment, but I'm hoping to get an SPH implementation out soon.

akav commented 6 years ago

Would it be worth looking at OpenACC as a stepping stone on the path to GPGPU acceleration? The only reason I mention it is that this project might help generate OpenACC directives for the existing source tree: https://github.com/gleisonsdm/DawnCC-Compiler
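
For context, this is roughly what such a directive looks like on an existing serial loop, whether hand-written or tool-generated as DawnCC aims to do. The loop itself is just an illustrative example, not engine code:

```cpp
// Semi-implicit Euler position update, annotated for OpenACC. The pragma
// asks the compiler to offload the loop and manage the host/device copies.
void advectParticles(int n, float dt, const float* vel, float* pos) {
    #pragma acc parallel loop copyin(vel[0:n]) copy(pos[0:n])
    for (int i = 0; i < n; ++i) {
        pos[i] += dt * vel[i];
    }
}
```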

doyubkim commented 6 years ago

Thanks for the suggestion, @akav! OpenACC is an interesting approach since it would make the code more readable. I am already working on an SPH solver with CUDA, but it would be interesting to compare OpenACC and CUDA implementations in the end.

akav commented 6 years ago

Once an OpenACC branch is available, the core should in theory be able to accelerate on FPGAs and other OpenCL devices as well.

subatomicglue commented 6 years ago

The downside of CUDA is no support on MacBook Pros with ATI graphics... I realize these aren't the graphics beasts of the world, but... :) I'm hoping for a selectable simulation engine so I can fall back to C++ (SIMD or another fast approach on the CPU). Slow is better than none...

akav commented 6 years ago

If you add an external GPU chassis (requires Thunderbolt 3) with an NVIDIA graphics adapter to your MacBook, the core engine should deploy and run perfectly fine.

danielwinkler commented 6 years ago

My two cents regarding OpenACC:

I really like the idea and started a project using it. However, as soon as you run into problems, the high level of abstraction becomes prohibitive, and I had to use Stack Overflow to discuss these issues with the compiler developers (who respond very quickly).

One very annoying problem was that you couldn't have pointers in a struct (SoA); the solution was to fall back to unified memory (beta) or write a temporary alias (e.g. double* tmp = mystruct.datapointer). See https://stackoverflow.com/a/32192069/827027 and the related question https://stackoverflow.com/questions/39095908/openacc-and-object-oriented-c
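
A compressed sketch of that aliasing workaround (struct and field names are mine, purely illustrative):

```cpp
struct Particles {   // SoA-style struct holding raw array pointers
    double* pos;
    int n;
};

void scale(Particles& p, double s) {
    double* pos = p.pos;  // temporary alias: the data clause below can name
    int n = p.n;          // a plain pointer, but not p.pos directly
    #pragma acc parallel loop copy(pos[0:n])
    for (int i = 0; i < n; ++i) pos[i] *= s;
}
```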

The samples shown at GTC not only contained a bug but also required PGI-specific extensions: https://stackoverflow.com/questions/39139176/openaccarray-swap-function

Just to say, I really like the approach of OpenACC, but 1 1/2 years ago it was quite immature when it came to data management, although the support from PGI was excellent.

OpenCL, on the other hand, would run everywhere, although the tooling is not quite as comfortable as CUDA's. A possible approach would be developing in CUDA and then trying AMD HIP (HIP: C++ Heterogeneous-Compute Interface for Portability), which ideally boils down to including a header and running AMD's CUDA-to-HIP translation tooling. Unfortunately I have not tested this yet, and there might be some CUDA primitives that are not yet supported; more information at https://developer.amd.com/heterogeneous-computing/
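
For reference, the HIP route means the kernel body itself can stay unchanged; the hipify tools rewrite host-side runtime calls (cudaMalloc becomes hipMalloc, and so on) so the same source builds for AMD GPUs. A trivial example kernel (mine, for illustration only):

```cpp
#include <hip/hip_runtime.h>  // the HIP analogue of cuda_runtime.h

// This kernel compiles unchanged under both CUDA and HIP; only the
// host-side runtime calls need translating (cudaMalloc -> hipMalloc, ...).
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```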

doyubkim commented 6 years ago

Thank you all for the discussion! Great to learn more about OpenACC. I think portability vs. productivity is always a hard question, especially for GPU frameworks given the currently available solutions. I will continue with CUDA in order to complete the feature as soon as possible, and look into either OpenCL or OpenACC for an additional backend.

FYI, there's been some progress on WCSPH with CUDA lately. The code is still not well organized/optimized at all, but I'm hoping to complete the SPH family soon.

doyubkim commented 6 years ago

Update: Prototype versions of the 3D WCSPH and PCISPH solvers are now implemented. The code is super hacky, not optimized, and may have some bugs. But hey, things do seem to work.
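
For readers following along: the "WC" in WCSPH is the weakly compressible equation of state, i.e. Tait's p = B((ρ/ρ₀)^γ − 1) with γ = 7 by convention. A standalone CUDA sketch of that pressure update, not the branch's actual code:

```cpp
// Tait equation of state: p = b * ((rho/rho0)^7 - 1), clamped at zero as is
// common in WCSPH to suppress tensile instability near the free surface.
__global__ void computePressure(int n, float rho0, float b,
                                const float* density, float* pressure) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = density[i] / rho0;
        float p = b * (powf(r, 7.0f) - 1.0f);
        pressure[i] = p > 0.0f ? p : 0.0f;
    }
}
```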

For the following weeks, I will spend some time cleaning up the code and adding 2D solvers. I'm also trying to figure out when would be the best time to merge this branch back to master, given the API-breaking changes and the question of feature coverage (merge particle sim only vs. everything altogether). I'm leaning toward:

  1. Wrap up the SPH family.
  2. Add PBF (Issue #16), which is also purely particle-based.
  3. Add docs/tests/examples.
  4. Merge to master and bump the major version up to v2.0 (API change).
  5. Work on grid-based/hybrid solvers targeting v3.0+.

Any suggestions are welcome.

giordi91 commented 6 years ago

Just to double check, what's the current state of the feature? I am currently working through the book (on my own SPH implementation at the moment). I love GPGPU computing, and that's what I do for work; I would love to help with this, maybe in the optimization stages? So I just wanted to check the current state of the branch with you before going in and having a look around.

M.

doyubkim commented 6 years ago

Thanks for checking in, @giordi91. Great to hear that GPGPU is one of your areas of expertise! Once I clean up the (unoptimized) code, I think you can definitely help me out with the optimization. At this moment the code is too messy, but allow me a couple of weeks for the house cleaning.

doyubkim commented 6 years ago

Here are some updates and the future plan for this task.

  1. Multidimensional Arrays: Lots of base code updates have been made, mainly around multidimensional arrays. I'm trying to make a simple but robust array API that works for both CPU and GPU memories. Still work in progress, but near complete (except for replacing the existing CPU array API).

  2. CUDA Textures: Instead of using C++ polymorphism to represent arbitrary surfaces (which are mostly for colliders and emitters), I'm going to use CUDA textures for implicit surfaces. CUDA does support polymorphism, but it is not great in terms of perf. The input will be the same CPU surface object, which is then cached into a GPU texture for sim use. The basic texture code is just implemented, and a CUDA collider will come next; see the sketch after this list.

  3. Additional solvers? I'm focusing on particle-based solvers for the first release (v2) and will add more solvers later. It would be really great if I could add PBF to the v2 train, but we'll see.

  4. Branch: If anybody is using the gpu branch, it may get rebased onto the latest master (or the v2 dev branch) soon, so you probably want to back up any changes you made.
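
As a rough illustration of item 2 (all names hypothetical): once a signed-distance field is baked into a 3D texture object, a kernel can query it with hardware-filtered lookups instead of virtual dispatch.

```cpp
#include <cuda_runtime.h>

// Sample a signed-distance field baked into a 3D texture. tex3D gives
// hardware trilinear interpolation; no virtual function calls on the GPU.
__global__ void sampleSdf(cudaTextureObject_t sdfTex, const float3* points,
                          float* distances, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float3 p = points[i];  // assumed pre-mapped to texture coordinates
        distances[i] = tex3D<float>(sdfTex, p.x, p.y, p.z);
    }
}
```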

Titaniumtown commented 4 years ago

When I build this it fails, saying that I am missing files in the include/jet/ directory. Even when I copy the required files from the master branch, it still fails.

doyubkim commented 4 years ago

Thanks for checking it out, @Titaniumtown! The branch is quite outdated at the moment, so there could be a number of issues building it. I will update this issue once I make any meaningful progress. Development has been very slow due to personal matters, especially during 2019.

Titaniumtown commented 4 years ago

Ok then, thanks!

stevencui2 commented 1 year ago

Hello, I wonder if there has been any progress on the GPU branch? When I build it, many errors pop up. I was able to fix some easy, minor issues but cannot get further.

doyubkim commented 1 year ago

Hello, sorry for the much-delayed response. The branch is unmaintained right now and was always supposed to be an experimental branch.