NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

AmgX as a pure on-GPU, external, distributed-memory linear algebra library using MPI? #114

Open klausbu opened 4 years ago

klausbu commented 4 years ago

Hello,

I am not sure about the underlying multi-GPU concept of AmgX.

The application I have in mind has the following features:

Are the AmgX solvers built on the GPU?

Is there an example of a distributed-memory/MPI implementation that could leverage AmgX as an external, purely on-GPU matrix solver library?

Klaus

mattmartineau commented 3 years ago

Both the setup and solve phases in AmgX are performed on the GPU unless you tell AmgX to run on the host.

It is straightforward to run with GPU-resident data structures using the same API calls you would use for host-resident data structures. As an example, if you use AMGX_matrix_upload_distributed you can pass either device pointers or host pointers for the CSR matrix column indices, row offsets and values.
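
Below is a minimal sketch (not taken from the AmgX examples, so treat it as an assumption-laden illustration) of what passing device pointers looks like. It stages the CSR arrays on the GPU with plain CUDA calls and hands the device pointers to the single-rank AMGX_matrix_upload_all call, whose argument list is the simplest; per the comment above, AMGX_matrix_upload_distributed accepts device pointers in the same way. The matrix handle is assumed to have been created already with AMGX_matrix_create, and error checking is omitted.

```c
#include <cuda_runtime.h>
#include <amgx_c.h>

/* Upload an n x n CSR matrix (scalar, block size 1x1) from device memory.
 * In a real application the arrays would typically already live on the
 * GPU; the cudaMemcpy staging here is only for the sketch. */
void upload_device_csr(AMGX_matrix_handle mtx,
                       int n, int nnz,
                       const int    *h_row_ptrs,  /* n+1 entries, host */
                       const int    *h_col_idx,   /* nnz entries, host */
                       const double *h_vals)      /* nnz entries, host */
{
    int    *d_row_ptrs, *d_col_idx;
    double *d_vals;

    cudaMalloc((void **)&d_row_ptrs, (n + 1) * sizeof(int));
    cudaMalloc((void **)&d_col_idx,  nnz * sizeof(int));
    cudaMalloc((void **)&d_vals,     nnz * sizeof(double));

    cudaMemcpy(d_row_ptrs, h_row_ptrs, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx,  h_col_idx,  nnz * sizeof(int),     cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals,     h_vals,     nnz * sizeof(double),  cudaMemcpyHostToDevice);

    /* The upload call is the same whether these pointers reference host
     * or device memory; here they are device pointers, so no further
     * host<->device transfer of the matrix is needed. NULL diag_data
     * means the diagonal is stored inside the CSR arrays. */
    AMGX_matrix_upload_all(mtx, n, nnz, 1, 1,
                           d_row_ptrs, d_col_idx, d_vals, NULL);
}
```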

marsaev commented 3 years ago

Hey @klausbu, do you have any follow-up questions after Matthew's comment?

klausbu commented 3 years ago

@Matthew

Let's assume we have a workstation with 4 GPUs and a CFD case large enough to keep them busy:

mattmartineau commented 3 years ago

The distributed model for this application leverages MPI, and that is currently the standard option for such problems. You can domain decompose your problem, pass it to AmgX, and the library will handle the communication necessary for the linear solve. You can look at https://github.com/barbagroup/AmgXWrapper to see some examples of how this could be set up for a CFD code.
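
As a rough sketch of the per-rank setup under that model (assumptions: one MPI rank per GPU, a placeholder config file name "solver.json", no error checking; the pattern loosely follows the AmgXWrapper initialization rather than being a definitive recipe):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <amgx_c.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm = MPI_COMM_WORLD;
    int rank, ndev;
    MPI_Comm_rank(comm, &rank);

    /* Bind each rank to one local GPU (e.g. 4 ranks on a 4-GPU workstation). */
    cudaGetDeviceCount(&ndev);
    int device = rank % ndev;
    cudaSetDevice(device);

    AMGX_initialize();

    AMGX_config_handle    cfg;
    AMGX_resources_handle rsrc;
    AMGX_config_create_from_file(&cfg, "solver.json");   /* placeholder name */

    /* The resources object carries the MPI communicator; AmgX then
     * performs the halo exchanges needed during setup and solve. */
    AMGX_resources_create(&rsrc, cfg, &comm, 1, &device);

    /* ... create matrix/vectors/solver on rsrc, upload this rank's
     *     partition of the decomposed system, solve, download ... */

    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize();
    MPI_Finalize();
    return 0;
}
```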

Keeping data on the GPU is essentially dependent upon your particular application. For the CFD applications I have worked with, the most important aim from a performance perspective is usually to have the outer cycle/timestep loop processed in its entirety on the GPU. You pass the constructed matrices, velocities, pressures, volumes, or whatever quantities will be processed, to the GPU before the cycle and only bring data back at the end (of course some data is still communicated, e.g. halos, scalars for PCG, etc.). You can pass device pointers to AmgX via the API, so it is possible to avoid ping-pong movement of data between CPU and GPU.
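
A hedged sketch of that pattern follows. assemble_pressure_system() is a hypothetical application kernel that builds the system directly in device memory; the AmgX vector handle p keeps the pressure on the GPU between steps (it is reused as the initial guess), the sparsity pattern is assumed to have been uploaded once before the loop, and data returns to the host only after the cycle.

```c
#include <amgx_c.h>

/* Hypothetical application kernel: fills d_vals (matrix coefficients)
 * and d_rhs (right-hand side) in device memory. */
void assemble_pressure_system(double *d_vals, double *d_rhs, int n, int nnz);

void run_timesteps(AMGX_solver_handle solver,
                   AMGX_matrix_handle A,
                   AMGX_vector_handle rhs,
                   AMGX_vector_handle p,        /* pressure, stays on the GPU */
                   int n, int nnz, int nsteps,
                   double *d_vals, double *d_rhs,
                   double *h_p_final)
{
    for (int step = 0; step < nsteps; ++step) {
        /* Build the next linear system entirely in device memory. */
        assemble_pressure_system(d_vals, d_rhs, n, nnz);

        /* Device pointers go straight into AmgX; the sparsity pattern
         * was uploaded before the loop, only coefficients are replaced. */
        AMGX_matrix_replace_coefficients(A, n, nnz, d_vals, NULL);
        AMGX_vector_upload(rhs, n, 1, d_rhs);

        AMGX_solver_setup(solver, A);
        /* p holds the previous step's pressure as the initial guess and
         * receives the new solution without leaving the GPU. */
        AMGX_solver_solve(solver, rhs, p);
    }

    /* Bring the result back to the host only once, after the cycle. */
    AMGX_vector_download(p, h_p_final);
}
```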

We have worked on several projects related to OpenFOAM acceleration using AmgX. One internal project does indeed extend PETSc with an AmgX backend, but this isn't released yet (and there is currently no plan to release it).

Instead, the public work with OpenFOAM + AmgX leverages PETSc4FOAM, which was developed by the OpenFOAM HPC technical committee (in particular CINECA + ESI). I extended this functionality to also call into AmgX and do some additional data transformations, the benefit being improved performance for the pressure solve. Work is still ongoing, and we are optimising for an increasing range of test problems.