ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License

How to perform matrix construction on GPU devices without data transfer #164

Closed. ztdepztdep closed this issue 2 years ago.

ztdepztdep commented 4 years ago

I need to compute the matrix A at every time step, and it is very costly to transfer the data between the GPU and the CPU. How can I construct the matrix A directly on the GPU?

ddemidov commented 4 years ago

The setup phase of amgcl is performed on the CPU, so if you are rebuilding the AMG hierarchy on each time step, then you have to pass the matrix to the amgcl constructor in CPU memory. If you are reusing the same preconditioner over multiple time steps (and using the solve(A, f, x) overload as opposed to solve(f, x)), then you could in principle construct the matrix on the GPU directly. Note that the matrix constructors in both the cuda and vexcl backends still take the matrix as CRS arrays in CPU memory, but you can use your own matrix structure as long as you specialize the backend::spmv_impl and backend::residual_impl templates.
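For illustration, below is a minimal sketch of what such a specialization could look like. The type my_gpu_matrix and its spmv()/residual() members are hypothetical placeholders for a user-managed, GPU-resident matrix, and the exact primary-template parameter lists should be checked against amgcl/backend/interface.hpp:

```cpp
#include <amgcl/backend/interface.hpp>

// Hypothetical user-owned matrix that lives entirely in GPU memory.
// The spmv()/residual() members are assumed to be backed by your own
// CUDA/OpenCL kernels operating on the backend's vector type
// (e.g. vex::vector<double> or thrust::device_vector<double>);
// their implementations are omitted here.
struct my_gpu_matrix {
    template <class V1, class V2>
    void spmv(double alpha, const V1 &x, double beta, V2 &y) const; // y = alpha * A * x + beta * y

    template <class V1, class V2, class V3>
    void residual(const V1 &rhs, const V2 &x, V3 &r) const;         // r = rhs - A * x
};

namespace amgcl { namespace backend {

// Tell amgcl how to compute y = alpha * A * x + beta * y for my_gpu_matrix:
template <class Alpha, class Beta, class V1, class V2>
struct spmv_impl<Alpha, my_gpu_matrix, V1, Beta, V2> {
    static void apply(Alpha alpha, const my_gpu_matrix &A, const V1 &x, Beta beta, V2 &y) {
        A.spmv(alpha, x, beta, y);
    }
};

// Tell amgcl how to compute r = rhs - A * x for my_gpu_matrix:
template <class V1, class V2, class V3>
struct residual_impl<my_gpu_matrix, V1, V2, V3> {
    static void apply(const V1 &rhs, const my_gpu_matrix &A, const V2 &x, V3 &r) {
        A.residual(rhs, x, r);
    }
};

} } // namespace amgcl::backend
```

With these specializations in place, the GPU-resident matrix can be passed to the solve(A, f, x) overload while the preconditioner built on the first time step keeps being reused.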

klausbu commented 4 years ago

@ztdepztdep

Please let me know when you find an "on-GPU" sparse linear algebra library (one that limits PCIe data transfers to the first time step, with only GPU<->GPU exchange of domain boundary data afterwards) with MPI support for distributed-memory, multi-GPU simulations. I am also looking into that.

oryabkov commented 3 years ago


Hello! I have two additional questions:

  1. As far as I understand your answer, the AMG hierarchy built in the amg constructor can only be constructed on the CPU (regardless of the backend). But what about the rebuild(...) method (which, as I understand it, is similar to AMGX_matrix_replace_coefficients)? Can we somehow update the coarse system matrices, leaving the transfer operators as they are, without moving anything to the host?
  2. In the second part of your answer you suggest creating a user matrix type and implementing a few operations for it, so that it can be used in the solver together with the AMG preconditioner generated on the first time step (in the context of a transient solver). Is it really necessary to create another matrix class? Can we instead reuse an already existing vexcl matrix (for example) generated on the first time step and change its coefficients directly on the GPU (by somehow getting the plain data pointers out of these classes)?

Thank you in advance and good luck with future library development.

ddemidov commented 3 years ago

The rebuild method still works on the CPU side, so you need to have the matrix available there.

When precond.allow_rebuild is set to true, the CPU representation of the transfer matrices P and R is kept alongside their GPU representation and is used during the rebuild to propagate the new system matrix down the hierarchy with the Galerkin operator A_c = R A_f P. The matrix-matrix products in the Galerkin operator are performed on the CPU side, and the results are moved into GPU memory. So there is no shortcut available that would allow you to modify the system matrix directly in GPU memory.
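As a concrete illustration of this workflow, here is a hedged sketch with the VexCL backend: the matrix is kept as CRS arrays on the host, the preconditioner is built once with precond.allow_rebuild enabled, and rebuild() is called with the updated host-side matrix on later time steps. The choice of coarsening, relaxation, and iterative solver below is arbitrary, and the exact rebuild() signature and accessors should be checked against the amgcl version in use:

```cpp
#include <vector>
#include <tuple>

#include <vexcl/vexcl.hpp>

#include <amgcl/backend/vexcl.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/bicgstab.hpp>

typedef amgcl::backend::vexcl<double> Backend;
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
        >,
    amgcl::solver::bicgstab<Backend>
    > Solver;

void time_loop(vex::Context &ctx, ptrdiff_t n,
        std::vector<ptrdiff_t> &ptr, std::vector<ptrdiff_t> &col, std::vector<double> &val,
        vex::vector<double> &f, vex::vector<double> &x)
{
    Backend::params bprm;
    bprm.q = ctx;

    Solver::params prm;
    prm.precond.allow_rebuild = true; // keep CPU copies of R and P for later rebuilds

    // Build the solver (and the AMG hierarchy) once, on the first time step:
    Solver solve(std::tie(n, ptr, col, val), prm, bprm);

    for (int step = 0; step < 100; ++step) {
        // ... reassemble val on the host (the sparsity pattern stays fixed) ...

        // Push the new matrix down the existing hierarchy (A_c = R A_f P, done on the CPU):
        solve.precond().rebuild(std::tie(n, ptr, col, val), bprm);

        size_t iters;
        double error;
        std::tie(iters, error) = solve(f, x);
    }
}
```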