ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License
737 stars, 112 forks

Solving on distributed system #204

Open saurabh636 opened 3 years ago

saurabh636 commented 3 years ago

Hi Denis

I am assembling FE equations locally on each processor (or thread). What ingredients do I need to solve the system in parallel? It appears to be solving on each proc independently.

I have the local-to-global map but cannot figure out how to provide it to the solver, which needs it to solve on all procs in tandem.

Can you provide any pointers? examples?

Best, -saurabh

ddemidov commented 3 years ago

Have you seen this? https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html

saurabh636 commented 3 years ago

Hi Denis

Thanks, I have seen that example, but I am looking to run shared-memory parallel, not MPI.

Anything?

Thanks Saurabh


ddemidov commented 3 years ago

I am sorry, I do not understand you then. Shared-memory parallel sounds like OpenMP to me: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3Db.html

saurabh636 commented 3 years ago

My apologies, the situation is indeed a bit complicated. What I have is a domain decomposition setup for FE, but it does not directly rely on MPI; instead, it is a neutral/transparent interface/API over both SMP and MPI. I can set up FE matrices on each proc or thread, but I cannot, for example, use amgcl/mpi, since I cannot pass a "world" to the solver. I guess what I need is to re-implement AMGCL/mpi on top of my native interface. Do you think this is feasible? Again, any pointers?

ddemidov commented 3 years ago

instead, it is a neutral/transparent interface/API over both SMP and MPI

If you have MPI under the abstraction, maybe it is possible to somehow get the MPI communicator out of it? Another option (if you want to test whether amgcl is able to solve your problem before diving into the code) would be to save the systems assembled for each subdomain and feed those to examples/mpi/mpi_solver in MatrixMarket format:

https://github.com/ddemidov/amgcl/blob/720f6b7bc325a8a87df9ae33da125a80069b1d5c/examples/mpi/mpi_solver.cpp#L440-L449
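For the record, the MatrixMarket coordinate format that mpi_solver reads is simple to emit by hand. Below is a self-contained sketch of a writer for a CRS matrix; the function `mm_coordinate` is a hypothetical stand-in for illustration (amgcl also ships an I/O helper in amgcl/io/mm.hpp that would normally be used instead):

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Serialize a CRS matrix into MatrixMarket coordinate format:
// a banner line, a "rows cols nnz" line, then one 1-based
// "row col value" triple per nonzero.
std::string mm_coordinate(std::size_t nrows, std::size_t ncols,
                          const std::vector<std::size_t> &ptr,
                          const std::vector<std::size_t> &col,
                          const std::vector<double> &val)
{
    std::ostringstream os;
    os << "%%MatrixMarket matrix coordinate real general\n";
    os << nrows << " " << ncols << " " << col.size() << "\n";
    for (std::size_t i = 0; i < nrows; ++i)
        for (std::size_t j = ptr[i]; j < ptr[i + 1]; ++j)
            os << i + 1 << " " << col[j] + 1 << " " << val[j] << "\n";
    return os.str();
}
```

Writing one such file per subdomain lets you try the stock MPI example on your data before committing to any code changes.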

I guess what I need is to re-implement AMGCL/mpi on top of my native interface. Do you think this is feasible? Again, any pointers?

It should be possible, but not easy. You will have to look into the amgcl/mpi source code for any pointers. You will probably need to abstract away (or reimplement) all of the MPI calls there.

saurabh636 commented 3 years ago

Unfortunately, I cannot access the MPI communicator, but I will give a re-implementation using the native API a shot.

saurabh636 commented 3 years ago

Hi

For AMGCL's parallel solver, the examples suggest using AMGCL's distributed matrix:

auto A = std::make_shared<amgcl::mpi::distributed_matrix<Backend>>(comm, local, remote);

where the local and remote matrices are "crs" types. Typically, in an FE setting, one has a local matrix which contains remote values for inter-proc boundaries, plus a local-to-global dof map; this paradigm, however, forces splitting the matrix into local-only and remote-only components. Is that necessary? Is there a way to work with the composite local matrix?

Also, I would like to use a user-defined matrix that implements rows_impl<> and works in serial for distributed_matrix as well. How can I do that, since it appears distributed_matrix is initialized with a "crs"-derived type only?

Thanks for your feedback.

ddemidov commented 3 years ago

You can just send local strips of the system matrix to each process (where each strip contains a sequential chunk of rows, and the columns are globally numbered): https://github.com/ddemidov/amgcl/blob/720f6b7bc325a8a87df9ae33da125a80069b1d5c/amgcl/mpi/make_solver.hpp#L90-L99

The matrix will be split into local and remote parts behind the scenes. This should also accept a matrix adapter (including your own that implements rows_impl<>).
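To make the strip layout concrete, here is a self-contained sketch (the `Strip` type, `poisson1d_strip`, and the 1-D Poisson problem are illustrative assumptions, not amgcl API): each rank assembles a contiguous chunk of rows, and column indices stay in global numbering, so entries coupling to other ranks' rows need no special handling at this stage.

```cpp
#include <cstddef>
#include <vector>

// CRS strip owned by one rank: a contiguous chunk of rows of the
// global matrix, with columns in global numbering.
struct Strip {
    std::size_t row_beg, row_end;        // global row range [row_beg, row_end)
    std::vector<std::size_t> ptr, col;   // usual CRS layout
    std::vector<double> val;
};

// Assemble this rank's rows of a 1-D Poisson matrix of global size n
// (tridiagonal: -1, 2, -1), keeping global column indices.
Strip poisson1d_strip(std::size_t n, std::size_t row_beg, std::size_t row_end) {
    Strip s;
    s.row_beg = row_beg;
    s.row_end = row_end;
    s.ptr.push_back(0);
    for (std::size_t i = row_beg; i < row_end; ++i) {
        if (i > 0)     { s.col.push_back(i - 1); s.val.push_back(-1.0); }
        s.col.push_back(i); s.val.push_back(2.0);
        if (i + 1 < n) { s.col.push_back(i + 1); s.val.push_back(-1.0); }
        s.ptr.push_back(s.col.size());
    }
    return s;
}
```

With amgcl, a strip like this would typically be handed to the MPI solver as a tuple adapter over the local row count and the ptr/col/val arrays, as in the linked make_solver example.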

saurabh636 commented 3 years ago

Does a row need to be entirely on a single proc, or could it be split across procs as well?

My matrix adapter is based on CRS as well, and contains (GlobalRow, GlobalCol, Val) triplets on each proc. A given (GlobalRow, GlobalCol) pair may repeat on multiple procs, which implies the Vals need to be added up for that pair of row and column. Will that work?

ddemidov commented 3 years ago

Does a row need to be entirely on a single proc, or could it be split across procs as well?

Yes, each row has to be owned by a single process.

My matrix adapter is based on CRS as well, and contains (GlobalRow, GlobalCol, Val) triplets on each proc. A given (GlobalRow, GlobalCol) pair may repeat on multiple procs, which implies the Vals need to be added up for that pair of row and column. Will that work?

No, you will have to finish the assembly (sum the duplicated nonzero values across different processes) before sending it to amgcl.
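To illustrate the local half of finishing the assembly, here is a self-contained sketch (the `Entry` type and `merge_duplicates` are hypothetical helpers, not amgcl code) that collapses repeated (row, col) entries of a triplet list by summing their values. Exchanging interface rows between processes would still have to be done with the native communication layer first, so that each duplicated entry ends up on the row's owning process.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

struct Entry { std::size_t row, col; double val; };

// Finish assembly of a triplet list: sort by (row, col) and sum the
// values of repeated (row, col) pairs, producing one entry per nonzero.
std::vector<Entry> merge_duplicates(std::vector<Entry> a) {
    std::sort(a.begin(), a.end(), [](const Entry &x, const Entry &y) {
        return std::tie(x.row, x.col) < std::tie(y.row, y.col);
    });
    std::vector<Entry> out;
    for (const Entry &e : a) {
        if (!out.empty() && out.back().row == e.row && out.back().col == e.col)
            out.back().val += e.val;   // same nonzero assembled twice: add up
        else
            out.push_back(e);
    }
    return out;
}
```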

saurabh636 commented 3 years ago

OK, I can perhaps ensure the assembly; however, I am not sure it will be efficient to gather the entire row on a single processor. This partition occurs naturally in an FE system; is there a way to allow for this type of (block) partition in amgcl?

From what I reckon, internally amgcl in fact converts strips of rows into blocks before solving. Isn't that true?

ddemidov commented 3 years ago

is there a way to allow for this type of (block) partition in amgcl?

No, there is not.

From what I reckon, internally amgcl in fact converts strips of rows into blocks before solving. Isn't that true?

No. amgcl does separate local and remote columns (the diagonal and off-diagonal blocks), but it does not transfer the ownership of columns to other processes. A row is completely owned by a single process.
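The split described above can be illustrated with a self-contained sketch (the `Crs` type and `split_strip` are hypothetical, not amgcl internals): columns that fall inside the rank's owned row range go to the local (diagonal) block and are renumbered from zero, while the rest go to the remote (off-diagonal) block and keep their global numbers.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Crs {
    std::vector<std::size_t> ptr{0}, col;
    std::vector<double> val;
};

// Split one rank's strip (CRS with global columns, owned rows
// [beg, end)) into the diagonal block ("local": owned columns,
// shifted to start at zero) and the off-diagonal block ("remote":
// columns kept in global numbering for later exchange).
std::pair<Crs, Crs> split_strip(
        std::size_t beg, std::size_t end,
        const std::vector<std::size_t> &ptr,
        const std::vector<std::size_t> &col,
        const std::vector<double> &val)
{
    Crs loc, rem;
    for (std::size_t i = 0; i + 1 < ptr.size(); ++i) {
        for (std::size_t j = ptr[i]; j < ptr[i + 1]; ++j) {
            if (col[j] >= beg && col[j] < end) {
                loc.col.push_back(col[j] - beg); // shift to local numbering
                loc.val.push_back(val[j]);
            } else {
                rem.col.push_back(col[j]);       // keep global numbering
                rem.val.push_back(val[j]);
            }
        }
        loc.ptr.push_back(loc.col.size());
        rem.ptr.push_back(rem.col.size());
    }
    return {loc, rem};
}
```

Note that the split only partitions columns within rows this rank already owns; no row ever changes hands, which is exactly the restriction discussed above.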

saurabh636 commented 3 years ago

If I am able to pre-split the matrix into local and remote parts, do I still need to ensure that a contiguous chunk of rows belongs entirely to a single processor?

Locating an entire row on a single processor appears to be too strong a restriction for a general solver.

Just to let you know, I am able to solve now, after taking a huge communication hit reshuffling rows back to their respective procs before passing the matrix to AMGCL.

ddemidov commented 3 years ago

do I still need to ensure that a contiguous chunk of rows belongs entirely to a single processor?

yes

Locating an entire row on a single processor appears to be too strong a restriction for a general solver. Just to let you know, I am able to solve now, after taking a huge communication hit reshuffling rows back to their respective procs before passing the matrix to AMGCL.

Somebody has to do the reordering anyway. You can try to express the matrix permutation operation in terms of a matrix-matrix product, similar to how it is done here:

https://github.com/ddemidov/amgcl/blob/681bfc71edf6c3721109bd0f263b33c92ec3ba2e/examples/mpi/mpi_solver.cpp#L210-L212

If that works better than your own implementation, then we can see whether it is possible to make it generic enough to include in the library API.
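For reference, the idea of expressing a reordering as a matrix product can be checked on a small dense example (plain C++ sketch, not amgcl code): with a permutation matrix P defined by P(i, p[i]) = 1, the triple product Pᵀ A P is A with both its rows and columns permuted, so that the result B satisfies B[p[i]][p[j]] == A[i][j]. In amgcl this would be done with the sparse product routines on the distributed matrix, as in the linked mpi_solver code.

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Dense matrix product C = A * B (sizes assumed compatible).
Mat matmul(const Mat &A, const Mat &B) {
    std::size_t n = A.size(), m = B[0].size(), k = B.size();
    Mat C(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t l = 0; l < k; ++l)
            for (std::size_t j = 0; j < m; ++j)
                C[i][j] += A[i][l] * B[l][j];
    return C;
}

// Permutation matrix P with P(i, p[i]) = 1.
Mat perm_matrix(const std::vector<std::size_t> &p) {
    Mat P(p.size(), std::vector<double>(p.size(), 0.0));
    for (std::size_t i = 0; i < p.size(); ++i) P[i][p[i]] = 1.0;
    return P;
}

Mat transpose(const Mat &A) {
    Mat T(A[0].size(), std::vector<double>(A.size()));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j)
            T[j][i] = A[i][j];
    return T;
}

// Reorder A by permutation p via the triple product P^T * A * P.
Mat reorder(const Mat &A, const std::vector<std::size_t> &p) {
    Mat P = perm_matrix(p);
    return matmul(transpose(P), matmul(A, P));
}
```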