ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License

Difference between using make_shared on distributed matrix vs. solver #248

Open allaffa opened 1 year ago

allaffa commented 1 year ago

I see that different examples propose different options for solving a linear system with a distributed matrix.

Option 1 is described in this example: https://github.com/ddemidov/amgcl/blob/master/examples/mpi/mpi_solver.cpp

  1. Read chunks of the matrix across different processes
  2. Use make_shared on the solver:
solve = std::make_shared<Solver>(comm,
                amgcl::adapter::block_matrix<val_type>(std::tie(chunk, ptr, col, val)),
                prm, bprm);

Option 2 is described in this example: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html

  1. Read chunks of the matrix across different processes
  2. Create the distributed matrix from the local parts:
    auto A = std::make_shared<amgcl::mpi::distributed_matrix<DBackend>>(
            world, std::tie(chunk, ptr, col, val));
  3. Initialize the solver without calling make_shared:
    prof.tic("setup");
    Solver solve(world, A);
    prof.toc("setup");

I have two questions:

Question 1: Are the two options equivalent, or is one preferable over the other depending on the situation (e.g., scale, sparsity pattern, ...)?

Question 2: In the example https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html, the matrix is re-partitioned before being passed to the solver's constructor. Is this re-partitioning needed, or can one use the original partition as well?

    if (world.size > 1) {
        prof.tic("partition");
        Partition part;

        // part(A) returns the distributed permutation matrix:
        auto P = part(*A);
        auto R = transpose(*P);

        // Reorder the matrix:
        A = product(*R, *product(*A, *P));

        // and the RHS vector:
        std::vector<double> new_rhs(R->loc_rows());
        R->move_to_backend(typename DBackend::params());
        amgcl::backend::spmv(1, *R, rhs, 0, new_rhs);
        rhs.swap(new_rhs);

        // Update the number of the local rows
        // (it may have changed as a result of permutation):
        chunk = A->loc_rows();
        prof.toc("partition");
    }

Thank you again for your support and timely response to the questions :-)

ddemidov commented 1 year ago

The difference between using shared_ptr and not is just a matter of convenience. Look closely at what is happening in the example at your first link: the solver is constructed inside a function and returned for further use. A shared pointer makes it easy to transfer ownership in this case, so that the solver is properly destroyed and no memory is leaked when it is no longer used. It also ensures that only the pointer to the solver is copied around, rather than the large structure itself. Internally, the two options are completely equivalent.
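
To illustrate, here is a minimal sketch of that pattern (the helper name and the exact signature are illustrative, not copied from the example):

    // Build the solver inside a helper and hand ownership back to the caller.
    // The shared_ptr keeps the heap-allocated solver alive after the helper
    // returns and frees it automatically once the last copy goes out of scope.
    template <class Matrix>
    std::shared_ptr<Solver> build_solver(amgcl::mpi::communicator comm, const Matrix &A,
            const typename Solver::params &prm = typename Solver::params())
    {
        return std::make_shared<Solver>(comm, A, prm);
    }

    // At the call site only the pointer is copied, not the solver hierarchy.
    // chunk, ptr, col, val are the local parts of the matrix; rhs and x are
    // the local parts of the right-hand side and the solution:
    auto solve = build_solver(comm, std::tie(chunk, ptr, col, val));
    size_t iters;
    double error;
    std::tie(iters, error) = (*solve)(rhs, x);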

Re-partitioning becomes more beneficial as the number of MPI processes grows. You can get away with naive partitioning (just reading chunks of consecutive rows of the input system) for a small number of processes (probably up to 8-16). As the number of processes grows, the ratio of boundary-to-internal points on the subdomains increases, and communication becomes the bottleneck of the solution. Proper partitioning also improves the convergence speed and reduces the number of required iterations. This is briefly discussed around Fig. 3 here: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html.
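
For reference, the Partition type in your snippet above is defined in the tutorial source depending on which graph partitioning library amgcl was built with, roughly along these lines:

    // At the top of the file, pull in whichever partitioner is available:
    #if defined(AMGCL_HAVE_PARMETIS)
    #  include <amgcl/mpi/partition/parmetis.hpp>
    #elif defined(AMGCL_HAVE_SCOTCH)
    #  include <amgcl/mpi/partition/ptscotch.hpp>
    #endif

    // Where the matrix is re-partitioned:
    #if defined(AMGCL_HAVE_PARMETIS)
        typedef amgcl::mpi::partition::parmetis<DBackend> Partition;
    #elif defined(AMGCL_HAVE_SCOTCH)
        typedef amgcl::mpi::partition::ptscotch<DBackend> Partition;
    #endif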