ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License

MPI - Different behavior with 1 vs. 2 ranks #125

Closed (DABH closed this issue 4 years ago)

DABH commented 4 years ago

Hi @ddemidov -- first off thanks for all your efforts in developing and maintaining AMGCL!

I'm trying to use AMGCL (with the VexCL backend) in a project where I'm solving Poisson systems with Neumann boundary conditions. First I tried the non-MPI interfaces of AMGCL and everything worked quite well in terms of convergence/performance. However, due to memory constraints I need to use the MPI interface to distribute the solve across multiple nodes.

I believe I set up my system correctly for MPI. I permuted the rows and columns of my original global system so that each partition owns a contiguous block of rows in the permuted system. (Does this sound correct?) I verified manually that if I vertically stack the partitioned row blocks and apply the inverse permutation, I get my original global system back, so I believe the setup is working correctly.

Also, can you confirm what the size of x and b should be when I do the MPI solve? Should they have local sizes (i.e. N / # ranks), or should they have the size of the global x and b (i.e. N)? I am using the global size right now.

When I solve the system with one MPI rank (which should be equivalent to not using MPI), I get a good solution vector, whose entries are on the order of roughly 10^-1 to 10^1.

However, when I solve the system with two MPI ranks, the solution vector I get is on the order of 10^15, and even though AMGCL reports that it "converged," clearly something went wrong. This is strange to me, since I would expect AMGCL to do essentially the same thing regardless of the number of ranks: ghost exchanges happen where necessary, so the overall computation is the same, just distributed.

Do you have any idea what could be the source of the different behavior here, or how to go about debugging this?

In terms of solver/preconditioner settings, I'm using CG and

    typedef amgcl::mpi::make_solver<
        amgcl::runtime::mpi::preconditioner<Backend>,
        amgcl::solver::cg
        > Solver;
...
    boost::property_tree::ptree prm;
    prm.put("solver.type", "cg");
    prm.put("solver.tol", 1e-4);
    prm.put("solver.maxiter", 250);
    prm.put("precond.coarsening.type", "smoothed_aggregation");
    prm.put("precond.relax.type", "chebyshev");
    //prm.put("precond.coarse_enough", 1);
    //prm.put("precond.max_levels", 4);
    //prm.put("precond.direct_coarse", false);
    //prm.put("solver.direct_coarse", false);
    prm.put("precond.class", "relaxation");

(I've tried some of the commented out ones but haven't found anything that makes a difference.)
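For completeness, here is roughly how I construct and invoke the solver (a sketch with placeholder names: world, chunk, the CRS arrays ptr/col/val, the rhs vector, and the VexCL context ctx; not my exact code):

    // Sketch of the setup/solve sequence; names are placeholders.
    amgcl::mpi::communicator world(MPI_COMM_WORLD);

    Backend::params bprm;
    bprm.q = ctx; // VexCL command queues

    // Local strip of the permuted matrix in CRS format; chunk = number of rows in the strip:
    Solver solve(world, std::tie(chunk, ptr, col, val), prm, bprm);

    vex::vector<double> f(ctx, rhs);        // right-hand side (currently global size, per the question above)
    vex::vector<double> x(ctx, rhs.size()); // solution, same size as f
    x = 0.0;

    size_t iters;
    double error;
    std::tie(iters, error) = solve(f, x);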

I know that my system is a little tough since it has a rank-one nullspace (spanned by the constant vector), but AMGCL without MPI seems to have no problem with this, so I'm not sure whether it's really related to why MPI with > 1 rank doesn't converge to a good solution. As a test, I edited amgcl/solver/cg.hpp to add the following after L178 (using latest master):

                scalar_type mean = std::sqrt(inner_product(x, x));
                backend::axpby(one, x, -mean, x);

because keeping the mean of the solution iterate equal to 0 should prevent CG from drifting into the nullspace. However, this didn't seem to help either.

I'd be very grateful for any help you can provide in getting to the bottom of this. Happy to provide any more details, sample matrices, etc., whatever might be helpful. Thanks so much in advance!

ddemidov commented 4 years ago

I'm trying to use AMGCL (with the VexCL backend) in a project where I'm solving Poisson systems with Neumann boundary conditions.

For Neumann conditions, you should set the amg parameter direct_coarse=false, so that the solver does not try to solve the coarsest level directly.
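With the parameter tree above (when precond.class is amg), that corresponds to the line you already have commented out:

    prm.put("precond.direct_coarse", false);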

I believe I set up my system correctly for MPI. I permuted the rows and columns of my original global system so that each partition owns a contiguous block of rows in the permuted system. (Does this sound correct?)

It does.

Also, can you confirm what the size of x and b should be when I do the MPI solve? Should they have local sizes (i.e. N / # ranks), or should they have the size of the global x and b (i.e. N)? I am using the global size right now.

x and b should have local size. That is, the number of elements in x and b on each MPI process should be equal to the number of rows in the local matrix strip.
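For example, with a straightforward contiguous block partitioning of n global rows, the local size could be computed along these lines (just a sketch; world is the amgcl::mpi::communicator):

    // Sketch: contiguous block partitioning of n global rows over world.size ranks.
    ptrdiff_t chunk   = (n + world.size - 1) / world.size;
    ptrdiff_t row_beg = std::min<ptrdiff_t>(n, chunk * world.rank);
    ptrdiff_t row_end = std::min<ptrdiff_t>(n, row_beg + chunk);

    // Rows owned by this rank; this is also the local size of x and b:
    chunk = row_end - row_beg;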

However, when I solve the system with two MPI ranks, the solution vector I get is on the order of 10^15, and even though AMGCL reports that it "converged," clearly something went wrong. This is strange to me, since I would expect AMGCL to do essentially the same thing regardless of the number of ranks: ghost exchanges happen where necessary, so the overall computation is the same, just distributed.

There are some differences in how serial and distributed solvers work, so you would not get bit-by-bit equal solutions, but the difference should not be as big as you state.

Do you have any idea what could be the source of the different behavior here, or how to go about debugging this?

I would try to save the system matrix to the matrix market format (you can do that with amgcl::io::mm_write()), and see if examples/mpi/mpi_amg.cpp demonstrates the same behaviour as your implementation. If it does not, then you probably need to look at how you are forming the inputs.
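For example (a sketch assuming the global CRS arrays and the rhs are available in std::vectors on a single rank; the exact mm_write overloads may differ slightly between amgcl versions):

    #include <amgcl/adapter/crs_tuple.hpp>
    #include <amgcl/io/mm.hpp>
    #include <tuple>
    #include <vector>

    // Dump the assembled system in MatrixMarket format.
    void dump_system(size_t n,
            const std::vector<ptrdiff_t> &ptr,
            const std::vector<ptrdiff_t> &col,
            const std::vector<double>    &val,
            const std::vector<double>    &rhs)
    {
        amgcl::io::mm_write("A.mtx", std::tie(n, ptr, col, val)); // sparse matrix
        amgcl::io::mm_write("b.mtx", rhs.data(), rhs.size());     // dense rhs vector
    }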

ddemidov commented 4 years ago

As a test, I edited amgcl/solver/cg.hpp to add the following after L178 (using latest master): scalar_type mean = std::sqrt(inner_product(x, x)); backend::axpby(one, x, -mean, x); because keeping the mean of the solution iterate equal to 0 should prevent CG from drifting into the nullspace.

I don't think this works as intended. First, m = sqrt(<x,x>) gives you the L2 norm of x, not its mean (you could use that in x = x / m to make x have unit norm). And even if you computed m correctly as the mean of x, axpby(1, x, -m, x) is equivalent to x -= m * x, while what you want is x -= m. There is no backend operation in amgcl that helps you do that, but since you are using the vexcl backend, you could do

vex::Reductor<double, vex::SUM> sum(x.queue_list());
x -= sum(x) / x.size();
ddemidov commented 4 years ago

boost::property_tree::ptree prm;
prm.put("solver.type", "cg");
prm.put("solver.tol", 1e-4);
prm.put("solver.maxiter", 250);
prm.put("precond.coarsening.type", "smoothed_aggregation");
prm.put("precond.relax.type", "chebyshev");
//prm.put("precond.coarse_enough", 1);
//prm.put("precond.max_levels", 4);
//prm.put("precond.direct_coarse", false);
//prm.put("solver.direct_coarse", false);
prm.put("precond.class", "relaxation");

Keep in mind that precond.class=relaxation gives you a single-level (non-AMG) preconditioner, and since you don't set precond.type to anything, you are using the default spai0 relaxation.
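If you do want an actual AMG preconditioner with the coarsening/relaxation choices you listed (plus the direct_coarse advice above), the parameters would look roughly like this (a sketch, not a tested drop-in):

    prm.put("solver.type", "cg");
    prm.put("solver.tol", 1e-4);
    prm.put("solver.maxiter", 250);
    prm.put("precond.class", "amg");
    prm.put("precond.coarsening.type", "smoothed_aggregation");
    prm.put("precond.relax.type", "chebyshev");
    prm.put("precond.direct_coarse", false); // for the pure-Neumann nullspace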

DABH commented 4 years ago

Thank you so much for the fast and extremely helpful reply!

I used your suggestion (my mistake earlier with the mean!)

vex::Reductor<double, vex::SUM> sum(x.queue_list());
x -= sum(x) / x.size();

and this actually did get me reasonable-looking results (the right order of magnitude) for the CG solve on multiple MPI ranks, so that is good news. However, there are still some discrepancies between the solutions I get with 1 vs. 2 ranks, so I'm still debugging.

When I tried to build the mpi_amg example with the above included in cg.hpp, I got errors like the following:

/home/foo/amgcl/amgcl/solver/cg.hpp:182:55: error: ‘class amgcl::backend::numa_vector<double>’ has no member named ‘queue_list’
                 vex::Reductor<double, vex::SUM> sum(x.queue_list());

I tried writing the code in a form without the Reductor but that also led to compilation issues.

Do you have any idea how to make those two lines work with the compile settings used by mpi_amg? It seems like this could be a useful feature if it allows systems like mine to be solved better. In case it's helpful, here are my configuration settings:

OMPI_CXX=g++-7 OMPI_CC=gcc-7 cmake . -DAMGCL_BUILD_EXAMPLES=ON -DBOOST_ROOT=/home/foo/boost_1_71_0 -DBOOST_LIBRARYDIR=/home/foo/boost_1_71_0/stage/lib -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx
ddemidov commented 4 years ago

When I tried to build the mpi_amg example with the above included in cg.hpp, I got errors like the following:

The mpi_amg target builds the example for the builtin backend (where VexCL operations are not supported). You need to build the mpi_amg_vexcl_cl or mpi_amg_vexcl_cuda target for your modification to work. However, I would first check whether the unmodified example works with your problem.
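With the cmake configuration you posted (assuming the default Makefile generator), that would be something like:

    make mpi_amg_vexcl_cl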

ddemidov commented 4 years ago

One more thing re computing mean(x): the approach above gives you the mean of the local part of x, so you'll need to add some MPI operations to compute the global value from it.
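For example, the local reduction could be combined with an MPI_Allreduce along these lines (a sketch; comm stands for the MPI communicator the solver was constructed with):

    // Sketch: subtract the *global* mean of x, not just the local one.
    vex::Reductor<double, vex::SUM> sum(x.queue_list());

    double loc[2]  = { sum(x), static_cast<double>(x.size()) }; // local sum, local size
    double glob[2] = { 0, 0 };
    MPI_Allreduce(loc, glob, 2, MPI_DOUBLE, MPI_SUM, comm);

    x -= glob[0] / glob[1]; // global sum / global size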

DABH commented 4 years ago

Thanks so much @ddemidov, and sorry for my slow reply (I was running a lot of verifications on my code before resolving this issue). I can now get my MPI solves to produce the same solutions for different numbers of ranks, and these also match the non-MPI results. So the issue can be closed.

I did get the zero-mean-shifting working for CG and have it in my fork (https://github.com/DABH/amgcl/commits/master). It only seems to compile with the VexCL backend; I'm not sure if you'd have any interest in making this a properly supported feature of AMGCL (working for all backends) but feel free to take any of my code if you want.

Many thanks again for the great help!

ddemidov commented 4 years ago

Hi David,

I'm not sure if you'd have any interest in making this a properly supported feature of AMGCL (working for all backends) but feel free to take any of my code if you want.

Can you attach an example of the matrix that benefits from this, so I could test it with other backends?

ddemidov commented 4 years ago

Can you attach an example of the matrix that benefits from this, so I could test it with other backends?

(Mostly to bump this up) You can use amgcl::io::mm_write to save the matrix you are sending to amgcl into the matrix market format.

EDIT: make sure you are running on a single MPI rank if you do this.

DABH commented 4 years ago

Thanks - will try to supply an example system in the next few days