MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 6. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

AMReX-Combustion / PeleLM

An adaptive mesh hydrodynamics simulation code for low Mach number reacting flows

https://amrex-combustion.github.io/PeleLM/

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 6. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. #234

Open WHEREISSHE opened 2 years ago

WHEREISSHE commented 2 years ago

Hi there. I am running Exec/RegTests/EB_FlamePastCylinder. The build went fine, but running ./PeleLM3d.gnu.MPI.ex inputs.3d-regt fails with the following error:

amrex::Abort::0::MLMG failed !!!
SIGABRT
See Backtrace.0 file for details
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

esclapez commented 2 years ago

This indicates that one of the linear solvers did not manage to converge, probably because the tolerances are too tight. Could you increase the linear solver verbosity:

mac_proj.verbose  = 2
nodal_proj.verbose  = 2

and retry. The solver most likely stalls slightly above the required tolerance. Once you've identified which solver is responsible for the problem, you can relax its tolerance slightly.

WHEREISSHE commented 2 years ago

Thank you! I followed your instructions, but the run still failed, this time with the message:

MLMG: Failed to converge after 100 iterations. resid, resid/bnorm = 3.084014821e-09, 1.680522099e-12
amrex::Abort::0::MLMG failed !!!
SIGABRT

Should I tune other parameters? More specifically, how can I find which parameters to adjust?
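With the verbosity raised, the standard output gets long, so it can help to pull the failure lines out of a saved log automatically. The following is only a minimal sketch: the regular expression is modeled on the single failure message quoted in this comment, the exact log format may differ between AMReX versions, and the helper name `parse_mlmg_failure` is hypothetical.

```python
import re

# Pattern modeled on the one failure line quoted in this thread.
# Assumption: other AMReX versions may format this message differently.
FAIL_RE = re.compile(
    r"MLMG: Failed to converge after (\d+) iterations\.\s*"
    r"resid, resid/bnorm = ([0-9.eE+-]+), ([0-9.eE+-]+)"
)


def parse_mlmg_failure(text):
    """Return a list of (iterations, resid, resid_over_bnorm) found in text."""
    return [
        (int(iters), float(resid), float(rel))
        for iters, resid, rel in FAIL_RE.findall(text)
    ]


# The message from this thread, used here as sample input.
log = (
    "MLMG: Failed to converge after 100 iterations. "
    "resid, resid/bnorm = 3.084014821e-09, 1.680522099e-12"
)

failures = parse_mlmg_failure(log)
for iters, resid, rel in failures:
    print(f"iterations={iters} resid={resid:.3e} resid/bnorm={rel:.3e}")
```

The third value, resid/bnorm, is the relative residual, which is the number to compare against a relative tolerance setting.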

WHEREISSHE commented 2 years ago

It seems to work properly now that I have increased the tolerance to 1.0e-8, but I still don't know whether this value is suitable. How should I choose good values for the tolerance and verbosity settings? Thanks.

esclapez commented 2 years ago

If you keep the verbosity at 2, the standard output will get significantly longer, but you will be able to keep track of the linear solvers' behavior. When it comes to tolerances, the one you mostly want to control is the relative one:

mac_proj.rtol = 1e-10
nodal_proj.rtol = 1e-10

In my experience, going higher than 1e-9 might indicate that something is wrong in the setup, unless you have added multiple levels and have very fine grids. From the message you pasted above,

MLMG: Failed to converge after 100 iterations. resid, resid/bnorm = 3.084014821e-09, 1.680522099e-12

the relative residual stalled at ~1e-12, so going to 1e-10 should relax the constraint enough for the solver to move forward.
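The arithmetic behind that suggestion can be checked quickly. The sketch below assumes a simplified convergence test, namely that the solver is considered converged once resid/bnorm drops to or below the relative tolerance; the actual criterion in the solver may also involve an absolute tolerance, which is ignored here. The function name `meets_rtol` is hypothetical.

```python
def meets_rtol(rel_resid, rtol):
    """Simplified check (assumption): converged when resid/bnorm <= rtol."""
    return rel_resid <= rtol


# resid/bnorm reported in the failure message from this thread.
rel_resid = 1.680522099e-12

# Compare the stalled relative residual against a few candidate tolerances.
for rtol in (1e-12, 1e-11, 1e-10):
    status = "would pass" if meets_rtol(rel_resid, rtol) else "would fail"
    print(f"rtol={rtol:.0e}: {status}")
```

Under this assumption, a solver stalled at about 1.7e-12 misses a 1e-12 target but clears 1e-10 with room to spare, which is consistent with the advice above.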