FluidityProject / fluidity

Fluidity
http://fluidity-project.org

MPI problem #331

Closed · 100souci closed this issue 3 years ago

100souci commented 3 years ago

Hello,

One of our users is trying to run one of your test models on our HPC cluster. Running this command:

fluidity Stokes-Poiseuille.flml

gives this output:

An error occurred in MPI_Init on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job)
[compute02:108476] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

I also tried this command in a Slurm batch script, with various numbers of tasks:

mpiexec fluidity /home/user1/tests/Stokes_Poiseuille/Stokes-Poiseuille.flml

but I get this error, repeated once per core:

*** ERROR ***
Error message: Unable to open mesh/Rectangular_Pipe_7.msh

MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD with errorcode 16.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

[compute12:183890] 15 more processes have sent help message help-mpi-api.txt / mpi-abort [compute12:183890] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

We use Open MPI 4.0.4 from NVIDIA HPC-X: https://developer.nvidia.com/networking/hpc-x

Do you think this is due to our MPI installation, or are we not launching the command correctly?

Many thanks in advance for your help.

jhill1 commented 3 years ago

Hi there,

The error message is not MPI-related: Fluidity couldn't find the mesh specified in the FLML.

It's expecting to find the mesh at mesh/Rectangular_Pipe_7.msh, so in the flml this will appear as mesh/Rectangular_Pipe_7 (without the .msh extension).
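
For illustration, the mesh entry in an flml looks roughly like this (a sketch from memory; the mesh name "CoordinateMesh" and the exact element layout are assumptions, so check your own file):

<geometry>
  <mesh name="CoordinateMesh">
    <from_file file_name="mesh/Rectangular_Pipe_7">
      <format name="gmsh"/>
    </from_file>
  </mesh>
</geometry>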

You may also get this error if the mesh format is incorrect (if I remember correctly). I generally used the Gmsh v2.2 format; the mesh file says so at the top:

$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
70344

So check that too.
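
A quick way to check both points from the test directory (assuming a POSIX shell, and that the mesh has already been generated):

head -3 mesh/Rectangular_Pipe_7.msh        # should print the $MeshFormat header shown above
grep Rectangular_Pipe_7 Stokes-Poiseuille.flml   # shows how the flml references the mesh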

HTH, J


Patol75 commented 3 years ago

To run this test in serial, from within the test's directory, execute:

make input
fluidity -v2 -l Stokes-Poiseuille.flml

Have a look at fluidity -h for the meaning of the options used above. To run the test in parallel on N cores, use:

make input
mpirun -c N flredecomp -v -l -i 1 -o N Stokes-Poiseuille foo
mpirun -c N fluidity -v2 -l foo.flml
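
Since you are submitting through Slurm, a minimal batch script built from the commands above could look like this (a sketch only; the task count is a placeholder and the test path is taken from your report, so adjust both to your setup):

#!/bin/bash
#SBATCH --ntasks=16                        # N, the number of MPI tasks

cd /home/user1/tests/Stokes_Poiseuille     # test directory from the report above
make input                                 # generates the test inputs, including the mesh
mpirun -n "$SLURM_NTASKS" flredecomp -v -l -i 1 -o "$SLURM_NTASKS" Stokes-Poiseuille foo
mpirun -n "$SLURM_NTASKS" fluidity -v2 -l foo.flml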

100souci commented 3 years ago

Thanks for your comments. They were very useful for our user.