adjtomo / seisflows

An automated workflow tool for full waveform inversion and adjoint tomography
http://seisflows.readthedocs.org
BSD 2-Clause "Simplified" License

NPROC > 1 not working #194

Open raulleoncz opened 4 months ago

raulleoncz commented 4 months ago

Hello Mr. @bch0w,

I'm trying to use the MPI option for nproc>1. I already compiled specfem2d using FC=ifort, CC=icc and MPIFC=mpiifort but I'm getting this error:

[Screenshot: error message, 2024-02-27]

The parameters that I'm using for the simulation are:

Can you help me understand why I'm getting this error? P.S. I also tried to run example 1 with nproc 4 and I got this error:

[Screenshot: error message]

I hope you can help me. Thanks.

bch0w commented 4 months ago

Hi @raulleoncz, sorry you're having issues with the example problem, thanks for providing the error messages.

Starting with the second issue, I think this is coming from an update to the SPECFEM2D parameter file that has broken one of the functionalities used in the example (related #196). I'll have to make an update to the code to fix this, sorry!

Regarding your first issue: it seems like there is some trouble reading your SPECFEM model. If I am reading the error message correctly, all 20 parts of the model file may be empty? Are you able to check the outputs of meshfem/specfem to make sure they ran properly?
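For reference, something like the following rough sketch could be used for that check (this is not SeisFlows code; the directory and the proc######_<parameter>.bin naming are assumptions based on standard SPECFEM2D binary output, so adjust them to your setup):

```python
# Rough sketch: verify that every per-processor model file written by
# xmeshfem2D/xspecfem2D exists and is not empty. MODEL_DIR and the
# filename pattern are assumptions; adjust to your working directory.
import os

MODEL_DIR = "specfem2d_workdir/DATA"   # hypothetical path
NPROC = 4

for par in ("vp", "vs", "rho"):
    for iproc in range(NPROC):
        fid = os.path.join(MODEL_DIR, f"proc{iproc:06d}_{par}.bin")
        if not os.path.exists(fid):
            print(f"MISSING: {fid}")
        elif os.path.getsize(fid) <= 8:
            # Fortran unformatted files carry two 4-byte record markers,
            # so 8 bytes or fewer means no actual model values were written
            print(f"EMPTY:   {fid}")
        else:
            print(f"OK:      {fid} ({os.path.getsize(fid)} bytes)")
```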

bch0w commented 4 months ago

Hi @raulleoncz, I think I fixed the second issue you were seeing in #197 and the subsequent devel commits (if you are using devel branch). Can you please update and let me know if that solves that issue?

raulleoncz commented 4 months ago

Hi @bch0w, I already ran the example using the devel branch. I didn't get the previous error but I got this:

[Screenshot: error message, 2024-02-29]

bch0w commented 4 months ago

Hi @raulleoncz, whoops, sorry, there was a missing import statement there; I've added that to the latest commit (9c2c082). However, that code block was behind a Timeout error, so you would have encountered the following error message:

https://github.com/adjtomo/seisflows/blob/9c2c0824176712b56f72253bed53b032603ce771/seisflows/system/workstation.py#L256-L259

That suggests something may be going wrong with your forward simulation. Either increase tasktime to give the simulation time to finish, or check the output log in scratch/solver/mainsolver/fwd_solver.log to see if something is going wrong with SPECFEM2D.
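For context, the behavior described above follows roughly this pattern (an illustrative sketch only, not the actual SeisFlows code linked above): the external solver is run as a subprocess, and both hitting the tasktime limit and a nonzero exit code are treated as failures.

```python
# Illustrative sketch of the failure modes described above: a simulation
# that runs past the task time limit raises a timeout, while a crashed
# solver surfaces as a nonzero exit code.
import subprocess

TASKTIME_S = 10 * 60  # hypothetical tasktime in seconds

cmd = "mpirun -n 4 bin/xspecfem2D"
with open("fwd_solver.log", "w") as f:
    try:
        subprocess.run(cmd, shell=True, check=True, timeout=TASKTIME_S,
                       stdout=f, stderr=subprocess.STDOUT)
    except subprocess.TimeoutExpired:
        print("Simulation hit the time limit; consider increasing tasktime")
    except subprocess.CalledProcessError as e:
        print(f"Solver failed with exit code {e.returncode}; check fwd_solver.log")
```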

raulleoncz commented 4 months ago

Hi @bch0w. You are right: after modifying the 'tasktime' in example 1, the simulation runs without any trouble.

On the other hand, regarding the first error I showed above, I have checked the .bin files produced when running with nproc > 1. Fortunately, SPECFEM provides a Python script to visualize the 'proc000....bin' files, and those files look correct.

--- Update ---

I have been checking the example's files, and the first thing I noticed is that xmeshfem2D was run with MPI (the first thing I was doing differently): "mpirun -n $NPROC ./bin/xmeshfem2D", which produces files like mesh0000{number of proc}_{variable}.vtk

Also, when looking at the mesher_log.txt we can see that the total number of elements was divided equally, meaning that each processor has (in the example case) 400 elements. Comparing my simulation with the example, I see that this condition is not being met; for example, my simulation has 58871, 56329, 57523 and 57677 elements per processor. Is it possible that this affects the simulation?
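One quick way to compare the partition, both across processors and between the initial and target models, is to count the values in each per-processor file (a sketch; the MODEL_INIT/MODEL_TRUE directory names are placeholders for wherever the two models actually live):

```python
# Sketch: count float32 values per processor file for two model directories
# so the partitions can be compared side by side. Counts include the two
# Fortran record markers per file, which is fine for a relative comparison.
import glob

import numpy as np

for model_dir in ("MODEL_INIT", "MODEL_TRUE"):   # hypothetical directory names
    counts = [np.fromfile(fid, dtype="float32").size
              for fid in sorted(glob.glob(f"{model_dir}/proc*_vp.bin"))]
    print(model_dir, counts)
```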

Thanks for the help.

raulleoncz commented 4 months ago

Based on the last idea, I ran forward simulations using a .xyz file, looking for an equal distribution of the spectral elements, literally running xmeshfem2D again and again. After getting the same number of elements in both the init and true models, I submitted the job, and the first time I got this error:

The external numerical solver has returned a nonzero exit code (failure). Consider stopping any currently running jobs to avoid wasted computational resources. Check 'scratch/solver/mainsolver/fwd_solver.log' for the solvers stdout log message. The failing command and error message are:

exc: mpirun -n 4 bin/xspecfem2D
err: Command 'mpirun -n 4 bin/xspecfem2D' returned non-zero exit status 2.

[Screenshot: error message]
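One way to see SPECFEM2D's own error output, rather than only the wrapped exit-status message, is to re-run the failing command by hand inside the solver directory (a sketch; the command and paths are taken from the error message above, adjust if your working directory differs):

```python
# Sketch: reproduce the failing solver call directly so SPECFEM2D's own
# stdout/stderr are visible instead of only "non-zero exit status 2".
import subprocess

result = subprocess.run(
    "mpirun -n 4 bin/xspecfem2D",
    cwd="scratch/solver/mainsolver",
    shell=True, capture_output=True, text=True,
)
print("exit code:", result.returncode)
print(result.stdout[-2000:])   # tail of the solver's stdout
print(result.stderr[-2000:])
```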

bch0w commented 3 months ago

Hi @raulleoncz, sorry for the slow response here, I'm still trying to figure out the exact issue you're facing.

> I have been checking the example's files, and the first thing I noticed is that xmeshfem2D was run with MPI (the first thing I was doing differently): "mpirun -n $NPROC ./bin/xmeshfem2D", which produces files like mesh0000{number of proc}_{variable}.vtk

When you run SPECFEM with nproc > 1 then it is natural for the mesh and simulation to be split over many processors, so this seems fine and expected.

> Also, when looking at the mesher_log.txt we can see that the total number of elements was divided equally, meaning that each processor has (in the example case) 400 elements. Comparing my simulation with the example, I see that this condition is not being met; for example, my simulation has 58871, 56329, 57523 and 57677 elements per processor. Is it possible that this affects the simulation?

I suspect something is going wrong with meshfem or specfem. Do you mind sharing the following log files? You can probably attach them to your message directly or in a zip file; that would help diagnose the problem.

raulleoncz commented 3 months ago

Hello @bch0w, I'm sorry for my slow response.

I was trying to run the simulations again, but I wasn't able to get the same mesh partition. The error that I got is the same as in the first image ("The array has an inhomogeneous shape"), and because of that I'm not able to add the log files.

Just to give more information, I tried the latest version of SPECFEM2D (devel branch, 8.1.0) and the version used in example 1. Both of them worked as they should, but when I wanted to run another SPECFEM example, say the "tomographic_ocean_model" example, I faced the same error. I don't know if the gcc, mpif90 and gfortran versions have something to do with it. Just in case, I'm using openmpi-gcc12 and fftw 3.3.10_0+gfortran.

Are you able to run the simulations with mpirun? Maybe I'm using a wrong version or configuration.

bch0w commented 3 months ago

Hi @raulleoncz, if I'm understanding correctly, this sounds more like a SPECFEM2D issue than a SeisFlows issue. Similarly the SeisFlows examples are really only configured to run a very specific SPECFEM2D problem so there is no guarantee that switching to a different example will work. I'd encourage you to open an issue with SPECFEM (https://github.com/SPECFEM/specfem2d/issues) and hopefully you can get some more targeted feedback.

raulleoncz commented 2 months ago

Hello @bch0w, I am really sorry for my late response. I have been investigating, and what I found is that after dividing the elements, each processor has its own number of elements, and what is happening is that we are effectively trying to combine arrays of different shapes. I found an explanation of that here: https://www.golinuxcloud.com/setting-an-array-element-with-a-sequence/

In one of my examples, I was using 4 processors with 58871, 56329, 57523 and 57677 elements per processor, and it did not work. I already tried to run one example with 2 processors and it worked, because the number of elements was the same on both processors. I will attach the fwd_mesher.log and fwd_solver.log below. [Attachments: fwd_mesher.log, fwd_solver.log]
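For what it's worth, the NumPy error in the screenshot can be reproduced in isolation: stacking per-processor arrays of different lengths into a single ndarray fails in recent NumPy. A sketch using the element counts reported above:

```python
# Minimal reproduction of the error in question: combining per-processor
# arrays of different lengths into one ndarray fails in NumPy >= 1.24 with
# "setting an array element with a sequence. The requested array has an
# inhomogeneous shape ...". Counts are the ones reported above.
import numpy as np

counts = [58871, 56329, 57523, 57677]      # unequal partition -> raises ValueError
# counts = [57600, 57600, 57600, 57600]    # equal partition   -> works

per_proc = [np.zeros(n, dtype="float32") for n in counts]
model = np.array(per_proc)                 # raises ValueError when the list is ragged
print(model.shape)                         # only reached when all counts are equal
```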

Also, I have added pictures of the domain of each processor: [two screenshots of the per-processor domains]

I hope this information can be useful...