OPM / LBPM

Pore scale modelling
https://lbpm-sim.org/
GNU General Public License v3.0

The Open MPI OFI MTL is aborting the MPI job #48

Open ahmedsrizk95 opened 3 years ago

ahmedsrizk95 commented 3 years ago

Hello Dr. James,

I have found this error to occur when running on CPU:

>  mtl_ofi.h:101: Error returned from fi_cq_readerr: Resource temporarily unavailable(-11).
> *** The Open MPI OFI MTL is aborting the MPI job (via exit(3)).
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID ####### on node ####### exited on signal 6 (Aborted).
> --------------------------------------------------------------------------

I have found that we can avoid it by adding the following flag:

-mca mtl ^ofi

However, with this flag the code does not run on more than 2 CPU cores: even if more than 2 are specified, only 2 are working. Why does this problem happen?
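
A practical wrinkle with the flag above is quoting: `^` is special in some shells, so it can be safer to pass the arguments as a list rather than as one command string. The following is a minimal sketch, not the project's own launcher. The helper name `build_mpirun_command` and the input file name `input.db` are hypothetical, and the executable name is only a placeholder; `-np` and `--mca` are standard Open MPI `mpirun` options.

```python
import shlex


def build_mpirun_command(n_ranks, executable, input_file, excluded_mtl="ofi"):
    """Return an mpirun argument list that excludes one MTL component.

    Hypothetical helper for illustration; it only builds the list and
    does not launch anything.
    """
    return [
        "mpirun",
        "-np", str(n_ranks),                 # number of MPI ranks requested
        "--mca", "mtl", f"^{excluded_mtl}",  # '^' means "exclude this component"
        executable,
        input_file,
    ]


# Placeholder executable and input file names (assumptions).
cmd = build_mpirun_command(4, "lbpm_permeability_simulator", "input.db")

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

To actually launch the job, the list could be handed to `subprocess.run(cmd, check=True)`, which bypasses the shell and therefore any special handling of `^`.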

JamesEMcClure commented 3 years ago

How many CPU cores do you have available on your system? Usually you can get this by running the command:

cat /proc/cpuinfo

I will also need to know the version of MPI that you are using. For troubleshooting it is often useful to look at the output of ompi_info. More specifically, in this case you might check

ompi_info | grep mtl

to see what options are available based on how Open MPI was installed on your system.
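
The two checks above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the sample text is made up rather than captured from a real system, and the `MCA mtl: <name> (...)` line layout is assumed from typical `ompi_info` output. Note that `/proc/cpuinfo` lists logical processors, which can exceed the number of physical cores when hyperthreading is enabled.

```python
import re


def count_processors(cpuinfo_text):
    """Count 'processor' entries in /proc/cpuinfo-style text (logical CPUs)."""
    return len(re.findall(r"^processor\s*:", cpuinfo_text, flags=re.MULTILINE))


def list_mtl_components(ompi_info_text):
    """Return component names from lines such as 'MCA mtl: ofi (MCA v2.1.0, ...)'."""
    return re.findall(r"^\s*MCA mtl:\s*(\S+)", ompi_info_text, flags=re.MULTILINE)


# Illustrative sample text only, not output from a real machine.
sample_cpuinfo = (
    "processor\t: 0\nmodel name\t: example\n\n"
    "processor\t: 1\nmodel name\t: example\n"
)
sample_ompi_info = (
    "                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.0.3)\n"
    "                 MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v4.0.3)\n"
)

print(count_processors(sample_cpuinfo))
print(list_mtl_components(sample_ompi_info))
```

On a real system the inputs would come from `open("/proc/cpuinfo").read()` and from `subprocess.run(["ompi_info"], capture_output=True, text=True).stdout`. If the number of ranks passed to `mpirun` exceeds the processor count reported here, that is worth mentioning when reporting the problem.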

xu-kai-xu commented 2 years ago

@JamesEMcClure Hi James, I am running LBPM on an Ubuntu virtual machine. I can follow step 3, but at step 4 I have to limit the voxel size to 100 and use only one core. Even then, I get the following message:

--------------------------------------------------------------------------
[[19063,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: lbpm

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
********************************************************
Running Single Phase Permeability Calculation 
********************************************************
voxel length = 7.000000 micron 
voxel length = 7.000000 micron 
Input media: mask_water_flooded_water_and_oil.raw
Relabeling 3 values
oldvalue=0, newvalue =0 
oldvalue=1, newvalue =2 
oldvalue=2, newvalue =1 
Dimensions of segmented image: 601 x 594 x 1311 
Reading 16-bit input data 
Read segmented data from mask_water_flooded_water_and_oil.raw 
Label=0, Count=403551588 
Label=1, Count=41440830 
Label=2, Count=23026716 
Checkerboard pattern at z inlet for 10 layers, saturated with phase label=1 
Distributing subdomains across 1 processors 
Process grid: 1 x 1 x 1 
Subdomain size: 100 x 100 x 100 
Size of transition region: 0 
Media porosity = 0.003939 
Initialized solid phase -- Converting to Signed Distance function 
Domain set.
Create ScaLBL_Communicator 
Set up memory efficient layout 
Allocating distributions 
Setting up device map and neighbor list 
  MLPUS=16.448507 from rank 0
Initializing distributions 
Beginning AA timesteps, timestepMax = 20000 
********************************************************
     0.016099
     0.016091
-------------------------------------------------------------------
********************************************************
CPU time = 0.003456 
Lattice update rate (per core)= 15.609531 MLUPS 
Lattice update rate (total)= 15.609531 MLUPS 
********************************************************
Program abort called in file '/home/lbpm/lbpm/source/common/Database.cpp' at line 83:
   Variable Visualization was not found in database
Bytes used = 999736160
Stack Trace:
 [1] 0x55dd906dd8fe:  lbpm_permeability_simulator                                    _start
   [1] 0x55dd906dd83d:  lbpm_permeability_simulator                                      main
     [1] 0x55dd9082bdd8:  lbpm_permeability_simulator          ScaLBL_MRTModel::VelocityField()
       [1] 0x55dd907739c6:  lbpm_permeability_simulator  Database::getDatabase(std::string const&)
         [1] 0x55dd90772b62:  lbpm_permeability_simulator     Database::getData(std::string const&)
           [1] 0x55dd90836f02:  lbpm_permeability_simulator  StackTrace::Utilities::abort(std::string const&, std::string const&, int)
             [1] 0x55dd9083e5ea:  lbpm_permeability_simulator                   StackTrace::backtrace()
 [2] 0x7f1b156f5133:             libc.so.6                                     clone
   [2] 0x7f1b157f4609:       libpthread.so.0                                            pthread_create.c:478
     [1] 0x7f1b14a0eff6:    mca_pmix_pmix2x.so                                            pmix_progress_threads.c
     | [1] 0x7f1b15342ee1:     libopen-pal.so.40                                            epoll.c:409
     |   [1] 0x7f1b156f546e:             libc.so.6                                epoll_wait
     |     [1] 0x7f1b15800420:       libpthread.so.0                                            sigaction.c
     [1] 0x7f1b15302d66:     libopen-pal.so.40                                            opal_progress_threads.c
       [1] 0x7f1b1534e911:     libopen-pal.so.40                                            poll.c:167
         [1] 0x7f1b156e899f:             libc.so.6                                    __poll
           [1] 0x7f1b15800420:       libpthread.so.0                                            sigaction.c
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node lbpm exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Does this mean the installation was not successful, or is it because the Ubuntu virtual machine does not have enough cores? I get two outputs, 0.016099 and 0.016091, but not the final results.

JamesEMcClure commented 2 years ago

It is installed correctly; it just expects a Visualization database in the input file (by default, the permeability simulator writes the final velocity field to disk after the simulation completes). See this link for an additional summary:

https://lbpm-sim.org/userGuide/models/mrt/mrt.html

You should be able to fix this by adding the following lines to your input file:

Visualization {
}
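
For anyone scripting their input files, the same fix can be applied programmatically. The sketch below is a hypothetical helper, not part of LBPM: it only assumes that a section looks like `Name {` at the start of a line, as in the snippet above, and it appends an empty block when none is found.

```python
import re


def ensure_section(input_text, name="Visualization"):
    """Append an empty 'name { }' block unless one already starts a line."""
    pattern = rf"^\s*{re.escape(name)}\s*\{{"
    if re.search(pattern, input_text, flags=re.MULTILINE):
        return input_text  # section already present, leave the text untouched
    if input_text and not input_text.endswith("\n"):
        input_text += "\n"
    return input_text + f"{name} {{\n}}\n"


# Placeholder input file content; the existing section is illustrative only.
sample_input = "Domain {\n}\n"
patched = ensure_section(sample_input)
print(patched)
```

Calling `ensure_section` a second time on the patched text returns it unchanged, so the helper is safe to run repeatedly on the same file.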

xu-kai-xu commented 2 years ago

Thanks for that! @JamesEMcClure