geodynamics / pylith

PyLith is a finite element code for the solution of dynamic and quasi-static tectonic deformation problems.

PyLith running very slowly in parallel mode #440

Closed ktagen-sudo closed 2 years ago

ktagen-sudo commented 2 years ago

Hello,

We have built and installed PyLith on one of our HPC clusters. When running PyLith in parallel mode (multiple nodes), the job appears to become slower. Binding CPUs to cores did not seem to help either.

Some additional info about the system/setup: PyLith installer v2.2.2-2, Slurm scheduler, OpenMPI 3.1.5, gcc 10.2.0, x86_64 architecture, CentOS Linux 7.

Do you have any suggestions for this issue?

knepley commented 2 years ago


There are many variables here:

1) What type of problem are you running?

2) Is your job scheduled on multiple nodes, or actually one node?

3) If you are doing a solve, does the number of iterates stay the same?

Thanks,

 Matt


baagaard-usgs commented 2 years ago

Please provide more information about the type of problem you are running, the system architecture, and how you are distributing the simulation; for example, the number of nodes, the number of processes per node, the number of cores per node, and the CPU model.

ktagen-sudo commented 2 years ago

Thank you both for the prompt response! I have attached the parameter file OneLayerElastic.cfg that I am using below. I have 16 processors per node (2 nodes used during testing), and I have made sure to distribute 16 processes per node. I have 8 cores per socket and 1 thread per core. My CPU model is Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz.

Also, in order to see what was going on behind the scenes in parallel mode, I tried the following in interactive mode:

To request the node resources (interactive mode):

srun -s --mpi=pmi2 --ntasks-per-node=16 --nodes=2 --pty bash

To run PyLith:

pylith OneLayerElastic.cfg mymachine.cfg --nodes=32 --launcher.dry
salloc --nodes=2 --ntasks-per-node=16
mpirun --hostfile $PBS_NODEFILE -np 32 pylith/bin/mpinemesis ...

####### Parameter file: OneLayerElastic.cfg

[pylithapp]

# ----------------------------------------------------------------------
# PROBLEM SUMMARY
# Inflating/deflating penny-shaped crack in a uniform elastic
# half-space 
# ----------------------------------------------------------------------

# ----------------------------------------------------------------------
# Problem
# ----------------------------------------------------------------------

[pylithapp.timedependent]

#gravity_field = spatialdata.spatialdb.GravityField

[pylithapp.timedependent.implicit]

output = [domain, groundsurface, chamber_top, chamber_bot]

output.groundsurface = pylith.meshio.OutputSolnSubset
output.chamber_top = pylith.meshio.OutputSolnSubset
output.chamber_bot = pylith.meshio.OutputSolnSubset

[pylithapp.problem.formulation]

time_step = pylith.problems.TimeStepUniform

[pylithapp.problem.formulation.time_step]

total_time = 2.0*year
dt = 1.0*year

[pylithapp.timedependent.normalizer]
length_scale = 1.0*km

# relaxation_time = 1.0*year

# shear_modulus = 4.0e+10*Pa

# ----------------------------------------------------------------------
# Materials
# ----------------------------------------------------------------------

[pylithapp.problem]

materials = [elasticDomain]

[pylithapp.timedependent]
materials.elasticDomain = pylith.materials.ElasticIsotropic3D
#materials.chamber = pylith.materials.ElasticIsotropic3D

# Elastic Magma chamber
# [pylithapp.timedependent.materials.chamber]
# label = magma body
# id= 1
# db_properties = spatialdata.spatialdb.UniformDB
# db_properties.label = Properties for elastic magma chamber
# db_properties.values = [density, vs, vp]
# db_properties.data = [2500.0*kg/m**3,  4000.0*m/s,  6928.2*m/s]

# Elastic Half Space
[pylithapp.timedependent.materials.elasticDomain]
label = Elastic domain
id = 1
db_properties = spatialdata.spatialdb.UniformDB
db_properties.label = Properties for elastic half space
db_properties.values = [density, vs, vp]
db_properties.data = [2500.0*kg/m**3,  4000.0*m/s,  6928.2*m/s]

# db_initial_stress = spatialdata.spatialdb.SimpleDB
# db_initial_stress.label = Initial stress in domain
# db_initial_stress.iohandler.filename = spatialdb/initial_stress.spatialdb
# db_initial_stress.query_type = linear

# ----------------------------------------------------------------------
# BOUNDARY CONDITIONS 
# ----------------------------------------------------------------------
[pylithapp.timedependent]
bc = [chamber_top, chamber_bot, face_xpos, face_xneg, face_ypos, face_yneg, face_zneg]

bc.chamber_top = pylith.bc.Neumann
bc.chamber_bot = pylith.bc.Neumann

# ZeroDisp (Dirichlet) BCs on edges
[pylithapp.timedependent.bc.face_xpos]

bc_dof = [0]
label = face_xpos
db_initial.label = Zero displacement BC on +x face

[pylithapp.timedependent.bc.face_xneg]

bc_dof = [0]
label = face_xneg
db_initial.label = Zero displacement BC on -x face

[pylithapp.timedependent.bc.face_ypos]

bc_dof = [1]
label = face_ypos
db_initial.label = Zero displacement BC on +y face

[pylithapp.timedependent.bc.face_yneg]

bc_dof = [1]
label = face_yneg
db_initial.label = Zero displacement BC on -y face

[pylithapp.timedependent.bc.face_zneg]

bc_dof = [2]
label = face_zneg
db_initial.label = Zero displacement BC on -z face

# Magma body BCs (Neumann)

[pylithapp.timedependent.bc.chamber_top]
label = chamber_top

db_initial = spatialdata.spatialdb.UniformDB
db_initial.label = Amplitude of Neumann BC on chamber
db_initial.values  = [traction-shear-horiz, traction-shear-vert, traction-normal]
db_initial.data = [0.0*MPa, 0.0*MPa, -5.0*MPa]

quadrature.cell = pylith.feassemble.FIATSimplex
quadrature.cell.dimension = 2
quadrature.cell.quad_order = 2

[pylithapp.timedependent.bc.chamber_bot]
label = chamber_bot

db_initial = spatialdata.spatialdb.UniformDB
db_initial.label = Amplitude of Neumann BC on chamber
db_initial.values  = [traction-shear-horiz, traction-shear-vert, traction-normal]
db_initial.data = [0.0*MPa, 0.0*MPa, -5.0*MPa]

quadrature.cell = pylith.feassemble.FIATSimplex
quadrature.cell.dimension = 2
quadrature.cell.quad_order = 2

# ----------------------------------------------------------------------
# Output
# ----------------------------------------------------------------------

[pylithapp.problem.formulation.output.domain]

vertex_data_fields = [displacement, velocity]
output_freq = time_step
time_step = 1.0*year

writer = pylith.meshio.DataWriterHDF5
writer.filename = output/OneLayerElastic.h5

[pylithapp.problem.formulation.output.groundsurface]

label = face_zpos

vertex_data_fields = [displacement, velocity]
output_freq = time_step
time_step = 1.0*year

writer = pylith.meshio.DataWriterHDF5
writer.filename = output/OneLayerElastic-groundsurf.h5

[pylithapp.problem.formulation.output.chamber_top]

label = chamber_top

vertex_data_fields = [displacement, velocity]
output_freq = time_step
time_step = 1.0*year

writer = pylith.meshio.DataWriterHDF5
writer.filename = output/OneLayerElastic-chamber_top.h5

[pylithapp.problem.formulation.output.chamber_bot]

label = chamber_bot

vertex_data_fields = [displacement, velocity]
output_freq = time_step
time_step = 1.0*year

writer = pylith.meshio.DataWriterHDF5
writer.filename = output/OneLayerElastic-chamber_bot.h5

baagaard-usgs commented 2 years ago

I found the following specs for the Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz: 8 cores (16 threads with hyperthreading) and 4 memory channels per socket.

If hyperthreading is turned on, it may look like you have 16 cores. Assuming the specs I found match your hardware, you are likely saturating the memory bandwidth once you use 4 MPI processes. Once you have saturated the memory bus, using additional processes will lead to negligible speedup and may even slow the computation down.

To identify the optimal distribution of MPI processes, I suggest starting with 1 MPI process per compute node, then trying 2, 4, and 8 processes per node, and comparing the runtimes. Based on the number of memory channels, 2 or 4 processes per node will likely produce the shortest runtime.
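A sketch of such a scan under Slurm (the job names and counts are placeholders to adapt; it assumes pylith is allowed to launch its own MPI job inside each allocation):

# Hypothetical Slurm scan over MPI processes per node; compare the wall times.
for ppn in 1 2 4 8; do
  sbatch --nodes=2 --ntasks-per-node=$ppn --job-name=scan_$ppn \
    --wrap "pylith OneLayerElastic.cfg --nodes=$((2 * ppn))"
done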

Your .cfg file doesn't show the PETSc solver parameters. The choice of solver parameters can have a significant impact on the overall runtime independent of the number of MPI processes. See the PyLith manual for recommended solver settings for various types of problems.
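For reference, a rough, illustrative sketch of what such settings can look like for an elasticity problem without faults; treat the values as placeholders and use the files in share/settings (and the manual) as the authoritative settings:

# Illustrative solver settings only -- see share/settings and the manual
# for the recommended values for each problem type.
[pylithapp.petsc]
# Algebraic multigrid preconditioner (available if PETSc was built with --download-ml).
pc_type = ml
ksp_type = gmres
ksp_rtol = 1.0e-10
ksp_atol = 1.0e-12
ksp_max_it = 500
ksp_converged_reason = true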

ktagen-sudo commented 2 years ago

Thanks a lot for the update! We will follow your recommendation. Also, below are some additional parameters that may be of interest. Do you see anything that should not be there in parallel mode?

Thanks again.

###################################

[pylithapp]

# ----------------------------------------------------------------------
# Journal
# ----------------------------------------------------------------------

[pylithapp.journal.info]

timedependent = 1
greensfns = 1
implicit = 1
petsc = 1
solverlinear = 1
meshiocubit = 1
implicitelasticity = 1
faultcohesivekin = 1
fiatlagrange = 1
pylithapp = 1
materials = 1

# ----------------------------------------------------------------------
# Mesh generator
# ----------------------------------------------------------------------

[pylithapp.mesh_generator]

reader = pylith.meshio.MeshIOCubit

[pylithapp.mesh_generator.reader]

filename = Mesh/OneLayerElastic.exo

# ----------------------------------------------------------------------
# PETSc
# ----------------------------------------------------------------------

[pylithapp.petsc]

pc_type = ilu
sub_pc_factor_shift_type = nonzero

ksp_rtol = 1.0e-15
ksp_atol = 1.0e-17
ksp_max_it = 1000
ksp_gmres_restart = 50

ksp_monitor = true
ksp_view = true
ksp_converged_reason = true
ts_type = beuler

log_summary = true

knepley commented 2 years ago

I would use

pc_type = lu

until your problem gets too big. Right now it will be the fastest.
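In the .cfg shown above, that would replace the existing preconditioner setting, for example:

# Sketch: direct LU factorization instead of ILU (PETSc's built-in LU is serial).
[pylithapp.petsc]
pc_type = lu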

ktagen-sudo commented 2 years ago

@knepley and @baagaard-usgs thank you so much! It is really helpful. We will let you know/keep you guys updated once we get something significant. Thank you again.

ktagen-sudo commented 2 years ago

Hi @knepley and @baagaard-usgs, I tried using pc_type = lu as well as setting 1 task per node, but it appears that there are still issues.

(1) The h5 and xmf files take a while to download, much longer than they should. (2) I eventually got the error:

[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html for possible LU and Cholesky solvers
[0]PETSC ERROR: Could not locate a solver package. Perhaps you must ./configure with --download-<package>
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.2, Jul, 01, 2019 

[0]PETSC ERROR: #1 MatGetFactor() line 4415 in /users/kfotso/build/pylith/petsc-pylith/src/mat/interface/matrix.c
[0]PETSC ERROR: #2 PCSetUp_LU() line 93 in /users/kfotso/build/pylith/petsc-pylith/src/ksp/pc/impls/factor/lu/lu.c
[0]PETSC ERROR: #3 PCSetUp() line 932 in /users/kfotso/build/pylith/petsc-pylith/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: #4 KSPSetUp() line 391 in /users/kfotso/build/pylith/petsc-pylith/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: #5 KSPSolve() line 723 in /users/kfotso/build/pylith/petsc-pylith/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: #6 void pylith::problems::SolverLinear::solve(pylith::topology::Field*, pylith::topology::Jacobian*, const pylith::topology::Field&)() line 132 in ../../../pylith-2.2.2/libsrc/pylith/problems/SolverLinear.cc

Are the solver choices different in parallel mode? I came across this thread, which explains that PETSc does not support parallel LU: https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2018-October/036336.html

Thanks again for your help!

baagaard-usgs commented 2 years ago

The solver options are different in serial and parallel. PETSc can use LU in parallel, but it requires additional external libraries. See the PyLith manual section on PETSc solver settings for recommended settings. These are also available in the share/settings directory.
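For instance, with a PETSc build that includes an external parallel direct solver such as SuperLU_DIST or MUMPS, a parallel LU setup might look like this sketch (option name as of PETSc 3.9+; earlier releases used pc_factor_mat_solver_package):

# Sketch: parallel LU via an external package (requires PETSc built with that package).
[pylithapp.petsc]
pc_type = lu
pc_factor_mat_solver_type = superlu_dist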

ktagen-sudo commented 2 years ago

@baagaard-usgs thank you for the prompt reply. I will do that.

ktagen-sudo commented 2 years ago

Hi,

I decided to install the following solver: https://petsc.org/main/docs/manualpages/Mat/MATSOLVERSUPERLU_DIST/ I downloaded https://github.com/xiaoyeli/superlu_dist/archive/v5.4.0.tar.gz, https://bitbucket.org/petsc/pkg-metis/get/v5.1.0-p5.tar.gz and https://bitbucket.org/petsc/pkg-scotch/get/6.0.6-p1.tar.gz

To reconfigure PyLith I used the options:

petsc_options="--download-chaco=1 --download-ml=1 --download-f2cblaslapack=1 --with-hdf5=1 --with-fc=0 --with-hwloc=0 --with-ssl=0 --with-x=0 --with-c2html=0 --with-lgrind=0 --download-superlu_dist=1 --download-parmetis=1 --download-metis=1 --download-ptscotch=1"

My configure line is the following:

$HOME/src/pylith/pylith-installer-2.2.2-2/configure --with-make-threads=8 --enable-autotools=yes --enable-openssl=yes --with-petsc-options="${petsc_options}" --prefix=$HOME/local/pylith_xena_0610

When running make, one of the tests failed:

PASS testutils (exit status: 0)
make[6]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/libtests/utils'
make[5]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/libtests/utils'
make[5]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/libtests'
make[5]: Nothing to be done for 'check-am'.
make[5]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/libtests'
make[4]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/libtests'
Making check in pytests
make[4]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests'
Making check in bc 
make[5]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
Making check in data
make[6]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc/data'
if [ "X../../../../../pylith-2.2.2" != "X../../../.." ]; then for f in tri3_disp.spatialdb tri3_vel.spatialdb tri3_tractions.spatialdb tri3.mesh elasticplanestrain.spatialdb; do /bin/sh /users/kfotso/build/pylith_xena_0610/pylith-2.2.2/aux-config/install-sh -c -m 644 ../../../../../pylith-2.2.2/unittests/pytests/bc/data/$f ../../../../unittests/pytests/bc/data; done; fi
make  check-am
make[7]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc/data'
make[7]: Nothing to be done for 'check-am'.
make[7]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc/data'
make[6]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc/data'
make[6]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make  testbc.py
make[7]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make[7]: Nothing to be done for '../../../../pylith-2.2.2/unittests/pytests/bc/testbc.py'.
make[7]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make  check-TESTS check-local
make[7]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make[8]: Entering directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
FAIL: testbc.py
============================================================================
Testsuite summary for PyLith 2.2.2 
============================================================================
# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See unittests/pytests/bc/test-suite.log
Please report to cig-short@geodynamics.org
============================================================================
make[8]: *** [Makefile:781: test-suite.log] Error 1
make[8]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make[7]: *** [Makefile:889: check-TESTS] Error 2
make[7]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make[6]: *** [Makefile:985: check-am] Error 2
make[6]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make[5]: *** [Makefile:673: check-recursive] Error 1
make[5]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests/bc'
make[4]: *** [Makefile:447: check-recursive] Error 1
make[4]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests/pytests'
make[3]: *** [Makefile:439: check-recursive] Error 1
make[3]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build/unittests'
make[2]: *** [Makefile:507: check-recursive] Error 1
make[2]: Leaving directory '/users/kfotso/build/pylith_xena_0610/pylith-build'
make[1]: *** [pylith] Error 2
make[1]: Leaving directory `/users/kfotso/build/pylith_xena_0610'
make: *** [installed_pylith] Error 2

I suspect that the error is related to nemesis and numpy. If I do ./testbc.py, I get the following error:


ImportError: numpy.core.multiarray failed to import
Traceback (most recent call last):
  File "./testbc.py", line 65, in <module>
    app.run()
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pythia-0.8.1.19-py2.7.egg/pyre/applications/Application.py", line 42, in run
    shell.run(*args, **kwds)
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pythia-0.8.1.19-py2.7.egg/pyre/applications/Shell.py", line 143, in run
    method(*args, **kwds)
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pythia-0.8.1.19-py2.7.egg/pyre/applications/Stager.py", line 19, in execute
    return self.main(*args, **kwds)
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pylith/tests/UnitTestApp.py", line 52, in main
    success = unittest.TextTestRunner(verbosity=2).run(self._suite()).wasSuccessful()
  File "./testbc.py", line 47, in _suite
    from TestDirichletBC import TestDirichletBC
  File "/users/kfotso/build/pylith_xena_0610/pylith-2.2.2/unittests/pytests/bc/TestDirichletBC.py", line 25, in <module>
    from pylith.bc.DirichletBC import DirichletBC
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pylith/bc/DirichletBC.py", line 26, in <module>
    from BoundaryCondition import BoundaryCondition
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pylith/bc/BoundaryCondition.py", line 32, in <module>
    from bc import BoundaryCondition as ModuleBoundaryCondition
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pylith/bc/bc.py", line 28, in <module>
    _bc = swig_import_helper()
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pylith/bc/bc.py", line 24, in swig_import_helper
    _mod = imp.load_module('_bc', fp, pathname, description)
ImportError: numpy.core.multiarray failed to import

If I launch the nemesis binary and type "import numpy", I get the following error:

ImportError: 
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: /users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/numpy-1.14.3-py2.7-linux-x86_64.egg/numpy/core/multiarray.so: undefined symbol: PyUnicodeUCS2_AsUTF8String

But if I do python testbc.py, the test runs fine. If I call python and try to import numpy, everything runs fine as well.
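A quick check of whether the two interpreters differ in their Unicode build (this assumes nemesis passes -c through to the underlying interpreter the way python does):

nemesis -c "import sys; print(sys.maxunicode)"
python -c "import sys; print(sys.maxunicode)"
# 65535 indicates a narrow (UCS2) build; 1114111 indicates a wide (UCS4) build.
# numpy must be compiled against an interpreter with the same setting, otherwise
# imports fail with undefined PyUnicodeUCS2/UCS4 symbols like the one above.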

Here is what is contained in my setup.sh file:

export PATH=/users/kfotso/local/pylith_xena_0610/bin:${PATH}
export LD_LIBRARY_PATH=/users/kfotso/local/pylith_xena_0610/lib
export PYTHONPATH=/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages

How do you think I can fix this? I have attached the make.log output file as well. make.log

baagaard-usgs commented 2 years ago

When you reconfigure with the installer and start from the beginning, you should first delete any previously failed builds/configurations. If you are installing to the same directory, then do rm -r $HOME/local/pylith_xena_0610 before running the configure for the PyLith installer. Also check to make sure your PATH, PYTHONPATH, and LD_LIBRARY_PATH are set to minimal values before doing source setup.sh. If they are set from the previous install and you are installing to the same location, then there is no reason to do source setup.sh again; doing so would simply add duplicate entries.

Addendum: I would also delete everything in the build directory for the previous build that failed.

ktagen-sudo commented 2 years ago

Thank you @baagaard-usgs ! I did try your suggestion, but I still get the same error for testbc.py unfortunately.

baagaard-usgs commented 2 years ago

This error is associated with inconsistencies in the environment or changes in the location of files. The only solution is to make sure previous builds and installs are deleted and that you start with an environment that points only to system directories (for example, PATH set to /bin:/usr/bin:/usr/sbin:/sbin, with PYTHONPATH and LD_LIBRARY_PATH unset). The only exception to this is if you are using a system MPI or other packages that are in different locations (this often happens on a cluster, with the system administrator installing MPI, etc., in places like /opt or /usr/local).

Make sure your environment remains the same during the build and when running PyLith.
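A sketch of what a minimal environment could look like before configuring (the OpenMPI path is a placeholder for wherever your cluster installs it):

export PATH=/bin:/usr/bin:/usr/sbin:/sbin
unset PYTHONPATH
unset LD_LIBRARY_PATH
# Add back only what the build genuinely needs, e.g. the cluster's OpenMPI:
export PATH=<path-to-openmpi>/bin:$PATH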

ktagen-sudo commented 2 years ago

Thank you! I tried to keep my environment as minimal as possible, but I still got the same error during the tests. I then rebuilt, and before building I edited the Makefile and added --enable-unicode=ucs4 under the Python configuration section. It then worked and passed all the tests.

But when calling pylith I get the error:

Traceback (most recent call last):
  File "/users/kfotso/local/pylith_xena_0610/bin/pylith", line 25, in <module>
    from pylith.apps.PyLithApp import PyLithApp
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pylith/apps/PyLithApp.py", line 23, in <module>
    from PetscApplication import PetscApplication
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pylith/apps/PetscApplication.py", line 24, in <module>
    from mpi import Application
  File "/users/kfotso/local/pylith_xena_0610/lib/python2.7/site-packages/pythia-0.8.1.19-py2.7.egg/mpi/__init__.py", line 14, in <module>
    from _mpi import *
ImportError: No module named _mpi

I can import numpy when I call nemesis now, which I clearly could not do before. I also made sure to keep my environment consistent between building and running PyLith.

What is your opinion on this?

I really appreciate your help, by the way.

baagaard-usgs commented 2 years ago

If all of the tests pass, then I think the environment used in the build is okay. Now you need to make sure the environment you use when running pylith exactly matches the environment you used to build.

When you get an error like this, try running:

nemesis
# The output should be something like
Python 2.7.XX
[gcc-X.X] on darwin
Type "help", "copyright", "credits" or "license" for more information.

# At the Python prompt `>>>` type `import _mpi` (if this succeeds you should get another Python prompt with no other output)
>>> import _mpi
>>>

ktagen-sudo commented 2 years ago

Thank you! I can confirm that my nemesis output is fine:

~/build/pylith_xena_0610$ nemesis
Python 2.7.5 (default, Nov 16 2020, 22:23:17) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import _mpi
>>> exit()

knepley commented 2 years ago

If that runs fine, then everything is there. Something must be wrong in the environment when you try to run PyLith.

ktagen-sudo commented 2 years ago

Thank you @baagaard-usgs and @knepley! I reviewed my environment to make sure that everything is correct, but so far no luck. Whenever I directly invoke "pylith", I still get the error "ImportError: No module named _mpi".

"pylith_eqinfo" and "pylith_genxdmf" run fine though.

"pylithinfo" shows the same _mpi error.

baagaard-usgs commented 2 years ago

pylith_eqinfo and pylith_genxdmf are pure Python with no MPI interaction. I think the problem is MPI related. Importing _mpi works, so the library paths seem to be correct. My hunch is that the PyLith application is invoking mpirun or mpiexec that is pointing to the wrong MPI. What is the output of which mpiexec and which mpirun? Do they point to the same MPI?

ktagen-sudo commented 2 years ago

Surprisingly, they seem to be pointing to the same MPI

kfotso@xena:~/data-pylith-941422$ which mpirun
/opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/openmpi-4.0.5-xt7dmkzqncf6of3sknaac4l7vht4bh2h/bin/mpirun
kfotso@xena:~/data-pylith-941422$ which mpiexec
/opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/openmpi-4.0.5-xt7dmkzqncf6of3sknaac4l7vht4bh2h/bin/mpiexec

What I have been doing is to sort of "wrap" nemesis around pylith in order to run it:

salloc --nodes=3 --ntasks-per-node=1 --cpus-per-task=8 nemesis ~/local/pylith_xena_0610/bin/pylith OneLayerElastic.cfg --nodes=3

Do you think it is okay to do that at all?

baagaard-usgs commented 2 years ago

Does ldd REPLACE_WITH_PATH_TO_PYLITH/lib/libpylith.so show the same MPI?

The pylith executable will automatically run nemesis, so don't invoke PyLith using nemesis. First, pylith invokes Python and does some minimal checking, and then it forks and starts the run in parallel. This means you should not start pylith on multiple processes yourself. See the PyLith manual for how to interact with a scheduler.
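A sketch of a direct invocation inside a Slurm allocation (the resource counts are placeholders; the manual's scheduler section describes tighter integration):

# Request an allocation, then start a single pylith process;
# pylith itself forks mpirun across the allocated resources.
salloc --nodes=3 --ntasks-per-node=8
pylith OneLayerElastic.cfg mymachine.cfg --nodes=24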

ktagen-sudo commented 2 years ago

Surprisingly, it again appears to show the same MPI; see below. And thanks for letting me know.

kfotso@xena:~/data-pylith-941422$ ldd ~/local/pylith_xena_0610/lib/libpylith.so
    linux-vdso.so.1 =>  (0x00007fffb7334000)
    libspatialdata.so.0 => /users/kfotso/local/pylith_xena_0610/lib/libspatialdata.so.0 (0x00002b810a749000)
    libpetsc.so.3.10 => /users/kfotso/local/pylith_xena_0610/lib/libpetsc.so.3.10 (0x00002b810a99f000)
    libsuperlu.so.5 => /users/kfotso/local/pylith_xena_0610/lib/libsuperlu.so.5 (0x00002b810bdbb000)
    libsuperlu_dist.so.5 => /users/kfotso/local/pylith_xena_0610/lib/libsuperlu_dist.so.5 (0x00002b810c043000)
    libhdf5_hl.so.100 => /users/kfotso/local/pylith_xena_0610/lib/libhdf5_hl.so.100 (0x00002b810c2cd000)
    libmetis.so => /users/kfotso/local/pylith_xena_0610/lib/libmetis.so (0x00002b810c4ee000)
    librt.so.1 => /usr/lib64/librt.so.1 (0x00002b810c75a000)
    libz.so.1 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/zlib-1.2.11-lfstuy3usiwwcqe7dmsuetgn4soh7gnf/lib/libz.so.1 (0x00002b810c962000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00002b810cb79000)
    libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002b810cd7d000)
    libpython2.7.so.1.0 => /users/kfotso/local/pylith_xena_0610/lib/libpython2.7.so.1.0 (0x00002b810cf80000)
    libnetcdf.so.15 => /users/kfotso/local/pylith_xena_0610/lib/libnetcdf.so.15 (0x00002b810d384000)
    libproj.so.13 => /users/kfotso/local/pylith_xena_0610/lib/libproj.so.13 (0x00002b810d675000)
    libhdf5.so.103 => /users/kfotso/local/pylith_xena_0610/lib/libhdf5.so.103 (0x00002b810d8ea000)
    libmpi_cxx.so.40 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/openmpi-4.0.5-xt7dmkzqncf6of3sknaac4l7vht4bh2h/lib/libmpi_cxx.so.40 (0x00002b810de98000)
    libmpi.so.40 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/openmpi-4.0.5-xt7dmkzqncf6of3sknaac4l7vht4bh2h/lib/libmpi.so.40 (0x00002b810e0b4000)
    libstdc++.so.6 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-4.8.5/gcc-10.2.0-3kjqvw7masskmxqtxblo5khyshwe6zuw/lib64/libstdc++.so.6 (0x00002b810e53e000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00002b810e90c000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00002b810ec0e000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00002b810ee2a000)
    libgcc_s.so.1 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-4.8.5/gcc-10.2.0-3kjqvw7masskmxqtxblo5khyshwe6zuw/lib64/libgcc_s.so.1 (0x00002b810f1f8000)
    /lib64/ld-linux-x86-64.so.2 (0x00002b810a13f000)
    libopen-rte.so.40 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/openmpi-4.0.5-xt7dmkzqncf6of3sknaac4l7vht4bh2h/lib/libopen-rte.so.40 (0x00002b810f410000)
    libopen-pal.so.40 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/openmpi-4.0.5-xt7dmkzqncf6of3sknaac4l7vht4bh2h/lib/libopen-pal.so.40 (0x00002b810f733000)
    libpmi2.so.0 => /lib/libpmi2.so.0 (0x00002b810fb8a000)
    libpmi.so.0 => /lib/libpmi.so.0 (0x00002b810fda2000)
    libhwloc.so.15 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/hwloc-2.4.0-uivl3wkdteubouxsgcy33kwhb4q7etqu/lib/libhwloc.so.15 (0x00002b810ffa8000)
    libevent_core-2.1.so.7 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/libevent-2.1.12-i4cbdb5tiutjjw6xqrxjobhydxol3dj4/lib/libevent_core-2.1.so.7 (0x00002b8110200000)
    libevent_pthreads-2.1.so.7 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/libevent-2.1.12-i4cbdb5tiutjjw6xqrxjobhydxol3dj4/lib/libevent_pthreads-2.1.so.7 (0x00002b8110435000)
    libresolv.so.2 => /usr/lib64/libresolv.so.2 (0x00002b8110638000)
    libslurm_pmi.so => /usr/lib/slurm/libslurm_pmi.so (0x00002b8110852000)
    libpciaccess.so.0 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/libpciaccess-0.16-4oz6hkg6iu7t7vgh6hlv5oppy5jdspqs/lib/libpciaccess.so.0 (0x00002b8110c0a000)
    libxml2.so.2 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/libxml2-2.9.10-4xxkldakgk47qw5ylozbtx7gtjb75qcs/lib/libxml2.so.2 (0x00002b8110e13000)
    liblzma.so.5 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/xz-5.2.5-zknipgfmx4723eztcdr7vrjpgco7nh7u/lib/liblzma.so.5 (0x00002b811117b000)
    libiconv.so.2 => /opt/spack/opt/spack/linux-centos7-haswell/gcc-10.2.0/libiconv-1.16-hg7ja2qsrugfxx2n3qrh22nchzggqlvv/lib/libiconv.so.2 (0x00002b81113a2000)

baagaard-usgs commented 2 years ago

This seems to point towards how you are starting PyLith in parallel. To verify your PyLith installation and environment, try running one of the examples in serial.
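For example (assuming the examples shipped with the PyLith 2.2 source tree; any bundled example will do):

# Run a bundled example with a single process, without mpirun or a scheduler.
cd pylith-2.2.2/examples/3d/hex8
pylith step01.cfg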

ktagen-sudo commented 2 years ago

@baagaard-usgs, can you clarify how I can run some PyLith examples in serial, given that calling "pylith" does not work?

baagaard-usgs commented 2 years ago

This is what I wanted to know. Above you showed trying to run pylith in parallel. I would first make sure it runs in serial. If it doesn't run with one process, then it is extremely unlikely to run with multiple processes.

  1. Do all of the full-scale tests pass when you run make check?
  2. If so, then pylith should work in serial on the computer where you built it if the environment matches the one used to configure the software. The full-scale tests actually run pylith.
  3. If the full-scale tests pass and you can't run an example in serial, then look for any differences in the environment between what gets embedded in the Makefile during configure and what you have when you try to run an example (a quick way to compare them is sketched below).
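A crude comparison (the build path is the one used earlier in this thread; adjust to your system):

grep -E '(PATH|PYTHONPATH|LD_LIBRARY_PATH) *=' ~/build/pylith_xena_0610/Makefile
env | grep -E '^(PATH|PYTHONPATH|LD_LIBRARY_PATH)='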

ktagen-sudo commented 2 years ago

  1. All full-scale tests pass when running make check in the "pylith-build" directory and in the "build" directory. I have attached the test results from the make check in the "pylith-build" directory.

  2. I looked at the differences in environment and tried to reproduce, as closely as possible, the environment contained in the Makefile, without any luck.

  3. I rebuilt it just to make sure that I did not miss anything that went wrong, but the configuration remains the same.

  4. I finally compared the "pylith" file that I built with the "pylith" file from a previous build that did not include all those parallel solver packages. I also compared it with the "pylith" file from the PyLith package that is meant to be used on a single node only (serial).

    I noticed a major difference: my new build has the following at the top:

#!/users/kfotso/pylith/bin/python2

The pylith files from those other builds have the following at the top:

#!/users/kfotso/pylith/bin/nemesis

Should I just change that line, and will I be fine?

As to why this change appeared: right before my build I had to add the following to my setup.sh:

export PYTHON=/users/kfotso/local/pylith/bin/python2

If I did not add that, I would get "fatal error: Python.h: No such file or directory". make_check_pylit_061722.log

baagaard-usgs commented 2 years ago

Yes, use nemesis instead of python2.

Setting PYTHON is definitely not the correct solution.

ktagen-sudo commented 2 years ago

Thank you @baagaard-usgs! I just did it.

ktagen-sudo commented 2 years ago

Thank you both so much for your help. I really appreciate it. I do not have any questions at the moment, so I will go ahead and close this issue.