Closed frenchwr closed 2 years ago
This looks like an MPI spawn issue. You could try running a basic mpi4py spawn example, or check whether AMUSE works with the sockets channel: adapt amuserc.example by adding
channel_type=sockets
in the channel section, then save it as amuserc. Sometimes (mainly on supercomputers) the MPI implementation lacks MPI spawn support, but that shouldn't be the case here. A mismatch between mpi4py and the MPI library is also a possibility...
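For reference, the relevant part of the resulting amuserc might look like the fragment below (the `[channel]` section name is assumed from amuserc.example; check your copy of that file):

```ini
[channel]
channel_type=sockets
```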
If this can be of help to anybody: we manage a cluster with SLURM and EasyBuild, and we were getting the error reported above by @frenchwr when using EasyBuild v3.9.1 (modules foss/2018a and Python/2.7.14-foss-2018a), SLURM v18.08, and amuse v12 built on top of foss/2018a. It turned out that Python/2.7.14-foss-2018a came with its own mpi4py installed, and that did not work well with the foss/amuse OpenMPI. So we re-installed mpi4py as a separate EasyBuild module on top of the foss toolchain. Here follows an example eb config file:
# computerworkers @ strw leiden
easyblock = 'PythonPackage'
name = 'mpi4py'
version = '3.0.0'
pyver = '2.7.14'
pyshortver = '2.7'
versionsuffix = '-Python-2.7.14'
homepage = 'https://bitbucket.org/mpi4py/mpi4py'
description = """MPI for Python (mpi4py) provides bindings of the Message Passing Interface (MPI) standard for
the Python programming language, allowing any Python program to exploit multiple processors."""
toolchain = {'name': 'foss', 'version': '2018a'}
source_urls = [PYPI_SOURCE]
sources = [SOURCE_TAR_GZ]
checksums = [
    'b457b02d85bdd9a4775a097fac5234a20397b43e073f14d9e29b6cd78c68efd7',  # mpi4py-3.0.0.tar.gz
]
dependencies = [('Python', '%(pyver)s')]
# force rebuilding everything
buildopts = '--force'
sanity_check_paths = {
    'files': [],
    'dirs': ['lib/python%(pyshortver)s/site-packages/mpi4py'],
}
# basic sanity check: the compiled MPI bindings import correctly
sanity_check_commands = [
    ('python', '-c "from mpi4py.MPI import Comm"'),
]
moduleclass = 'lib'
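A quick way to see which mpi4py a given Python environment actually resolves, which helps spot the shadowing problem described above (a generic diagnostic sketch, not AMUSE-specific):

```python
# Print where Python finds mpi4py, if anywhere. If this path points into a
# different installation than the MPI module you loaded, you likely have the
# same mismatch described above.
import importlib.util

spec = importlib.util.find_spec("mpi4py")
if spec is None:
    print("mpi4py not found on this Python")
else:
    print("mpi4py found at:", spec.origin)
```

Run this under each combination of loaded modules to confirm the expected mpi4py wins.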
On our cluster this resulted in the following working pseudo-workflow:
module load amuse
module load mpi4py
srun --mem=16000 --pty bash -i
# read http://amusecode.org/wiki/openmpi
export OMPI_MCA_rmaps_base_oversubscribe=yes
mpirun python amuse_example.py
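The interactive workflow above could also be submitted as a batch job; a hypothetical sketch (the memory value and script name are taken from the lines above, everything else is illustrative):

```shell
#!/bin/bash
#SBATCH --mem=16000
# load the modules built as described above
module load amuse
module load mpi4py
# read http://amusecode.org/wiki/openmpi
export OMPI_MCA_rmaps_base_oversubscribe=yes
mpirun python amuse_example.py
```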
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 28 days if no further activity occurs. Thank you for your contributions.
closing, deal with issues as they crop up on new machines
Hello, I help manage a large compute cluster for researchers at a university, and I have a user who is attempting to run amuse on our cluster with MPI enabled. We use SLURM as a job scheduler and have OpenMPI 1.10.3 installed (as well as all the other dependencies required by amuse).
What is the recommended way to build and run amuse in this sort of context? I see several different paths based on documentation online:
http://amusecode.org/wiki/cosmocomp
http://amusecode.org/doc/reference/cartesius.html
http://www.amusecode.org/doc/reference/distributed.html
I have attempted the following:
Most of the packages build without issue, but each attempt I have made to run with MPI has failed. Note that I have not attempted to build amuse in distributed mode yet (3rd link above). When I run a simple script like the following:
I get the following output:
Note that `srun` is SLURM's `mpiexec` wrapper.