amusecode / amuse

Astrophysical Multipurpose Software Environment. This is the main repository for AMUSE
http://www.amusecode.org
Apache License 2.0

Recommendations for Running Amuse in a Shared HPC/SLURM Environment #234

Closed frenchwr closed 2 years ago

frenchwr commented 6 years ago

Hello, I help manage a large compute cluster for researchers at a university, and I have a user who is attempting to run amuse on our cluster with MPI enabled. We use SLURM as a job scheduler and have OpenMPI 1.10.3 installed (as well as all the other dependencies required by amuse).

What is the recommended way to build and run amuse in this sort of context? I see several different possible approaches in the documentation online:

http://amusecode.org/wiki/cosmocomp

http://amusecode.org/doc/reference/cartesius.html

http://www.amusecode.org/doc/reference/distributed.html

I have attempted the following:

module load GCC/5.4.0-2.26
module load git/2.12.2
module load OpenMPI/1.10.3
module load Python/2.7.12
module load FFTW/3.3.4
module load matplotlib/1.5.3-Python-2.7.12
module load numpy/1.12.1-Python-2.7.12
module load mpi4py/2.0.0-Python-2.7.12
git clone git@github.com:amusecode/amuse.git
cd amuse
./configure MPIEXEC=/usr/scheduler/slurm/bin/srun
make

Most of the packages build without issue, but each attempt I have made to run with MPI has failed. Note that I have not attempted to build amuse in distributed mode yet (3rd link above). When I run a simple script like the following:

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --ntasks=8
#SBATCH --mem=10G
#SBATCH --nodes=1

module restore amuse
srun -n 1 amuse.sh examples/simple/hrdiagram.py

I get the following output:

[vmp588:26517] [[36800,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at line 1100
Traceback (most recent call last):
  File "examples/simple/hrdiagram.py", line 60, in <module>
    temperatures, luminosities = simulate_evolution_tracks()
  File "examples/simple/hrdiagram.py", line 20, in simulate_evolution_tracks
    stellar_evolution = SSE()
  File "/gpfs22/home/frenchwr/tmp/amuse/src/amuse/community/sse/interface.py", line 211, in __init__
    InCodeComponentImplementation.__init__(self, SSEInterface(**options), **options)
  File "/gpfs22/home/frenchwr/tmp/amuse/src/amuse/community/sse/interface.py", line 26, in __init__
    CodeInterface.__init__(self, name_of_the_worker="sse_worker", **options)
  File "/gpfs22/home/frenchwr/tmp/amuse/src/amuse/rfi/core.py", line 711, in __init__
    self._start(name_of_the_worker = name_of_the_worker, **options)
  File "/gpfs22/home/frenchwr/tmp/amuse/src/amuse/rfi/core.py", line 739, in _start
    self.channel.start()
  File "/gpfs22/home/frenchwr/tmp/amuse/src/amuse/rfi/channel.py", line 1525, in start
    self.intercomm = MPI.COMM_SELF.Spawn(command, arguments, self.number_of_workers, info=self.info)
  File "MPI/Comm.pyx", line 1559, in mpi4py.MPI.Intracomm.Spawn (src/mpi4py.MPI.c:113260)
mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error
srun: error: vmp588: task 0: Exited with exit code 1

Note that srun is SLURM's mpiexec wrapper.

ipelupessy commented 6 years ago

This looks like an MPI spawn issue. You could try running a basic mpi4py spawn example, or check whether AMUSE works with the sockets channel: adapt amuserc.example by adding

channel_type=sockets

in the channel section and saving it as amuserc. Sometimes (mainly on supercomputers) the MPI implementation lacks MPI spawn, but that shouldn't be the case here; a mismatch between mpi4py and the MPI library is also a possibility...
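
For reference, here is a minimal sketch of such a spawn test (the file name spawn_test.py and the single-file parent/child layout are just one way to write it, not something AMUSE ships):

# spawn_test.py - hypothetical minimal test of MPI_Comm_spawn via mpi4py,
# the mechanism AMUSE uses to start its community-code workers
import sys
from mpi4py import MPI

if len(sys.argv) > 1 and sys.argv[1] == "child":
    # child side: report back to the parent over the intercommunicator
    parent = MPI.Comm.Get_parent()
    parent.send(MPI.COMM_WORLD.Get_rank(), dest=0, tag=0)
    parent.Disconnect()
else:
    # parent side: spawn two copies of this script with MPI_Comm_spawn
    comm = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__, "child"],
                               maxprocs=2)
    for i in range(2):
        print("child reported rank %d" % comm.recv(source=i, tag=0))
    comm.Disconnect()

If running this the same way (e.g. srun -n 1 python spawn_test.py) fails with a similar ORTE/spawn error, the problem is in the MPI stack or its launch configuration rather than in AMUSE itself.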

computerworkers commented 5 years ago

In case this helps anybody: we manage a cluster with SLURM and EasyBuild, and we were getting the error reported above by @frenchwr when using EasyBuild v3.9.1 (modules foss/2018a and Python/2.7.14-foss-2018a), SLURM v18.08, and AMUSE v12 built on top of foss/2018a. It turned out that Python/2.7.14-foss-2018a came with its own mpi4py installed, which did not work well with the foss/AMUSE OpenMPI. So we reinstalled mpi4py as a separate EasyBuild module on top of the foss toolchain. Here is an example easyconfig file:

# computerworkers @ strw leiden
easyblock = 'PythonPackage'

name = 'mpi4py'
version = '3.0.0'
pyver='2.7.14'
pyshortver = '2.7'

versionsuffix = '-Python-2.7.14'

homepage = 'https://bitbucket.org/mpi4py/mpi4py'
description = """MPI for Python (mpi4py) provides bindings of the Message Passing Interface (MPI) standard for
 the Python programming language, allowing any Python program to exploit multiple processors."""

toolchain = {'name': 'foss', 'version': '2018a'}

source_urls = [PYPI_SOURCE]
sources = [SOURCE_TAR_GZ]
checksums = [
    'b457b02d85bdd9a4775a097fac5234a20397b43e073f14d9e29b6cd78c68efd7',  # mpi4py-3.0.0.tar.gz
]

dependencies = [('Python', '%(pyver)s')]

# force rebuilding everything
buildopts = '--force'

sanity_check_paths = {
    'files': [],
    'dirs': ['lib/python%(pyshortver)s/site-packages/mpi4py'],
}

# basic sanity check: make sure mpi4py imports against the toolchain MPI
sanity_check_commands = [
    ('python', '-c "from mpi4py.MPI import Comm"'),
]

moduleclass = 'lib'

On our cluster this resulted in the following working pseudo-workflow:

module load amuse
module load mpi4py
srun --mem=16000 --pty bash -i
# read http://amusecode.org/wiki/openmpi
export OMPI_MCA_rmaps_base_oversubscribe=yes
mpirun python amuse_example.py
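
As a quick check of which mpi4py build and which MPI library Python actually picks up after the module loads (not part of the workflow above, just a suggestion; check_mpi4py.py is a made-up name):

# check_mpi4py.py - hypothetical helper to verify the mpi4py/MPI pairing
import mpi4py
from mpi4py import MPI

print("mpi4py loaded from: " + mpi4py.__file__)
print("MPI library: " + MPI.Get_library_version().strip())

If the printed path still points into the Python module's own site-packages rather than the separate mpi4py module, the conflicting build is still being picked up.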

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 28 days if no further activity occurs. Thank you for your contributions.

ipelupessy commented 2 years ago

closing, we'll deal with issues as they crop up on new machines