amusecode / amuse

Astrophysical Multipurpose Software Environment. This is the main repository for AMUSE
http://www.amusecode.org
Apache License 2.0

Building and running with GPUs #1056

Open fredt00 opened 2 months ago

fredt00 commented 2 months ago

Hi,

I'm trying to get AMUSE up and running with GPUs, but haven't had any success. Specifically, I want petar and fastkick to run on GPUs. I've been using this script to build AMUSE:

#! /bin/bash

module purge
module load foss/2022a
module load CUDA/11.7.0
module load GSL/2.7-GCC-11.3.0
module load Miniconda3/23.1.0-1

#
# Change path below to where you want this installed:
#
export INST_DIR=$HOME/soft/amuse-gpu
#
#
export ENV_DIR=$INST_DIR/amuse-env

echo "This script will install AMUSE in the directory: $INST_DIR"
read -r -p "Are you sure? [y/N]" -n 1
echo 
if [[ "$REPLY" =~ ^[Yy]$ ]]; then

    mkdir -p $INST_DIR
    cd $INST_DIR

    conda create -y --prefix $ENV_DIR --copy python=3.10
    conda init bash
    conda activate $ENV_DIR

    conda install -y mpi4py docutils numpy pytest h5py matplotlib scipy astropy pandas seaborn

    # edit from here to instead install amuse from source so we can configure it with GPU eventually
    cd $INST_DIR
    git clone -b feature/galaxy-cluster https://github.com/fredt00/amuse.git
    cd amuse
    pip install --upgrade pip
    pip install -e . --no-cache-dir
    ./configure --enable-cuda 
    make framework
    make petar.code
    make bhtree.code
    make fastkick.code
    make halogen.code
    make hop.code
    make fi.code

fi

But it always fails at ./configure, complaining that configure: error: cannot find cuda runtime libraries in /apps/system/easybuild/software/CUDA/11.7.0/lib /apps/system/easybuild/software/CUDA/11.7.0/lib64.
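
For what it's worth, a quick check along these lines (just a throwaway snippet, using the module path from the error) would show where the module actually keeps libcudart, since configure apparently only looks in lib/ and lib64/:

# Throwaway check: list where libcudart actually lives under the CUDA module,
# since configure only searched lib/ and lib64/.
import glob
import os

cuda_tk = "/apps/system/easybuild/software/CUDA/11.7.0"
for pattern in ("lib/libcudart*", "lib64/libcudart*",
                "targets/*/lib/libcudart*", "targets/*/lib/stubs/libcuda*"):
    print(pattern, "->", glob.glob(os.path.join(cuda_tk, pattern)))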

This slightly convoluted installation seems to be the only way to get MPI working correctly; the non-GPU installation only works at runtime if I use Miniconda as above.

I tried running just ./configure and then manually editing config.mk to

CUDA_ENABLED=yes
NVCC=/apps/system/easybuild/software/CUDA/11.7.0/bin/nvcc
NVCC_FLAGS=
CUDA_TK=/apps/system/easybuild/software/CUDA/11.7.0
CUDA_LIBS=-L/apps/system/easybuild/software/CUDA/11.7.0/targets/x86_64-linux/lib/stubs -lcuda  -L/apps/system/easybuild/software/CUDA/11.7.0/lib64 -lcudart

And the GPU versions of the codes built successfully. However, when I then run them with this script:

#!/bin/bash -l

#SBATCH -J galaxy-cluster
#SBATCH -o galaxy-cluster.%J.out
#SBATCH -e galaxy-cluster.%J.err
#SBATCH --partition=devel
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
#SBATCH --gpus=4
#SBATCH --mem-per-cpu=4000
#SBATCH --time=00:10:00

export OMPI_MCA_rmaps_base_oversubscribe=yes
export OMPI_MCA_mpi_warn_on_fork=0
export OMP_STACKSIZE=128M
export OMP_NUM_THREADS=2
ulimit -s unlimited

module purge
module load foss/2022a
module load GSL/2.7-GCC-11.3.0
module load Miniconda3/23.1.0-1
conda activate /home/oxfd1327/soft/amuse-gpu/amuse-env
nvidia-smi
mpirun python -u $@

And I get the error:

/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py:964: UserWarning: MPI (unexpectedly?) not available, falling back to sockets channel
  warnings.warn("MPI (unexpectedly?) not available, falling back to sockets channel")

**********************************************************

mpiexec does not support recursive calls

**********************************************************
Traceback (most recent call last):
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/channel.py", line 1778, in accept_worker_connection
    return server_socket.accept()
  File "/home/oxfd1327/soft/amuse-gpu/amuse-env/lib/python3.10/socket.py", line 293, in accept
    fd, addr = self._accept()
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/oxfd1327/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 349, in <module>
    main(**o.__dict__)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 207, in main
    cluster = star_cluster(code=petar,code_converter=converter_petar, W0=W0, r_tidal=r_tidal,r_half=r_half, n_particles=N_cluster, M_cluster=M_cluster, field_code=FastKick,field_code_number_of_workers=1,code_number_of_workers=2)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/ext/derived_grav_systems.py", line 94, in __init__
    self.bound=code(self.converter, mode='gpu',number_of_workers=code_number_of_workers)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/community/petar/interface.py", line 409, in __init__
    petarInterface(**keyword_arguments),
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/community/petar/interface.py", line 38, in __init__
    CodeInterface.__init__(
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py", line 748, in __init__
    self._start(name_of_the_worker = name_of_the_worker, **options)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py", line 776, in _start
    self.channel.start()
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/channel.py", line 1962, in start
    self.socket, address = self.accept_worker_connection(server_socket, self.process)
  File "/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/channel.py", line 1782, in accept_worker_connection
    raise exceptions.CodeException('could not connect to worker, worker process terminated')
amuse.support.exceptions.CodeException: could not connect to worker, worker process terminated
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[27418,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Is there anything obviously wrong with this process? Any help would be greatly appreciated!

Cheers, Fred

LourensVeen commented 2 months ago

Hi Fred,

There are a few things that jump out at me in your scripts:

1. Neither of them loads an MPI module.
2. The compile script loads the CUDA module, but the run script doesn't.
3. You're starting Python using mpirun.

If MPI is available on the machine without a module and that's the one you want to use, then point 1. should be okay.

For 2., this could be causing you to use a different CUDA when compiling (the one from the module) than when running (some other version on the system), and that tends to cause problems. It's important to run in the same environment that you've compiled in.

I think 3. is the cause of the mpiexec does not support recursive calls message. AMUSE uses MPI in a different way than most applications: instead of having many copies of your script running in parallel, there's only one copy, which will dynamically create parallel community code instances as needed within your allocation. So you should start your script without mpirun; AMUSE will call mpirun itself if needed (it has other ways of starting the workers too).
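
To sketch the pattern (illustration only, reusing the FastKick field code and converter from your script; the scale values here are invented):

# Sketch: the script itself is started once, with a plain "python script.py",
# and each community code then spawns its own worker processes inside the
# allocation (via the MPI channel, or the sockets channel as a fallback).
from amuse.units import nbody_system, units
from amuse.community.fastkick.interface import FastKick

converter = nbody_system.nbody_to_si(1.0e6 | units.MSun, 10.0 | units.parsec)

# This call is where AMUSE starts the fastkick worker for you; no mpirun needed.
field_code = FastKick(converter, mode='gpu', number_of_workers=1)
print(field_code.parameters)
field_code.stop()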

LourensVeen commented 2 months ago

Oh, and about ./configure failing to detect the CUDA libraries, that's an interesting one. I'm currently working on the build system, and I've rewritten the CUDA detection logic because CUDA has changed over time and it could use an update. I'm going to check that the new system works with this directory layout, and if it doesn't, fix it.

Thanks for reporting this even if you've worked around it already; it's much better to fix things like this on the AMUSE side, where we can fix it for everyone else too.

fredt00 commented 2 months ago

Thanks for the advice! In terms of your suggestions:

  1. So foss/2022a is actually a bundle of modules; sorry, I should have provided the full list:
     1) GCCcore/11.3.0
     2) zlib/1.2.12-GCCcore-11.3.0
     3) binutils/2.38-GCCcore-11.3.0
     4) GCC/11.3.0
     5) numactl/2.0.14-GCCcore-11.3.0
     6) XZ/5.2.5-GCCcore-11.3.0
     7) libxml2/2.9.13-GCCcore-11.3.0
     8) libpciaccess/0.16-GCCcore-11.3.0
     9) hwloc/2.7.1-GCCcore-11.3.0
    10) OpenSSL/1.1
    11) libevent/2.1.12-GCCcore-11.3.0
    12) UCX/1.12.1-GCCcore-11.3.0
    13) libfabric/1.15.1-GCCcore-11.3.0
    14) PMIx/4.1.2-GCCcore-11.3.0
    15) UCC/1.0.0-GCCcore-11.3.0
    16) OpenMPI/4.1.4-GCC-11.3.0
    17) OpenBLAS/0.3.20-GCC-11.3.0
    18) FlexiBLAS/3.2.0-GCC-11.3.0
    19) FFTW/3.3.10-GCC-11.3.0
    20) gompi/2022a
    21) FFTW.MPI/3.3.10-gompi-2022a
    22) ScaLAPACK/2.2.0-gompi-2022a-fb
    23) foss/2022a

2 and 3 are both good points. I've removed mpirun and loaded CUDA in the run script, and now I just get the warning:

/home/oxfd1327/soft/amuse-gpu/amuse/src/amuse/rfi/core.py:964: UserWarning: MPI (unexpectedly?) not available, falling back to sockets channel
  warnings.warn("MPI (unexpectedly?) not available, falling back to sockets channel")

And my code runs, although I don't see any speed-up compared to when I configured without GPUs, so I'm wondering if it is configured correctly. Do you know of a way to confirm the GPU utilisation? Running nvidia-smi before the Python call shows that I am being allocated the requested GPUs, but I can't see any information about their usage with seff, for example.
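
I suppose I could poll nvidia-smi from a side process while the job runs, something like this untested sketch, but maybe there's a better way:

# Untested sketch: log GPU utilisation every few seconds while the AMUSE
# script runs (assumes nvidia-smi is on PATH on the compute node).
import subprocess
import time

def log_gpu_utilisation(interval_s=10, n_samples=30):
    for _ in range(n_samples):
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        print(result.stdout.strip(), flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu_utilisation()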

In my script petar is called with

self.bound=code(self.converter, mode='gpu',number_of_workers=code_number_of_workers)

Is this the correct way to get petar to use GPUs? I can't see any mention of GPUs in the petar interface files.

rieder commented 2 months ago

PeTar in AMUSE currently doesn't use the GPU; enabling it would require at least manually modifying the Makefile, and probably more modifications than that.

fredt00 commented 1 month ago

Ah OK, that makes sense. I've been trying to see if FastKick will run on the GPUs, but strangely I get this error every few bridge timesteps:

Traceback (most recent call last):
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 353, in <module>
    main(**o.__dict__)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/examples/fred/galaxy_cluster_master.py", line 268, in main
    integrator.evolve_model(time)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 598, in evolve_model
    return self.evolve_joined_leapfrog(tend, timestep)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 624, in evolve_joined_leapfrog
    self.kick_codes(timestep / 2.0)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 756, in kick_codes
    de += x.kick(dt)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 478, in kick
    self.kick_with_field_code(
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 516, in kick_with_field_code
    ax,ay,az=field_code.get_gravity_at_point(
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/couple/bridge.py", line 146, in get_gravity_at_point
    return code.get_gravity_at_point(radius, x, y, z)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 166, in __call__
    object = self.precall()
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 215, in precall
    return self.definition.precall(self)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 373, in precall
    transition.do()
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/state.py", line 123, in do
    self.method.new_method()()
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 168, in __call__
    result = self.method(*list_arguments, **keyword_arguments)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 170, in __call__
    result = self.convert_result(result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/methods.py", line 209, in convert_result
    return self.definition.convert_result(self, result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 682, in convert_result
    return self.handle_return_value(method, result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 614, in handle_as_unit
    unit.append_result_value(method, self, value, result)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 70, in append_result_value
    self.convert_result_value(method, definition, value)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 80, in convert_result_value
    definition.handle_errorcode(errorcode)
  File "/cosma/home/dp016/dc-thom14/soft/amuse-gpu/amuse/src/amuse/support/interface.py", line 586, in handle_errorcode
    raise exceptions.AmuseException(
amuse.support.exceptions.AmuseException: Error when calling 'commit_particles' of a '<class 'amuse.community.fastkick.interface.FastKick'>', errorcode is -3

It seems to happen randomly, but usually after the third bridge timestep. Any idea what could be causing this? Is it a GPU configuration problem? It doesn't seem to happen with mode='cpu'.
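
If it helps, I could try to isolate it outside the bridge with a stripped-down test along these lines (untested sketch; the converter scale values and particle number are made up):

# Untested sketch: repeatedly create a GPU FastKick instance, add field
# particles and query it, which is roughly what the bridge does for each kick.
import numpy
from amuse.units import nbody_system, units
from amuse.ic.plummer import new_plummer_model
from amuse.community.fastkick.interface import FastKick

converter = nbody_system.nbody_to_si(1.0e6 | units.MSun, 10.0 | units.parsec)
field_particles = new_plummer_model(10000, convert_nbody=converter)
eps = numpy.zeros(len(field_particles)) | units.parsec

for i in range(20):
    field = FastKick(converter, mode='gpu', number_of_workers=1)
    field.particles.add_particles(field_particles)
    ax, ay, az = field.get_gravity_at_point(
        eps, field_particles.x, field_particles.y, field_particles.z)
    print(i, ax[0].in_(units.m / units.s**2), flush=True)
    field.stop()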