FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU accelerated agent based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

Distributed Ensemble (MPI Support) #1090

Closed Robadob closed 9 months ago

Robadob commented 1 year ago

The implementation of MPI Ensembles within this PR is designed for each CUDAEnsemble to have exclusive access to all of the GPUs available to it (or those specified with the devices config). Ideally a user will launch 1 MPI worker per node, although it could also be 1 worker per GPU per node.

It would be possible to use MPI shared-memory groups to identify workers on the same node and negotiate the division of GPUs, and/or for some workers to become idle; however, this has not been implemented.
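
For reference, a minimal sketch of how that negotiation could begin, using a shared-memory communicator split (an illustration of the approach only, not code from this PR):

    // Illustration only (not implemented in this PR): identify workers that
    // share a node via an MPI shared-memory communicator split.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    int local_rank = 0, local_size = 0;
    MPI_Comm_rank(node_comm, &local_rank);  // this worker's rank within its node
    MPI_Comm_size(node_comm, &local_size);  // number of workers sharing this node
    // local_rank / local_size could then be used to divide the node's GPUs,
    // or to decide that surplus workers should idle.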

Full notes of identified edge cases are in the below todo list.


Closes #1073 Closes #1114


Edit for visibility (by @ptheywood) - the next release notes need to make clear that this includes a breaking change: the return type of CUDAEnsemble::getLogs changes from a std::vector to a std::map.

Robadob commented 1 year ago

I've created a simple test case that will either run on a local node (if worker count <= device count) or across multiple nodes (I could probably extend this to ensure it's 1 worker per node, but that would require some MPI comms to set up the test).

The issue with MPI testing is that MPI_Init() and MPI_Finalize() can only be called once per process. Because CUDAEnsemble automatically cleans up and triggers MPI_Finalize(), which waits for all runners to also call it, a second MPI test case cannot be run.

Perhaps an argument for Pete's CMake test magic, as I understand that runs the test suite once per individual test. An alternative would be to add a backdoor that tells the Ensemble not to finalize when it detects tests (and add some internal finalize equivalent to ensure sync).

Requires discussion/agreement.

Simplest option would be to provide a CUDAEnsemble config to disable auto finalize, and expose a finalize wrapper to users.

The only possible use-case I can see for a distributed ensemble calling CUDAEnsemble::simulate() multiple times would be a large genetic algorithm. If we wish to support that, then it will be affected by this too.


Changes for err test

Add this to FLAMEGPU_STEP_FUNCTION(model_step)

    if (FLAMEGPU->getStepCounter() == 1 && counter%13==0) {
        throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
    }

Add this to the actual test body, adjust through Off, Slow, Fast.

        ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Fast;
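
Put together, a minimal sketch of such a test might look like the following (assumptions: counter is an int environment property varied per RunPlan, and model / plans come from the usual ensemble test setup):

    // Sketch only: "counter" is assumed to be an int environment property that
    // is varied per RunPlan, so roughly 1 in 13 runs throws on the first step.
    FLAMEGPU_STEP_FUNCTION(model_step) {
        const int counter = FLAMEGPU->environment.getProperty<int>("counter");
        if (FLAMEGPU->getStepCounter() == 1 && counter % 13 == 0) {
            throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
        }
    }

    // In the test body: select the error handling level under test.
    flamegpu::CUDAEnsemble ensemble(model);
    ensemble.Config().error_level = flamegpu::CUDAEnsemble::EnsembleConfig::Fast;  // adjust through Off, Slow, Fast
    ensemble.simulate(plans);  // with Fast, the first failing run should abort the ensemble
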
Robadob commented 1 year ago

https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/

Setting up MPI to run across mav+waimu seems a bit involved; probably better to try Bede. I would hope the fact it works on a single node is evidence that it will work, though.

Robadob commented 1 year ago

Had TestMPIEnsemble.local segfault at ~13/100 running on Waimea without MPI (at which point it should bypass MPI and just run as a normal CUDAEnsemble). Unable to reproduce; repeating the test passed. Possible rare race condition.

Robadob commented 1 year ago

Happy for this to be tested on Bede and merged whilst I'm on leave. Functionality should be complete, may just want to test on Bede and refine how we wish to test it (e.g. make it ctest exclusive and include error handling test).

Robadob commented 1 year ago

Had TestMPIEnsemble.local segfault at ~13/100 running on Waimea without MPI (at which point it should bypass MPI and just run as a normal CUDAEnsemble). Unable to reproduce; repeating the test passed. Possible rare race condition.

This happened a second time. Currently just throwing it through gdb over and over to try and catch it.

Curiously this second time was directly after a recompile, so possibly only occurs when GPUs have dropped into low power state?

Robadob commented 1 year ago

Caught it: a race condition when adding the run log (previously we had pre-allocated a vector, so no mutex was required).

ptheywood commented 1 year ago

I'll review this and test it on Bede while you're on leave, and try to figure out a decent way to test it (and maybe move MPI_Finalize to cleanup or similar, though again that would mean it can only be tested once).

Robadob commented 1 year ago

As discussed with @ptheywood (on Slack), will move MPI_Finalize() to cleanup() and replace it with MPI_Barrier() (to ensure synchronisation before all workers leave the call to CUDAEnsemble::simulate()).

This will require adjustments to the documentation and tests.
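
A rough sketch of the agreed shape (illustrative only, not the final code):

    // At the end of CUDAEnsemble::simulate(): synchronise instead of finalising,
    // so simulate() can be called more than once per process.
    MPI_Barrier(MPI_COMM_WORLD);

    // Later, once per process (e.g. in the library's cleanup), finalise safely:
    int initialized = 0, finalized = 0;
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    if (initialized && !finalized) {
        MPI_Finalize();
    }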

ptheywood commented 1 year ago

Also need to consider how this will behave with telemetry: a flag indicating MPI use, the number of ranks(?), how to report the list of devices from each node, etc.

(this is a note mostly for me when I review this in the near future)

Robadob commented 1 year ago

I can throw in a telemetry envelope at the final barrier if desired, so rank 0 receives all GPU names.
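
For example, one way rank 0 could collect the device names at that point (a sketch of the approach only, not the PR's implementation; world_rank and world_size are assumed to be available):

    // Assumed sketch: fixed-width name buffers so a single MPI_Gather suffices.
    char local_names[256] = {0};
    // ... fill local_names with this rank's GPU name(s) ...
    std::vector<char> all_names;
    if (world_rank == 0)
        all_names.resize(static_cast<size_t>(world_size) * 256);
    MPI_Gather(local_names, 256, MPI_CHAR,
               all_names.data(), 256, MPI_CHAR, 0, MPI_COMM_WORLD);
    // Rank 0 can now append the gathered names to the telemetry payload.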

Robadob commented 1 year ago

I've now added MPI to the README requirements and ensured all tests pass both with local MPI and without MPI.

ptheywood commented 1 year ago

Should we expose world_size/rank to HostAPI? (I don't think it's necessary)

I agree it's not necessary to expose it ourselves; it's globally available, so anyone who needs it will be able to access it directly themselves (with appropriate guarding).

As there's only one rank per node (based on the docs PR), it doesn't help with uniqueness checks: the rank will currently be the same for all simulations within a node, so the run plan index or similar will still need to be used for uniqueness checks.
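
For example, a sketch of one way to give each run a unique identifier via its run plan (assumes a "run_index" environment property has been declared on the model; illustrative only):

    // Assumed sketch: tag each run with its index, since MPI rank alone cannot
    // distinguish simulations that execute on the same node.
    flamegpu::RunPlanVector plans(model, 100);
    for (unsigned int i = 0; i < plans.size(); ++i) {
        plans[i].setProperty<unsigned int>("run_index", i);  // "run_index" is an assumed env property
    }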

ptheywood commented 1 year ago

This currently does not compile for my current MPI + CUDA + GCC versions.

With CUDA 12.2, GCC 11.4.0 and OpenMPI 4.1.2.

There's no MPI coverage on CI, which might not even have caught this if it is version specific.

Current error is:

/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/simulation/detail/AbstractSimRunner.h(55): error: expression must have a constant value
                                                             (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned))))
                                                                          ^

The offending line is

            constexpr MPI_Datatype array_of_types[count] = {MPI_UNSIGNED, MPI_UNSIGNED, MPI_UNSIGNED, MPI_CHAR};

Removing the constexpr qualifier allows this to compile.
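
i.e. something like the following should be portable, since OpenMPI's datatype handles are cast pointers to global objects rather than compile-time constants:

    // const (not constexpr): MPI_Datatype values need not be constant expressions.
    const MPI_Datatype array_of_types[count] = {MPI_UNSIGNED, MPI_UNSIGNED, MPI_UNSIGNED, MPI_CHAR};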

Do you know which MPI / GCC / CUDA you compiled with previously where this worked?

We probably also need to pin down the oldest MPI we would support.

Robadob commented 1 year ago

Do you know which MPI / GCC / CUDA you compiled with previously where this worked?

Would be whatever my bashrc on waimu defaults to, I guess.

Robadob commented 1 year ago

as running it with MPI enabled means its internal validation doesn't work.

That's why there's a disable mpi config option ;)

[100%] Built target ensemble
rob@waimea:~/FLAMEGPU2/build$ mpirun -n 2 bin/Debug/ensemble
CUDAEnsemble completed 100 runs successfully!
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 450, calculated init 450
Ensemble result: 40244135200, calculated result 40244135200
Ensemble init: 450, calculated init 450
Ensemble result: 40244135200, calculated result 40244135200
Robadob commented 1 year ago

To be discussed:

Robadob commented 1 year ago

Agreed

Robadob commented 1 year ago

MPI file specific CI to test multiple MPI versions

I've created a new MPI workflow. However, as expected, it fails to install specific versions of mpich/openmpi via apt-get. Will wait for @ptheywood's return to advise on the best method to make them available (my natural next step would be to build from source).

add slurm script to docs.

I have semi-successfully run the MPI test suite on Bede.

#!/bin/bash

# Generic options:

#SBATCH --account=bdshe03 # Run job under project <project>
#SBATCH --time=0:10:0         # Run for a max of 10 mins

# Node resources:

#SBATCH --partition=gpu    # Choose either "gpu" or "infer" node type
#SBATCH --nodes=2          # Resources from two nodes
#SBATCH --gres=gpu:1       # 1 GPU per node

# Run commands:

# 1ppn == 1 process per node
bede-mpirun --bede-par 1ppn ./build/bin/Release/tests_mpi

Produces the intermingled log

Running main() from /users/robadob/fgpu2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
Running main() from /users/robadob/fgpu2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[gpu030.bede.dur.ac.uk:696907] pml_ucx.c:291  Error: Failed to create UCP worker
[gpu031.bede.dur.ac.uk:2817740] pml_ucx.c:291  Error: Failed to create UCP worker
[       OK ] TestMPIEnsemble.success (56576 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success (56500 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (50446 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.success_verbose (50446 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.error_off (50467 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_off (50467 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_slow (50463 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_slow (50462 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_fast (6057 ms)
[----------] 5 tests from TestMPIEnsemble (214011 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (214011 ms total)
[  PASSED  ] 5 tests.
[       OK ] TestMPIEnsemble.error_fast (6057 ms)
[----------] 5 tests from TestMPIEnsemble (213934 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (213934 ms total)
[  PASSED  ] 5 tests. 

Of note:

Robadob commented 1 year ago

Output from the updated ensemble example using mpirun -n 2 on waimu (I didn't do any special hacks to make sure GPUs are unique, but it still worked).

rob@waimea:~/FLAMEGPU2/build/bin/Debug$ mpirun -n 2 ./ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 218, calculated init 218
Ensemble result: 22162315712, calculated result 22162315712
Local MPI runner completed 51/100 runs.
Ensemble init: 232, calculated init 232
Ensemble result: 18081819488, calculated result 18081819488
Local MPI runner completed 49/100 runs.
Robadob commented 1 year ago

Status

mondus commented 1 year ago

Suggest to try with mvapich2 from Bede docs.

Robadob commented 1 year ago

Document how to run test suite with mpirun

Robadob commented 1 year ago

Suggest to try with mvapich2 from Bede docs.

When built with mvapich2 and executed using bede-mpirun, this error is received (~3 attempts with slight changes).

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Google suggests it's an MPI misconfiguration problem.

The mvapich2 mpirun docs aren't great, and it doesn't have a convenient parameter like OpenMPI's for 1 process per node. I can't work out the commands required to bypass bede-mpirun.

ptheywood commented 1 year ago

I've added a warning in CMake if using MPI and CMake < 3.20.1 to address #1114, which outputs the following.

CMake Warning at src/CMakeLists.txt:569 (message):
  CMake < 3.20.1 may result in link errors with FLAMEGPU_ENABLE_MPI=ON for
  some MPI installations.  Consider using CMake >= 3.20.1 to avoid linker
  errors.
ptheywood commented 11 months ago

Stanage znver3 (i.e. for the GPU nodes) includes OpenMPI, so we can use Stanage's A100 and H100 nodes for x86_64, single-node, up to 4 GPU MPI testing as it currently stands.

From an A100/H100 node, i.e. the following successfully configured and compiled from an H100 node:

module load OpenMPI/4.1.4-GCC-11.3.0 GCC/11.3.0 CUDA/11.8.0  CMake/3.24.3-GCCcore-11.3.0
mkdir -p build-11-8-mpi
cd build-11-8-mpi
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_BUILD_TESTS=ON -DFLAMEGPU_ENABLE_MPI=ON
cmake --build . --target tests_mpi -j `nproc`

I then ran the MPI test suite using a single rank, with a single GPU in my interactive session

mpirun -n 1 bin/Release/tests_mpi

This ran successfully, but took quite a while. Might be worth toning these tests down so they don't take as long when only using a single rank?

Running main() from /users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[       OK ] TestMPIEnsemble.success (101671 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (101131 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.error_off (100341 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_slow (100338 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_fast (10333 ms)
[----------] 5 tests from TestMPIEnsemble (413816 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (413817 ms total)
[  PASSED  ] 5 tests.

Trying to run with 4 processes on a single GPU and 20 CPU cores of an H100 node requires --oversubscribe due to the current Stanage configuration (only one MPI slot available). That configuration skips a number of tests due to the stall, and mpirun then reports an error because the google test processes which skip return a non-zero exit code.

/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33200,1],0]
  Exit code:    1
--------------------------------------------------------------------------
Robadob commented 11 months ago

This ran successfully, but took quite a while. Might be worth toning these tests down so they don't take as long when only using a single rank?

I think Bede runs with 2 GPUs were taking ~3 minutes, so 1 GPU should take ~6 minutes.

Should be trivial to add a constant to scale the time I guess.

ptheywood commented 11 months ago

Re: CI, most MPI installs don't provide binary packages, and most Linux distros only package a single version of each MPI implementation.

I'm prototyping github action step(s) which install MPI from apt if specified, or otherwise from source (prototyping separately, to avoid spamming long-running CI by pushing to this branch). Once it's sorted I'll add it to this branch.

ptheywood commented 11 months ago

MPI CI is passing for all OpenMPI versions, and for MPICH built from source.

MPICH from apt is failing at link time. This appears to be related to -flto, which is enabled for host object compilation (and passed to the host compiler for CUDA objects) but is not being passed at link time?

This is the only build which is adding -flto, so presumably it's an implicit option coming from the mpich installation somehow. I may be able to repro this locally?

ptheywood commented 11 months ago

Installing libmpich-dev on my ubuntu 22.04 install reproduces the error.

The MPICH distributed via ubuntu / debian is the source of the lto flags, as shown by the following

$ mpicxx -compile-info
g++ -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich

Not yet sure how to resolve this in a way that will allow this build to work for end users.

Within my CMakeCache.txt, these flags are in MPI_CXX_COMPILE_OPTIONS, an advanced internal cache variable.

Potentially I could configure with this explicitly set to nothing, i.e. -DMPI_CXX_COMPILE_OPTIONS="", which may work, but that might break other MPI installations so isn't something we'd want to set programmatically. It might also still not result in usable binaries if these flags are actually required.

Edit: Configuring with -DMPI_CXX_COMPILE_OPTIONS="" successfully builds with my local mpi build, and test_mpi appears to run successfully (200s in so far)

MPI_CXX_COMPILE_OPTIONS:STRING=-flto=auto;-ffat-lto-objects;-flto=auto

I could probably detect this at configure time and warn/error about it in CMake? Otherwise, enabling LTO the "proper CMake" way (INTERPROCEDURAL_OPTIMIZATION) might work (I'm skeptical), but might also enable device LTO.

ptheywood commented 11 months ago

It is possible to detect and warn about -flto at configuration time, e.g. in src/CMakeLists.txt after the 3.20.1 warning:

        # If the MPI installation brings in -flto (i.e. Ubuntu 22.04 libmpich-dev), warn about it and suggest a reconfiguration.
        if(MPI_CXX_COMPILE_OPTIONS MATCHES ".*\-flto.*")
            message(WARNING
                " MPI_CXX_COMPILE_OPTIONS contains '-flto' which is likely to result in linker errors. \n"
                " Consider an alternate MPI implementation which does not embed -flto,\n"
                " Or reconfiguring CMake with -DMPI_CXX_COMPILE_OPTIONS=\"\" if linker error occur.")
        endif()
$ cmake .. 
-- -----Configuring Project: flamegpu-----
-- CUDA Architectures: 86
-- RapidJSON found. Headers: /home/ptheywood/code/flamegpu/FLAMEGPU2/build-mpi/_deps/rapidjson-src/include
-- flamegpu version 2.0.0-rc.1+eed4987d
CMake Warning at src/CMakeLists.txt:576 (message):
   MPI_CXX_COMPILE_OPTIONS contains '-flto' which is likely to result in linker errors. 
   Consider an alternate MPI implementation which does not embed -flto,
   Or reconfiguring CMake with -DMPI_CXX_COMPILE_OPTIONS="" if linker errors occur.

This won't fix CI though.

ptheywood commented 11 months ago

I've been back through the changes since I last reviewed this, leaving comments as necessary. Will aim to do some more testing of this on a multi-node system (Bede) tomorrow if possible, and get my head around the potential stall situations, just to make sure I understand them and they're not a wider problem.

Otherwise, just a few more things that need small tweaks which haven't been addressed yet (making the tests faster etc.).

ptheywood commented 11 months ago

From the todo list for this PR:

Should we expose world_size/rank to HostAPI? (I don't think it's necessary)

No, this is accessible by MPI for anyone that really wants / needs it


We should probably check this works from python too though prior to merge, and maybe add a test case for that.

Robadob commented 11 months ago

Otherwise, just a few more things that need small tweaks which haven't been addressed yet (making the tests faster etc.).

I think I've addressed all your points that came through my emails.

We should probably check this works from python too though prior to merge, and maybe add a test case for that.

I can probably get to that Friday.

ptheywood commented 11 months ago

tl;dr

Testing on Stanage using an interactive 2x A100 session:

Commit hash d529d6e81d7e6e4a7dfab56e4b235052dfc3700e


On Stanage, using 1/2 of an A100 node (2 GPUs, 24 cores) interactively:

# Get an interactive session
srun --partition=gpu --qos=gpu --gres=gpu:a100:2 --mem=164G --cpus-per-task 24 --pty bash -i

Then in the interactive job (i.e. on an AMD CPU core, not the Intel login CPU):

# Load Dependencies
module load CUDA/11.8.0 OpenMPI/4.1.4-GCC-11.3.0 GCC/11.3.0 CMake/3.24.3-GCCcore-11.3.0

# Build dir
mkdir -p build-mpi-cu118-gcc-113-ompi414
cd build-mpi-cu118-gcc-113-ompi414
# Configure
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_ENABLE_MPI=ON -DFLAMEGPU_ENABLE_NVTX=ON -DFLAMEGPU_BUILD_TESTS=ON
# Compile
cmake --build . -j `nproc`

Ensemble Example

Running the ensemble example without explicitly using mpirun fails with an MPI error. This could be an MPI configuration thing, but ideally the binary should be usable without MPI, and just not use MPI if not requested (I'm not 100% sure this is possible for all MPI applications, given mpirun is non-standard, so detecting whether MPI is requested might not be possible all the time).

$ ./bin/Release/ensemble
[gpu04.pri.stanage.alces.network:33719] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gpu04.pri.stanage.alces.network:33719] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
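
On the question above of detecting whether MPI was requested: one common but implementation-specific heuristic (nothing in the MPI standard guarantees it, hence the caveat) is to check the launcher's environment variables, e.g.:

    #include <cstdlib>

    // Heuristic sketch only: these variables are set by common launchers,
    // but none of them are required by the MPI standard.
    bool launchedViaMpiLauncher() {  // illustrative helper name
        return std::getenv("OMPI_COMM_WORLD_SIZE") != nullptr  // OpenMPI mpirun/mpiexec
            || std::getenv("PMI_SIZE") != nullptr              // MPICH / Hydra
            || std::getenv("SLURM_NTASKS") != nullptr;         // Slurm srun
    }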

Using mpirun (or mpiexec) with a single rank runs successfully, but the completed message is missing a run (or more; I've seen 95 as well).

$ mpirun -n 1 ./bin/Release/ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 441, calculated init 441
Ensemble result: 40144235200, calculated result 40144235200
Local MPI runner completed 99/100 runs.

Using 2 ranks (2 GPUs) errors with the Stanage configuration from an interactive job, as MPI only believes one slot is available. This might just be how I requested the job. However, using the suggested options it can be run (see below).

$ mpirun -n 2 ./bin/Release/ensemble
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  ./bin/Release/ensemble

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

Using `--oversubscribe` works, and the MPI runner counts add up.

$ mpirun --oversubscribe -n 2 ./bin/Release/ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 213, calculated init 213
Ensemble result: 17859919488, calculated result 17859919488
Local MPI runner completed 45/100 runs.
Ensemble init: 237, calculated init 237
Ensemble result: 22384215712, calculated result 22384215712
Local MPI runner completed 55/100 runs.

Using 4 processes and --oversubscribe also works, even though there are only 2 GPUs.

$ time mpiexec --oversubscribe -n 4 ./bin/Release/ensemble 
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 126, calculated init 128
Ensemble result: 11977340560, calculated result 11977540560
Local MPI runner completed 29/100 runs.
Ensemble init: 70, calculated init 69
Ensemble result: 6409861632, calculated result 6409761632
Local MPI runner completed 16/100 runs.
Ensemble init: 141, calculated init 140
Ensemble result: 11424857856, calculated result 11424757856
Local MPI runner completed 28/100 runs.
Ensemble init: 113, calculated init 113
Ensemble result: 10432075152, calculated result 10432075152
Local MPI runner completed 27/100 runs.

real    0m4.540s
user    0m2.145s
sys     0m16.240s

Test suite

The test suite does not behave as intended when running with mpirun: it just runs all the tests twice with one rank each, rather than running each test once using multiple processes. This might be a google test limitation, in which case we might need to make each MPI test its own binary and orchestrate them via ctest (we can use categories to make it easy to just run the MPI tests).

$ mpirun --oversubscribe -n 2 ./bin/Release/tests_mpi 
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[       OK ] TestMPIEnsemble.success (10767 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success (10772 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (10029 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.success_verbose (10029 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.error_off (10030 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_off (10031 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_slow (10030 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_slow (10030 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_fast (6020 ms)
[----------] 5 tests from TestMPIEnsemble (46879 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (46879 ms total)
[  PASSED  ] 5 tests.
[       OK ] TestMPIEnsemble.error_fast (6020 ms)
[----------] 5 tests from TestMPIEnsemble (46883 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (46883 ms total)
[  PASSED  ] 5 tests.

Using a single rank, tests fail:

$ mpirun -n 1  ./bin/Release/tests_mpi 
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[       OK ] TestMPIEnsemble.success (5951 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (5105 ms)
[ RUN      ] TestMPIEnsemble.error_off
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:201: Failure
Expected equality of these values:
  err_count
    Which is: 0
  1u
    Which is: 1
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:204: Failure
Value of: errors.find("Warning: Run 10 failed on rank ") != std::string::npos
  Actual: false
Expected: true
[  FAILED  ] TestMPIEnsemble.error_off (5106 ms)
[ RUN      ] TestMPIEnsemble.error_slow
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:232: Failure
Value of: errors.find("Warning: Run 10 failed on rank ") != std::string::npos
  Actual: false
Expected: true
[  FAILED  ] TestMPIEnsemble.error_slow (5107 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[  FAILED  ] TestMPIEnsemble.error_fast (5106 ms)
[----------] 5 tests from TestMPIEnsemble (26377 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (26377 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] TestMPIEnsemble.error_off
[  FAILED  ] TestMPIEnsemble.error_slow
[  FAILED  ] TestMPIEnsemble.error_fast

 3 FAILED TESTS
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[56666,1],0]
  Exit code:    1
--------------------------------------------------------------------------
Robadob commented 11 months ago

Re: incorrect progress. I think this is a consequence of the reporting logic

const int progress = static_cast<int>(next_run) - static_cast<int>(TOTAL_RUNNERS * world_size);

It would never report higher than TOTAL_RUNS - TOTAL_RUNNERS with a single MPI rank, as it reports the id of the newly assigned job minus the number of runners, but it jumps straight to exit as soon as next_run exceeds TOTAL_RUNS.

Regardless, I've changed it so the printf instead reports which job index has been assigned to which rank (and got rid of the \r).
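
Roughly this shape (the exact wording and variable names here are illustrative, not the PR's output):

    // Illustrative only: report which run was assigned to which rank, rather
    // than a derived (and misleading) progress count.
    printf("MPI ensemble: assigned run %u/%u to rank %d\n", next_run, TOTAL_RUNS, requesting_rank);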

Robadob commented 11 months ago

I've reworked progress printing, fixed bugs that were causing temperamental test failures, added a few extra tests for better coverage, and added a single pyflamegpu-capable MPI test (i.e. one that is very limited, but won't break if run via MPI).

ptheywood commented 10 months ago

We've tested this on Bede again, using a few different MPI/GCC combos across 2 nodes, for both the tests_mpi and ensemble targets.

I'll try to get the other changes reviewed soon so we can merge this.

ptheywood commented 10 months ago

Code generally looks good now, still need to re-run things in a distributed setting etc.

One final idea for a change is to use MPI_Comm_split_type within the CUDA ensemble to auto-assign GPUs for multiple ranks in a single multi-GPU node, much like how this is handled in the existing tests to support single-rank testing.

I.e.

This is mainly so that systems where it is not possible (or not easy) to request 1 MPI rank per node can be supported. Using 1 rank per node would still (likely) give the best performance.

(This would also make for an interesting benchmark / would let some benchmarking scenarios be covered).

I might do this myself (leaving the comment so I don't forget)

ptheywood commented 10 months ago

I've prototyped creating a new MPI communicator which will only involve MPI ranks which have been assigned a GPU, to prevent errors if users request more MPI ranks per node than GPUs are available, with the surplus ranks doing nothing.

E.g.

    int max_participating_ranks_per_node = 3;  // pretend each node has 3 GPUs / 3 MPI ranks participating
    int color = local_rank < max_participating_ranks_per_node ? 0 : 1;

    MPI_Comm comm_participating;  // communicator containing only the participating ranks
    if (MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comm_participating) != MPI_SUCCESS) {
        fprintf(stderr, "Error creating communicator\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    int participating_size, participating_rank;
    MPI_Comm_size(comm_participating, &participating_size);
    MPI_Comm_rank(comm_participating, &participating_rank);

Launching with 4 ranks on each of 2 nodes, pretending there are 3 GPUs per node, then shows the correct comms:

main rank 0: recieving (6 - 1) messages?
messages from: 2, 1, 4, 5, 6, 
ptheywood commented 10 months ago

MPI tests in debug builds are currently failing on mavericks due to captured output not matching what the test expects. This was the case prior to my device selection changes but hadn't been noticed (it probably just hadn't been run in debug mode for a while).

Ensemble example runs with more ranks than GPUs are working, but the test suite is not behaving (I've likely broken assumptions in the tests now that I've removed the check on the number of MPI ranks / device setting within the test suite).

Also not yet implemented: what to do when devices are specified.

ptheywood commented 9 months ago

Release test suite all sorted now.

@Robadob - if you could skim the commits I've added to make sure you're happy enough with them, that'd be appreciated. I've not touched the user-facing interface at all (other than changing implicit behaviour, which will need a tweak to the docs/API).

Still need to:

Otherwise should be there now (I think I've got all my debug etc removed).

ptheywood commented 9 months ago

The test suite is fixed; the python test suite is crashing in test_logging.py (and maybe others) for an MPI build. I had missed 2 printf's too.

Doing a non-MPI build to figure out whether it's MPI-build specific or not before attempting further debugging (I have a feeling there's a breaking change in here to do with a method return type, which is probably the culprit).

Given the absence of a python_mpi test sub-suite, this might be a bit of effort. Might need to change conftest stuff for telemetry too.

ptheywood commented 9 months ago

5 pytest failures in the non-MPI build fail the python test suite; these need resolving.

2 have been fixed elsewhere, so that just needs a rebase. The log-related ones are an API break?

FAILED ../tests/python/codegen/test_codegen_integration.py::GPUTest::test_gpu_codegen_function_condition - AttributeError: '_SpecialForm' object has no attribute '__name__'
FAILED ../tests/python/codegen/test_codegen_integration.py::GPUTest::test_gpu_codegen_simulation - AttributeError: '_SpecialForm' object has no attribute '__name__'
FAILED ../tests/python/io/test_logging.py::LoggingTest::test_CUDAEnsembleSimulate - AttributeError: 'int' object has no attribute 'getStepLog'
FAILED ../tests/python/simulation/test_cuda_ensemble.py::TestCUDAEnsemble::test_setExitLog - AttributeError: 'int' object has no attribute 'getExitLog'
FAILED ../tests/python/simulation/test_cuda_ensemble.py::TestCUDAEnsemble::test_setStepLog - AttributeError: 'int' object has no attribute 'getStepLog'
ptheywood commented 9 months ago

Tests are now fixed by skipping, via exposing pyflamegpu.MPI to indicate whether MPI was enabled for the build or not.

Ran on bede using

ptheywood commented 9 months ago

Will remove the ternary and the associated checks for it being nullptr'd tomorrow, then this should be good to merge I think (plus updating the docs PR to reflect the changed requirement: up to 1 rank per GPU rather than 1 rank per node, or a warning will be given).

ptheywood commented 9 months ago

Tweaked the final suggestions, so I think this is good to go now.

Not sure Rob can review this PR though, so I (or Paul) will have to approve it?