FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU accelerated agent based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

Distributed Ensemble (MPI Support) #1090

Closed Robadob closed 9 months ago

Robadob commented 1 year ago

The implementation of MPI Ensembles within this PR is designed for each CUDAEnsemble to have exclusive access to all of the GPUs available to it (or those specified with the devices config). Ideally a user will launch 1 MPI worker per node, although it could also be 1 worker per GPU per node.

It would be possible to use MPI shared-memory groups to identify workers on the same node and negotiate the division of GPUs, and/or for some workers to become idle; however, this has not been implemented.
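
For reference, a minimal sketch of how that negotiation could begin, using a shared-memory communicator split (an illustration of the approach only, not code from this PR):

    // Illustration only (not implemented in this PR): identify workers that
    // share a node via an MPI shared-memory communicator split.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    int local_rank = 0, local_size = 0;
    MPI_Comm_rank(node_comm, &local_rank);  // this worker's rank within its node
    MPI_Comm_size(node_comm, &local_size);  // number of workers sharing this node
    // local_rank / local_size could then be used to divide the node's GPUs,
    // or to decide that surplus workers should idle.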

Full notes of identified edge cases are in the below todo list.


Closes #1073 Closes #1114


Edit for visibility (by @ptheywood) - the next release notes need to make clear that this includes a breaking change: the return type of CUDAEnsemble::getLogs changes from a std::vector to a std::map.

Robadob commented 1 year ago

I've created a simple test case that will either run on a local node (if worker count <= device count) or across multiple nodes (I could probably extend this to ensure it's 1 worker per node, but that would require some MPI comms to set up the test).

The issue with MPI testing is that MPI_Init() and MPI_Finalize() can only be called once per process. Because CUDAEnsemble automatically cleans up and triggers MPI_Finalize(), which waits for all runners to also call it, a second MPI test case cannot be run.

Perhaps an argument for Pete's CMake test magic, as I understand that runs the test suite once per individual test. An alternative would be to add a backdoor that tells the Ensemble not to finalize when it detects tests (and add some internal finalize equivalent to ensure sync).

Requires discussion/agreement.

Simplest option would be to provide a CUDAEnsemble config to disable auto finalize, and expose a finalize wrapper to users.

The only possible use-case I can see for a distributed ensemble calling CUDAEnsemble::simulate() multiple times would be a large genetic algorithm. If we wish to support that, then it will be affected by this too.


Changes for err test

Add this to FLAMEGPU_STEP_FUNCTION(model_step)

    if (FLAMEGPU->getStepCounter() == 1 && counter%13==0) {
        throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
    }

Add this to the actual test body, adjust through Off, Slow, Fast.

        ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Fast;
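
Put together, a minimal sketch of such a test might look like the following (assumptions: counter is an int environment property varied per RunPlan, and model / plans come from the usual ensemble test setup):

    // Sketch only: "counter" is assumed to be an int environment property that
    // is varied per RunPlan, so roughly 1 in 13 runs throws on the first step.
    FLAMEGPU_STEP_FUNCTION(model_step) {
        const int counter = FLAMEGPU->environment.getProperty<int>("counter");
        if (FLAMEGPU->getStepCounter() == 1 && counter % 13 == 0) {
            throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
        }
    }

    // In the test body: select the error handling level under test.
    flamegpu::CUDAEnsemble ensemble(model);
    ensemble.Config().error_level = flamegpu::CUDAEnsemble::EnsembleConfig::Fast;  // adjust through Off, Slow, Fast
    ensemble.simulate(plans);  // with Fast, the first failing run should abort the ensemble
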
Robadob commented 1 year ago

https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/

Setting up MPI to run across mav+waimu seems a bit involved; probably better to try Bede. I would hope the fact it works on a single node is evidence that it will work, though.

Robadob commented 1 year ago

Had TestMPIEnsemble.local segfault at ~13/100 running on Waimea without MPI (at which point it should bypass MPI and just run as a normal CUDAEnsemble). Unable to reproduce; repeating the test passed. Possible rare race condition.

Robadob commented 1 year ago

Happy for this to be tested on Bede and merged whilst I'm on leave. Functionality should be complete, may just want to test on Bede and refine how we wish to test it (e.g. make it ctest exclusive and include error handling test).

Robadob commented 1 year ago

Had TestMPIEnsemble.local segfault at ~13/100 running on Waimea without MPI (at which point it should bypass MPI and just run as a normal CUDAEnsemble). Unable to reproduce; repeating the test passed. Possible rare race condition.

This happened a second time. Currently just throwing it through gdb over and over to try and catch it.

Curiously this second time was directly after a recompile, so possibly only occurs when GPUs have dropped into low power state?

Robadob commented 1 year ago

Caught it: a race condition when adding the run log (previously we had pre-allocated a vector, so no mutex was required).

ptheywood commented 1 year ago

I'll review this and test it on Bede while you're on leave, and try to figure out a decent way to test it (and maybe move MPI_Finalize to cleanup or similar, though again that would mean it can only be tested once).

Robadob commented 1 year ago

As discussed with @ptheywood (on Slack), will move MPI_Finalize() to cleanup() and replace it with MPI_Barrier() (to ensure synchronisation before all workers leave the call to CUDAEnsemble::simulate()).

This will require adjustments to the documentation and tests.
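
A rough sketch of the agreed shape (illustrative only, not the final code):

    // At the end of CUDAEnsemble::simulate(): synchronise instead of finalising,
    // so simulate() can be called more than once per process.
    MPI_Barrier(MPI_COMM_WORLD);

    // Later, once per process (e.g. in the library's cleanup), finalise safely:
    int initialized = 0, finalized = 0;
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    if (initialized && !finalized) {
        MPI_Finalize();
    }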

ptheywood commented 1 year ago

Also need to consider how this will behave with telemetry: a flag indicating MPI use, the number of ranks(?), how to report the list of devices from each node, etc.

(this is a note mostly for me when I review this in the near future)

Robadob commented 1 year ago

I can throw in a telemetry envelope at the final barrier if desired, so rank 0 receives all GPU names.
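
For example, one way rank 0 could collect the device names at that point (a sketch of the approach only, not the PR's implementation; world_rank and world_size are assumed to be available):

    // Assumed sketch: fixed-width name buffers so a single MPI_Gather suffices.
    char local_names[256] = {0};
    // ... fill local_names with this rank's GPU name(s) ...
    std::vector<char> all_names;
    if (world_rank == 0)
        all_names.resize(static_cast<size_t>(world_size) * 256);
    MPI_Gather(local_names, 256, MPI_CHAR,
               all_names.data(), 256, MPI_CHAR, 0, MPI_COMM_WORLD);
    // Rank 0 can now append the gathered names to the telemetry payload.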

Robadob commented 1 year ago

I've now added MPI to the README requirements and ensured all tests pass both with local MPI and without MPI.

ptheywood commented 1 year ago

Should we expose world_size/rank to HostAPI? (I don't think it's necessary)

I agree it's not necessary to expose it ourselves; it's globally available, so anyone who needs it will be able to access it directly themselves (with appropriate guarding).

As there's only one rank per node (based on the docs PR), it doesn't help with uniqueness checks: the rank will currently be the same for all simulations within a node, so the run plan index or similar will still need to be used for uniqueness checks.
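
For example, a sketch of one way to give each run a unique identifier via its run plan (assumes a "run_index" environment property has been declared on the model; illustrative only):

    // Assumed sketch: tag each run with its index, since MPI rank alone cannot
    // distinguish simulations that execute on the same node.
    flamegpu::RunPlanVector plans(model, 100);
    for (unsigned int i = 0; i < plans.size(); ++i) {
        plans[i].setProperty<unsigned int>("run_index", i);  // "run_index" is an assumed env property
    }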

ptheywood commented 1 year ago

This currently does not compile for my current MPI + CUDA + GCC versions.

With CUDA 12.2, GCC 11.4.0 and OpenMPI 4.1.2.

There's no MPI coverage on CI, which might not even have caught this if it is version specific.

Current error is:

/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/simulation/detail/AbstractSimRunner.h(55): error: expression must have a constant value
                                                             (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned))))
                                                                          ^

The offending line is

            constexpr MPI_Datatype array_of_types[count] = {MPI_UNSIGNED, MPI_UNSIGNED, MPI_UNSIGNED, MPI_CHAR};

Removing the constexpr qualifier allows this to compile.
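
i.e. something like the following should be portable, since OpenMPI's datatype handles are cast pointers to global objects rather than compile-time constants:

    // const (not constexpr): MPI_Datatype values need not be constant expressions.
    const MPI_Datatype array_of_types[count] = {MPI_UNSIGNED, MPI_UNSIGNED, MPI_UNSIGNED, MPI_CHAR};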

Do you know which MPI / GCC / CUDA you compiled with previously where this worked?

We probably also need to pin down the oldest MPI we would support.

Robadob commented 1 year ago

Do you know which MPI / GCC / CUDA you compiled with previously where this worked?

Would be whatever my bashrc on waimu defaults to, I guess.

Robadob commented 1 year ago

as running it with MPI enabled means its internal validation doesn't work.

That's why there's a disable mpi config option ;)

[100%] Built target ensemble
rob@waimea:~/FLAMEGPU2/build$ mpirun -n 2 bin/Debug/ensemble
CUDAEnsemble completed 100 runs successfully!
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 450, calculated init 450
Ensemble result: 40244135200, calculated result 40244135200
Ensemble init: 450, calculated init 450
Ensemble result: 40244135200, calculated result 40244135200
Robadob commented 1 year ago

To be discussed:

Robadob commented 1 year ago

Agreed

Robadob commented 1 year ago

MPI file specific CI to test multiple MPI versions

I've created a new MPI workflow. However, as expected, it fails to install specific versions of mpich/openmpi via apt-get. Will wait for @ptheywood's return to advise on the best method to make them available (my natural next step would be to build from source).

add slurm script to docs.

I have semi-successfully run the MPI test suite on Bede.

#!/bin/bash

# Generic options:

#SBATCH --account=bdshe03 # Run job under project <project>
#SBATCH --time=0:10:0         # Run for a max of 10 mins

# Node resources:

#SBATCH --partition=gpu    # Choose either "gpu" or "infer" node type
#SBATCH --nodes=2          # Resources from two nodes
#SBATCH --gres=gpu:1       # 1 GPU per node

# Run commands:

# 1ppn == 1 process per node
bede-mpirun --bede-par 1ppn ./build/bin/Release/tests_mpi

Produces the intermingled log

Running main() from /users/robadob/fgpu2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
Running main() from /users/robadob/fgpu2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[gpu030.bede.dur.ac.uk:696907] pml_ucx.c:291  Error: Failed to create UCP worker
[gpu031.bede.dur.ac.uk:2817740] pml_ucx.c:291  Error: Failed to create UCP worker
[       OK ] TestMPIEnsemble.success (56576 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success (56500 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (50446 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.success_verbose (50446 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.error_off (50467 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_off (50467 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_slow (50463 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_slow (50462 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_fast (6057 ms)
[----------] 5 tests from TestMPIEnsemble (214011 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (214011 ms total)
[  PASSED  ] 5 tests.
[       OK ] TestMPIEnsemble.error_fast (6057 ms)
[----------] 5 tests from TestMPIEnsemble (213934 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (213934 ms total)
[  PASSED  ] 5 tests. 

Of note:

Robadob commented 1 year ago

Output from the updated ensemble example using mpirun -n 2 on waimu (I didn't do any special hacks to make sure GPUs are unique, but it still worked).

rob@waimea:~/FLAMEGPU2/build/bin/Debug$ mpirun -n 2 ./ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 218, calculated init 218
Ensemble result: 22162315712, calculated result 22162315712
Local MPI runner completed 51/100 runs.
Ensemble init: 232, calculated init 232
Ensemble result: 18081819488, calculated result 18081819488
Local MPI runner completed 49/100 runs.
Robadob commented 1 year ago

Status

mondus commented 1 year ago

Suggest to try with mvapich2 from Bede docs.

Robadob commented 1 year ago

Document how to run test suite with mpirun

Robadob commented 1 year ago

Suggest to try with mvapich2 from Bede docs.

When built with mvapich2 and executed using bede-mpirun, this error is received (~3 attempts with slight changes).

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Google suggests it's an MPI misconfiguration problem.

The mvapich2 mpirun docs aren't great, and it doesn't have a convenient parameter like OpenMPI's for 1 process per node. I can't work out the commands required to bypass bede-mpirun.

ptheywood commented 1 year ago

I've added a warning in CMake if using MPI and CMake < 3.20.1 to address #1114, which outputs the following.

CMake Warning at src/CMakeLists.txt:569 (message):
  CMake < 3.20.1 may result in link errors with FLAMEGPU_ENABLE_MPI=ON for
  some MPI installations.  Consider using CMake >= 3.20.1 to avoid linker
  errors.
ptheywood commented 11 months ago

Stanage znver3 (i.e. for the GPU nodes) includes OpenMPI, so we can use Stanage's A100 and H100 nodes for x86_64, single-node, up to 4 GPU MPI testing as it currently stands.

From an A100/H100 node, i.e. the following successfully configured and compiled from an H100 node:

module load OpenMPI/4.1.4-GCC-11.3.0 GCC/11.3.0 CUDA/11.8.0  CMake/3.24.3-GCCcore-11.3.0
mkdir -p build-11-8-mpi
cd build-11-8-mpi
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_BUILD_TESTS=ON -DFLAMEGPU_ENABLE_MPI=ON
cmake --build . --target tests_mpi -j `nproc`

I then ran the MPI test suite using a single rank, with a single GPU in my interactive session

mpirun -n 1 bin/Release/tests_mpi

This ran successfully, but took quite a while. Might be worth toning these tests down so they don't take as long when only using a single rank?

Running main() from /users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[       OK ] TestMPIEnsemble.success (101671 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (101131 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.error_off (100341 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_slow (100338 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_fast (10333 ms)
[----------] 5 tests from TestMPIEnsemble (413816 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (413817 ms total)
[  PASSED  ] 5 tests.

Trying to run with 4 processes on a single GPU and 20 CPU cores of an H100 node requires --oversubscribe due to the current Stanage configuration (only one MPI slot available). That configuration skips a number of tests due to the stall, and mpirun then reports an error because the google test processes which skip return a non-zero exit code.

/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33200,1],0]
  Exit code:    1
--------------------------------------------------------------------------
Robadob commented 11 months ago

This ran successfully, but took quite a while. Might be worth toning these tests down so they don't take as long when only using a single rank?

I think Bede runs with 2 GPUs were taking ~3 minutes, so 1 GPU should take ~6 minutes.

Should be trivial to add a constant to scale the time I guess.

ptheywood commented 11 months ago

Re: CI, most MPI installs don't provide binary packages, and most Linux distros only package a single version of each MPI implementation.

I'm prototyping github action step(s) which install MPI from apt if specified, or otherwise from source (prototyping separately, to avoid spamming long-running CI by pushing to this branch). Once it's sorted I'll add it to this branch.

ptheywood commented 11 months ago

MPI CI is passing for all OpenMPI versions, and for MPICH built from source.

MPICH from apt is failing at link time. This appears to be related to -flto, which is enabled for host object compilation (and passed to the host compiler for CUDA objects) but is not being passed at link time?

This is the only build which is adding -flto, so presumably it's an implicit option coming from the mpich installation somehow. I may be able to repro this locally?

ptheywood commented 11 months ago

Installing libmpich-dev on my ubuntu 22.04 install reproduces the error.

The MPICH distributed via ubuntu / debian is the source of the lto flags, as shown by the following

$ mpicxx -compile-info
g++ -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich

Not yet sure how to resolve this in a way that will allow this build to work for end users.

Within my CMakeCache.txt, these flags are in MPI_CXX_COMPILE_OPTIONS, an advanced internal cache variable.

Potentially I could configure with this explicitly set to nothing, i.e. -DMPI_CXX_COMPILE_OPTIONS="", which may work, but that might break other MPI installations so isn't something we'd want to set programmatically. It might also still not result in usable binaries if these flags are actually required.

Edit: Configuring with -DMPI_CXX_COMPILE_OPTIONS="" successfully builds with my local mpi build, and test_mpi appears to run successfully (200s in so far)

MPI_CXX_COMPILE_OPTIONS:STRING=-flto=auto;-ffat-lto-objects;-flto=auto

I could probably detect this at configure time and warn/error about it in CMake? Otherwise, enabling LTO the "proper CMake" way (INTERPROCEDURAL_OPTIMIZATION) might work (I'm skeptical), but might also enable device LTO.

ptheywood commented 11 months ago

It is possible to detect and warn about -flto at configuration time, e.g. in src/CMakeLists.txt after the 3.20.1 warning:

        # If the MPI installation brings in -flto (i.e. Ubuntu 22.04 libmpich-dev), warn about it and suggest a reconfiguration.
        if(MPI_CXX_COMPILE_OPTIONS MATCHES ".*\-flto.*")
            message(WARNING
                " MPI_CXX_COMPILE_OPTIONS contains '-flto' which is likely to result in linker errors. \n"
                " Consider an alternate MPI implementation which does not embed -flto,\n"
                " Or reconfiguring CMake with -DMPI_CXX_COMPILE_OPTIONS=\"\" if linker error occur.")
        endif()
$ cmake .. 
-- -----Configuring Project: flamegpu-----
-- CUDA Architectures: 86
-- RapidJSON found. Headers: /home/ptheywood/code/flamegpu/FLAMEGPU2/build-mpi/_deps/rapidjson-src/include
-- flamegpu version 2.0.0-rc.1+eed4987d
CMake Warning at src/CMakeLists.txt:576 (message):
   MPI_CXX_COMPILE_OPTIONS contains '-flto' which is likely to result in linker errors. 
   Consider an alternate MPI implementation which does not embed -flto,
   Or reconfiguring CMake with -DMPI_CXX_COMPILE_OPTIONS="" if linker errors occur.

This won't fix CI though.

ptheywood commented 11 months ago

I've been back through the changes since I last reviewed this, leaving comments as necessary. Will aim to do some more testing of this on a multi-node system (Bede) tomorrow if possible, and get my head around the potential stall situations, just to make sure I understand them and they're not a wider problem.

Otherwise, just a few more things that need small tweaks which haven't been addressed yet (making the tests faster etc.).

ptheywood commented 11 months ago

From the todo list for this PR:

Should we expose world_size/rank to HostAPI? (I don't think it's necessary)

No, this is accessible by MPI for anyone that really wants / needs it


We should probably check this works from python too though prior to merge, and maybe add a test case for that.

Robadob commented 11 months ago

Otherwise, just a few more things that need small tweaks which haven't been addressed yet (making the tests faster etc.).

I think I've addressed all your points that came through my emails.

We should probably check this works from python too though prior to merge, and maybe add a test case for that.

I can probably get to that Friday.

ptheywood commented 11 months ago

tl;dr

Testing on Stanage using an interactive 2x A100 session:

Commit hash d529d6e81d7e6e4a7dfab56e4b235052dfc3700e


On Stanage, using 1/2 of an A100 node (2 GPUs, 24 cores) interactively:

# Get an interactive session
srun --partition=gpu --qos=gpu --gres=gpu:a100:2 --mem=164G --cpus-per-task 24 --pty bash -i

Then in the interactive job (i.e. on an AMD CPU core, not the Intel login CPU):

# Load Dependencies
module load CUDA/11.8.0 OpenMPI/4.1.4-GCC-11.3.0 GCC/11.3.0 CMake/3.24.3-GCCcore-11.3.0

# Build dir
mkdir -p build-mpi-cu118-gcc-113-ompi414
cd build-mpi-cu118-gcc-113-ompi414
# Configure
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_ENABLE_MPI=ON -DFLAMEGPU_ENABLE_NVTX=ON -DFLAMEGPU_BUILD_TESTS=ON
# Compile
cmake --build . -j `nproc`

Ensemble Example

Running the ensemble example without explicitly using mpirun fails with an MPI error. This could be an MPI configuration thing, but ideally the binary should be usable without MPI, and just not use MPI if not requested (I'm not 100% sure this is possible for all MPI applications, given mpirun is non-standard, so detecting whether MPI is requested might not be possible all the time).

$ ./bin/Release/ensemble
[gpu04.pri.stanage.alces.network:33719] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gpu04.pri.stanage.alces.network:33719] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
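
On the question above of detecting whether MPI was requested: one common but implementation-specific heuristic (nothing in the MPI standard guarantees it, hence the caveat) is to check the launcher's environment variables, e.g.:

    #include <cstdlib>

    // Heuristic sketch only: these variables are set by common launchers,
    // but none of them are required by the MPI standard.
    bool launchedViaMpiLauncher() {  // illustrative helper name
        return std::getenv("OMPI_COMM_WORLD_SIZE") != nullptr  // OpenMPI mpirun/mpiexec
            || std::getenv("PMI_SIZE") != nullptr              // MPICH / Hydra
            || std::getenv("SLURM_NTASKS") != nullptr;         // Slurm srun
    }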

Using mpirun (or mpiexec) with a single rank runs successfully, but the completed message is missing a run (or more; I've seen 95 as well).

$ mpirun -n 1 ./bin/Release/ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 441, calculated init 441
Ensemble result: 40144235200, calculated result 40144235200
Local MPI runner completed 99/100 runs.

Using 2 ranks (2 GPUs) errors with the Stanage configuration from an interactive job, as MPI only believes one slot is available. This might just be how I requested the job. However, using the suggested options it can be run (see below).

$ mpirun -n 2 ./bin/Release/ensemble
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  ./bin/Release/ensemble

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

Using `--oversubscribe` works, and the MPI runner counts add up.

$ mpirun --oversubscribe -n 2 ./bin/Release/ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 213, calculated init 213
Ensemble result: 17859919488, calculated result 17859919488
Local MPI runner completed 45/100 runs.
Ensemble init: 237, calculated init 237
Ensemble result: 22384215712, calculated result 22384215712
Local MPI runner completed 55/100 runs.

Using 4 processes and --oversubscribe also works, even though there are only 2 GPUs.

$ time mpiexec --oversubscribe -n 4 ./bin/Release/ensemble 
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 126, calculated init 128
Ensemble result: 11977340560, calculated result 11977540560
Local MPI runner completed 29/100 runs.
Ensemble init: 70, calculated init 69
Ensemble result: 6409861632, calculated result 6409761632
Local MPI runner completed 16/100 runs.
Ensemble init: 141, calculated init 140
Ensemble result: 11424857856, calculated result 11424757856
Local MPI runner completed 28/100 runs.
Ensemble init: 113, calculated init 113
Ensemble result: 10432075152, calculated result 10432075152
Local MPI runner completed 27/100 runs.

real    0m4.540s
user    0m2.145s
sys     0m16.240s

Test suite

The test suite does not behave as intended when running with mpirun: it just runs all the tests twice with one rank each, rather than running each test once using multiple processes. This might be a google test limitation, in which case we might need to make each MPI test its own binary and orchestrate them via ctest (we can use categories to make it easy to just run the MPI tests).

$ mpirun --oversubscribe -n 2 ./bin/Release/tests_mpi 
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[       OK ] TestMPIEnsemble.success (10767 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success (10772 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (10029 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.success_verbose (10029 ms)
[ RUN      ] TestMPIEnsemble.error_off
[       OK ] TestMPIEnsemble.error_off (10030 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_off (10031 ms)
[ RUN      ] TestMPIEnsemble.error_slow
[       OK ] TestMPIEnsemble.error_slow (10030 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_slow (10030 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[       OK ] TestMPIEnsemble.error_fast (6020 ms)
[----------] 5 tests from TestMPIEnsemble (46879 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (46879 ms total)
[  PASSED  ] 5 tests.
[       OK ] TestMPIEnsemble.error_fast (6020 ms)
[----------] 5 tests from TestMPIEnsemble (46883 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (46883 ms total)
[  PASSED  ] 5 tests.

Using a single rank, tests fail:

$ mpirun -n 1  ./bin/Release/tests_mpi 
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN      ] TestMPIEnsemble.success
[       OK ] TestMPIEnsemble.success (5951 ms)
[ RUN      ] TestMPIEnsemble.success_verbose
[       OK ] TestMPIEnsemble.success_verbose (5105 ms)
[ RUN      ] TestMPIEnsemble.error_off
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:201: Failure
Expected equality of these values:
  err_count
    Which is: 0
  1u
    Which is: 1
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:204: Failure
Value of: errors.find("Warning: Run 10 failed on rank ") != std::string::npos
  Actual: false
Expected: true
[  FAILED  ] TestMPIEnsemble.error_off (5106 ms)
[ RUN      ] TestMPIEnsemble.error_slow
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:232: Failure
Value of: errors.find("Warning: Run 10 failed on rank ") != std::string::npos
  Actual: false
Expected: true
[  FAILED  ] TestMPIEnsemble.error_slow (5107 ms)
[ RUN      ] TestMPIEnsemble.error_fast
[  FAILED  ] TestMPIEnsemble.error_fast (5106 ms)
[----------] 5 tests from TestMPIEnsemble (26377 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (26377 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] TestMPIEnsemble.error_off
[  FAILED  ] TestMPIEnsemble.error_slow
[  FAILED  ] TestMPIEnsemble.error_fast

 3 FAILED TESTS
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[56666,1],0]
  Exit code:    1
--------------------------------------------------------------------------
Robadob commented 11 months ago

Re: incorrect progress. I think this is a consequence of the reporting logic

const int progress = static_cast<int>(next_run) - static_cast<int>(TOTAL_RUNNERS * world_size);

It would never report higher than TOTAL_RUNS - TOTAL_RUNNERS with a single MPI rank, as it reports the id of the newly assigned job minus the number of runners, but it jumps straight to exit as soon as next_run exceeds TOTAL_RUNS.

Regardless, I've changed it so the printf instead reports which job index has been assigned to which rank (and got rid of the \r).
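
Roughly this shape (the exact wording and variable names here are illustrative, not the PR's output):

    // Illustrative only: report which run was assigned to which rank, rather
    // than a derived (and misleading) progress count.
    printf("MPI ensemble: assigned run %u/%u to rank %d\n", next_run, TOTAL_RUNS, requesting_rank);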

Robadob commented 11 months ago

I've reworked progress printing, fixed bugs that were causing temperamental test failures, added a few extra tests for better coverage, and added a single pyflamegpu-capable MPI test (i.e. one that is very limited, but won't break if run via MPI).

ptheywood commented 10 months ago

We've tested this on Bede again, using a few different MPI/GCC combos across 2 nodes, for both the tests_mpi and ensemble targets.

I'll try to get the other changes reviewed soon so we can merge this.

ptheywood commented 10 months ago

Code generally looks good now, still need to re-run things in a distributed setting etc.

One final idea for a change is to use MPI_Comm_split_type within the CUDA ensemble to auto-assign GPUs for multiple ranks in a single multi-GPU node, much like how this is handled in the existing tests to support single-rank testing.

I.e.

This is mainly so that systems where it is not possible (or not easy) to request 1 MPI rank per node can be supported. Using 1 rank per node would still (likely) give the best performance.

(This would also make for an interesting benchmark / would let some benchmarking scenarios be covered).

I might do this myself (leaving the comment so I don't forget)

ptheywood commented 10 months ago

I've prototyped creating a new MPI communicator which will only involve MPI ranks which have been assigned a GPU, to prevent errors if users request more MPI ranks per node than GPUs are available, with the surplus ranks doing nothing.

E.g.

    int max_participating_ranks_per_node = 3;  // pretend each node has 3 GPUs / 3 MPI ranks participating
    int color = local_rank < max_participating_ranks_per_node ? 0 : 1;

    MPI_Comm comm_participating;  // communicator containing only the participating ranks
    if (MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comm_participating) != MPI_SUCCESS) {
        fprintf(stderr, "Error creating communicator\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    int participating_size, participating_rank;
    MPI_Comm_size(comm_participating, &participating_size);
    MPI_Comm_rank(comm_participating, &participating_rank);

Launching with 4 ranks on each of 2 nodes, pretending there are 3 GPUs per node, then shows the correct comms:

main rank 0: recieving (6 - 1) messages?
messages from: 2, 1, 4, 5, 6, 
ptheywood commented 10 months ago

MPI tests in debug builds are currently failing on mavericks due to captured output not matching what the test expects. This was the case prior to my device selection changes but hadn't been noticed (it probably just hadn't been run in debug mode for a while).

Ensemble example runs with more ranks than GPUs are working, but the test suite is not behaving (I've likely broken assumptions in the tests now that I've removed the check on the number of MPI ranks / device setting within the test suite).

Also not yet implemented: what to do when devices are specified.

ptheywood commented 9 months ago

Release test suite all sorted now.

@Robadob - if you could skim the commits I've added to make sure you're happy enough with them, that'd be appreciated. I've not touched the user-facing interface at all (other than changing implicit behaviour, which will need a tweak to the docs/API).

Still need to:

Otherwise should be there now (I think I've got all my debug etc removed).

ptheywood commented 9 months ago

The test suite is fixed; the python test suite is crashing in test_logging.py (and maybe others) for an MPI build. I had missed 2 printf's too.

Doing a non-MPI build to figure out whether it's MPI-build specific or not before attempting further debugging (I have a feeling there's a breaking change in here to do with a method return type, which is probably the culprit).

Given the absence of a python_mpi test sub-suite, this might be a bit of effort. Might need to change conftest stuff for telemetry too.

ptheywood commented 9 months ago

5 pytest failures in the non-MPI build fail the python test suite; these need resolving.

2 have been fixed elsewhere, so that just needs a rebase. The log-related ones are an API break?

FAILED ../tests/python/codegen/test_codegen_integration.py::GPUTest::test_gpu_codegen_function_condition - AttributeError: '_SpecialForm' object has no attribute '__name__'
FAILED ../tests/python/codegen/test_codegen_integration.py::GPUTest::test_gpu_codegen_simulation - AttributeError: '_SpecialForm' object has no attribute '__name__'
FAILED ../tests/python/io/test_logging.py::LoggingTest::test_CUDAEnsembleSimulate - AttributeError: 'int' object has no attribute 'getStepLog'
FAILED ../tests/python/simulation/test_cuda_ensemble.py::TestCUDAEnsemble::test_setExitLog - AttributeError: 'int' object has no attribute 'getExitLog'
FAILED ../tests/python/simulation/test_cuda_ensemble.py::TestCUDAEnsemble::test_setStepLog - AttributeError: 'int' object has no attribute 'getStepLog'
ptheywood commented 9 months ago

Tests are now fixed by skipping, via exposing pyflamegpu.MPI to indicate whether MPI was enabled for the build or not.

Ran on bede using

ptheywood commented 9 months ago

Will remove the ternary and the associated checks for it being nullptr'd tomorrow, then this should be good to merge I think (plus updating the docs PR to reflect the changed requirement: up to 1 rank per GPU rather than 1 rank per node, or a warning will be given).

ptheywood commented 9 months ago

Tweaked the final suggestions, so I think this is good to go now.

Not sure Rob can review this PR though, so I (or Paul) will have to approve it?