I've created a simple test case that will either run on a local node (if worker count <= device count) or across multiple nodes (I could probably extend this to ensure it's 1 worker per node, but that would require some MPI comms to set up the test).
The issue with MPI testing is that MPI_Init() and MPI_Finalize() can only be called once per process. Because CUDAEnsemble auto-cleans up and triggers MPI_Finalize(), which waits for all runners to also call it, a second MPI test case cannot be run.
Perhaps this is an argument for Pete's CMake test magic, as I understand that runs the test suite separately for each individual test. An alternative would be to add a backdoor that tells the Ensemble not to finalize when it detects tests (and add some internal finalize equivalent to ensure sync).
Requires discussion/agreement.
The simplest option would be to provide a CUDAEnsemble config to disable auto finalize, and expose a finalize wrapper to users.
The only possible use-case I can see for a distributed ensemble calling CUDAEnsemble::simulate() multiple times would be a large genetic algorithm. If we wish to support that, then it will be affected by this too.
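For reference, a minimal sketch (not FLAMEGPU's actual API) of how the once-per-process constraint is usually worked around, using MPI's own MPI_Initialized()/MPI_Finalized() queries so repeated calls are safe within a single process:

#include <mpi.h>

// Call before any MPI usage; only the first call per process initialises MPI.
void ensure_mpi_init() {
    int initialised = 0;
    MPI_Initialized(&initialised);   // legal to call before MPI_Init
    if (!initialised) {
        int provided = 0;
        MPI_Init_thread(nullptr, nullptr, MPI_THREAD_SINGLE, &provided);
    }
}

// Call at process shutdown; MPI_Finalize() is collective and must run exactly once.
void ensure_mpi_finalize() {
    int finalised = 0;
    MPI_Finalized(&finalised);
    if (!finalised) {
        MPI_Finalize();
    }
}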
Changes for the error test
Add this to FLAMEGPU_STEP_FUNCTION(model_step):
if (FLAMEGPU->getStepCounter() == 1 && counter % 13 == 0) {
    throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
}
Add this to the actual test body, adjusting through Off, Slow and Fast:
ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Fast;
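For context, the three error levels cycled through are members of CUDAEnsemble::EnsembleConfig; the per-level comments below are my reading of their behaviour and should be treated as assumptions rather than documented fact.

ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Off;   // failed runs are reported but no exception is raised
ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Slow;  // all runs complete, then an error is raised if any failed
ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Fast;  // raise an error as soon as a run fails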
https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
Setting up MPI to run across mav+waimu seems a bit involved, probably better to try Bede. I would hope the fact it works on single node is evidence that it will work though.
Had TestMPIEnsemble.local segfault at ~13/100 running on Waimea without MPI (at which point it should bypass MPI and just run as a normal CUDAEnsemble). Unable to reproduce; repeating the test passed. Possibly a rare race condition.
Happy for this to be tested on Bede and merged whilst I'm on leave. Functionality should be complete, may just want to test on Bede and refine how we wish to test it (e.g. make it ctest exclusive and include error handling test).
Had TestMPIEnsemble.local segfault at ~13/100 running on Waimea without MPI (at which point it should bypass MPI and just run as a normal CUDAEnsemble). Unable to reproduce; repeating the test passed. Possibly a rare race condition.
This happened a second time. Currently just throwing it through gdb over and over to try and catch it.
Curiously this second time was directly after a recompile, so possibly only occurs when GPUs have dropped into low power state?
Caught it: a race condition when adding the run log (previously we had pre-allocated a vector, so no mutex was required).
I'll review this and test it on Bede while you're on leave, and try to figure out a decent way to test it (and maybe move MPI_Finalize() to cleanup or similar, though again that would mean it can only be tested once).
As discussed with @ptheywood (on Slack), I will move MPI_Finalize() to cleanup(), and replace it with an MPI_Barrier() (to ensure synchronisation before all workers leave the call to CUDAEnsemble::simulate()).
This will require adjustments to the documentation and tests.
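A minimal sketch of the agreed structure, with stand-in function names rather than the real FLAMEGPU entry points:

#include <mpi.h>

// End of the ensemble's simulate(): synchronise, but do not finalise, so
// simulate() can be called again (e.g. by further test cases or a GA loop).
void on_simulate_exit() {
    MPI_Barrier(MPI_COMM_WORLD);   // all ranks leave simulate() together
}

// Library-level cleanup (stand-in for flamegpu's cleanup()): the single place
// MPI_Finalize() is called, once per process.
void on_cleanup() {
    MPI_Finalize();
}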
Also need to consider how this will behave with telemetry: a flag indicating MPI use, the number of ranks(?), how to collect the list of devices from each node, etc.
(this is a note mostly for me when I review this in the near future)
I can throw in a telemetry envelope at the final barrier if desired so rank 0 receives all gpu names.
I've now added MPI to readme requirements and ensured all tests pass with local MPI and sans MPI.
Should we expose world_size/rank to the HostAPI? (I don't think it's necessary.)
I agree it's not necessary to expose it ourselves; it's globally available, so anyone who needs it will be able to access it directly themselves (with appropriate guarding).
As there's only one rank per node (based on the docs PR), it doesn't help with uniqueness checks either: the rank will currently be the same for all simulations within a node, so the run plan index or similar will still need to be used for uniqueness checks.
This currently does not compile for my current MPI + CUDA + GCC versions.
With CUDA 12.2, GCC 11.4.0 and OpenMPI 4.1.2.
There's no MPI coverage on CI, which might not even have caught this if it is version specific.
Current error is:
/home/ptheywood/code/flamegpu/FLAMEGPU2/include/flamegpu/simulation/detail/AbstractSimRunner.h(55): error: expression must have a constant value
(static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned))))
^
The offending line is
constexpr MPI_Datatype array_of_types[count] = {MPI_UNSIGNED, MPI_UNSIGNED, MPI_UNSIGNED, MPI_CHAR};
Removing the constexpr qualifier allows this to compile.
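For context, OpenMPI defines MPI_UNSIGNED and friends as pointers to runtime objects (visible in the cast in the error above), so they cannot appear in a constant expression; dropping constexpr, e.g. the following, is enough:

// const (or no qualifier) instead of constexpr; OpenMPI's datatype handles are runtime pointers.
const MPI_Datatype array_of_types[count] = {MPI_UNSIGNED, MPI_UNSIGNED, MPI_UNSIGNED, MPI_CHAR};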
Do you know which MPI / GCC / CUDA you compiled with previously where this worked?
We probably also need to pin down the oldest MPI we would support.
Do you know which MPI / GCC / CUDA you compiled with previously where this worked?
Would be whatever my bashrc on waimu defaults to, I guess.
as running it with MPI enabled means its internal validation doesn't work.
That's why there's a disable mpi config option ;)
[100%] Built target ensemble
rob@waimea:~/FLAMEGPU2/build$ mpirun -n 2 bin/Debug/ensemble
CUDAEnsemble completed 100 runs successfully!
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 450, calculated init 450
Ensemble result: 40244135200, calculated result 40244135200
Ensemble init: 450, calculated init 450
Ensemble result: 40244135200, calculated result 40244135200
To be discussed:
- --no-mpi? (Pete thinks it's redundant.) Agreed.
- MPI-file-specific CI to test multiple MPI versions. I've created a new MPI workflow; however, as expected, it fails to install specific versions of mpich/openmpi via apt-get. Will wait for @ptheywood's return to advise on the best method to make them available (my natural next step would be to build from source).
- Add a SLURM script to the docs.
I have semi-successfully run the MPI test suite on Bede.
#!/bin/bash
# Generic options:
#SBATCH --account=bdshe03 # Run job under project <project>
#SBATCH --time=0:10:0 # Run for a max of 10 mins
# Node resources:
#SBATCH --partition=gpu # Choose either "gpu" or "infer" node type
#SBATCH --nodes=2 # Resources from a two nodes
#SBATCH --gres=gpu:1 # 1 GPUs per node
# Run commands:
# 1ppn == 1 process per node
bede-mpirun --bede-par 1ppn ./build/bin/Release/tests_mpi
This produces the following intermingled log:
Running main() from /users/robadob/fgpu2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN ] TestMPIEnsemble.success
Running main() from /users/robadob/fgpu2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN ] TestMPIEnsemble.success
[gpu030.bede.dur.ac.uk:696907] pml_ucx.c:291 Error: Failed to create UCP worker
[gpu031.bede.dur.ac.uk:2817740] pml_ucx.c:291 Error: Failed to create UCP worker
[ OK ] TestMPIEnsemble.success (56576 ms)
[ RUN ] TestMPIEnsemble.success_verbose
[ OK ] TestMPIEnsemble.success (56500 ms)
[ RUN ] TestMPIEnsemble.success_verbose
[ OK ] TestMPIEnsemble.success_verbose (50446 ms)
[ RUN ] TestMPIEnsemble.error_off
[ OK ] TestMPIEnsemble.success_verbose (50446 ms)
[ RUN ] TestMPIEnsemble.error_off
[ OK ] TestMPIEnsemble.error_off (50467 ms)
[ RUN ] TestMPIEnsemble.error_slow
[ OK ] TestMPIEnsemble.error_off (50467 ms)
[ RUN ] TestMPIEnsemble.error_slow
[ OK ] TestMPIEnsemble.error_slow (50463 ms)
[ RUN ] TestMPIEnsemble.error_fast
[ OK ] TestMPIEnsemble.error_slow (50462 ms)
[ RUN ] TestMPIEnsemble.error_fast
[ OK ] TestMPIEnsemble.error_fast (6057 ms)
[----------] 5 tests from TestMPIEnsemble (214011 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (214011 ms total)
[ PASSED ] 5 tests.
[ OK ] TestMPIEnsemble.error_fast (6057 ms)
[----------] 5 tests from TestMPIEnsemble (213934 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (213934 ms total)
[ PASSED ] 5 tests.
Of note:
- The bede-mpirun command takes custom args along with the usual mpirun args. Is this standard? It kind of harms our plan to provide a sample SLURM script. The docs suggest the mpirun equivalent is either -N, -npernode or --npernode, or simply -pernode/--pernode, which is the equivalent of npernode=1.
- Using --pernode rather than --bede-par 1ppn still reported the UCP failure. Additionally, the tests failed on one worker because the output stream was not empty. Nothing obvious in the log as to the cause; it looks identical to previous runs besides the fail msg. Repeating this run, the failure persisted with an additional failure (errcount 1, expected 0). Unclear whether this is because mpirun is being used rather than bede-mpirun, or something else.

Output from the updated ensemble example using mpirun -n 2 on waimu (I didn't do any special hacks to make sure GPUs are unique, but it still worked):
rob@waimea:~/FLAMEGPU2/build/bin/Debug$ mpirun -n 2 ./ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 218, calculated init 218
Ensemble result: 22162315712, calculated result 22162315712
Local MPI runner completed 51/100 runs.
Ensemble init: 232, calculated init 232
Ensemble result: 18081819488, calculated result 18081819488
Local MPI runner completed 49/100 runs.
- Does using mpirun rather than bede-mpirun make sense?
- Document how to run the test suite with mpirun.
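Assuming the OpenMPI flags quoted above behave as documented, the plain-mpirun equivalent of the bede-mpirun line in the SLURM script above would be something like:

# 1 process per node, within the same 2-node SLURM allocation (OpenMPI syntax)
mpirun --npernode 1 ./build/bin/Release/tests_mpi
# or equivalently
mpirun --pernode ./build/bin/Release/tests_mpi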
Suggest trying mvapich2 from the Bede docs.
When built with mvapich2 and executed using bede-mpirun, this error is received (~3 attempts with slight changes):
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Google suggests it's an MPI misconfiguration problem.
The mvapich2 mpirun docs aren't great, and it doesn't appear to have a convenient parameter like openmpi's for 1 process per node. I can't work out the commands required to bypass bede-mpirun.
I've added a warning in CMake if using MPI and CMake < 3.20.1 to address #1114, which outputs the following.
CMake Warning at src/CMakeLists.txt:569 (message):
CMake < 3.20.1 may result in link errors with FLAMEGPU_ENABLE_MPI=ON for
some MPI installations. Consider using CMake >= 3.20.1 to avoid linker
errors.
The Stanage znver3 modules (i.e. for the GPU nodes) include OpenMPI, so we can use Stanage's A100 and H100 nodes for an x86_64 single-node (up to 4 GPU) MPI bench as it currently stands.
From an A100/H100 node, the following successfully configured and compiled (run from an H100 node):
module load OpenMPI/4.1.4-GCC-11.3.0 GCC/11.3.0 CUDA/11.8.0 CMake/3.24.3-GCCcore-11.3.0
mkdir -p build-11-8-mpi
cd build-11-8-mpi
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_BUILD_TESTS=ON -DFLAMEGPU_ENABLE_MPI=ON
cmake --build . --target tests_mpi -j `nproc`
I then ran the MPI test suite using a single rank, with a single GPU in my interactive session
mpirun -n 1 bin/Release/tests_mpi
This ran successfully, but took quite a while. Might be worth toning these tests down so they don't take as long when only using a single rank?
Running main() from /users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN ] TestMPIEnsemble.success
[ OK ] TestMPIEnsemble.success (101671 ms)
[ RUN ] TestMPIEnsemble.success_verbose
[ OK ] TestMPIEnsemble.success_verbose (101131 ms)
[ RUN ] TestMPIEnsemble.error_off
[ OK ] TestMPIEnsemble.error_off (100341 ms)
[ RUN ] TestMPIEnsemble.error_slow
[ OK ] TestMPIEnsemble.error_slow (100338 ms)
[ RUN ] TestMPIEnsemble.error_fast
[ OK ] TestMPIEnsemble.error_fast (10333 ms)
[----------] 5 tests from TestMPIEnsemble (413816 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (413817 ms total)
[ PASSED ] 5 tests.
Trying to run with 4 processes on a single GPU and 20 CPU cores of an H100 node requires --oversubscribe due to the current Stanage configuration (only one MPI slot available). That configuration skips a number of tests due to the potential stall, and mpirun then reports an error because the Google Test processes which skip return a non-zero exit code.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
/users/ABCDEFG/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:109: Skipped
Skipping single-node MPI test, world size (4) exceeds GPU count (1), this would cause test to stall.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[33200,1],0]
Exit code: 1
--------------------------------------------------------------------------
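For reference, a sketch of the kind of guard that produces the skip messages above; the real check lives in test_mpi_ensemble.cu, and the test/variable names here are assumed:

#include <mpi.h>
#include <cuda_runtime.h>
#include <gtest/gtest.h>

TEST(TestMPIEnsembleSketch, skipWhenOversubscribed) {
    int world_size = 0, device_count = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    cudaGetDeviceCount(&device_count);
    if (world_size > device_count) {
        // Skip rather than stall when a single node has more ranks than visible GPUs.
        GTEST_SKIP() << "Skipping single-node MPI test, world size (" << world_size
                     << ") exceeds GPU count (" << device_count << "), this would cause test to stall.";
    }
    // ... the actual ensemble test body would follow here ...
}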
This ran successfully, but took quite a while. Might be worth toning these tests down so they don't take as long when only using a single rank?
I think the Bede runs with 2 GPUs were taking ~3 minutes, so 1 GPU should be taking ~6 minutes. Should be trivial to add a constant to scale the time, I guess.
Re: CI, most MPI installs don't provide binary packages, and most Linux distros only package a single version of each MPI implementation.
I'm prototyping GitHub Actions step(s) which install MPI from apt if specified, or from source otherwise, elsewhere (to avoid spamming long-running CI by pushing to this branch). Once it's sorted I'll add it to this branch.
MPI CI is passing for all OpenMPI versions, and for MPICH built from source.
MPICH from apt is failing at link time. This appears to be related to -flto, which is enabled for host object compilation (and passed to the host compiler for CUDA objects) but is not being passed at link time. This is the only build which is adding -flto, so presumably it's an implicit option coming from the MPICH installation somehow. I may be able to repro this locally.
Installing libmpich-dev on my Ubuntu 22.04 install reproduces the error.
The MPICH distributed via Ubuntu/Debian is the source of the LTO flags, as shown by the following:
$ mpicxx -compile-info
g++ -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -I/usr/include/x86_64-linux-gnu/mpich -L/usr/lib/x86_64-linux-gnu -lmpichcxx -lmpich
Not yet sure on how to resolve this in a way that will allow this build for end users.
Within my CMakeCache.txt, these flags are in MPI_CXX_COMPILE_OPTIONS, an advanced internal cache variable:
MPI_CXX_COMPILE_OPTIONS:STRING=-flto=auto;-ffat-lto-objects;-flto=auto
Potentially I could configure with this explicitly set to nothing, i.e. -DMPI_CXX_COMPILE_OPTIONS="", which may work, but that might break other MPI installations, so it isn't something we'd want to programmatically set. It might also still not result in usable binaries if the flag is actually required.
Edit: Configuring with -DMPI_CXX_COMPILE_OPTIONS="" successfully builds with my local MPI build, and test_mpi appears to run successfully (200s in so far).
I could probably detect this at configure time and warn/error about it in CMake.
Otherwise, enabling LTO the "proper CMake" way (INTERPROCEDURAL_OPTIMIZATION) might work (I'm skeptical), but it might also enable device LTO.
It is possible to detect and warn about -flto at configuration time, e.g. in src/CMakeLists.txt after the 3.20.1 warning:
# If the MPI installation brings in -flto (i.e. Ubuntu 22.04 libmpich-dev), warn about it and suggest a reconfiguration.
if(MPI_CXX_COMPILE_OPTIONS MATCHES "-flto")
  message(WARNING
    " MPI_CXX_COMPILE_OPTIONS contains '-flto' which is likely to result in linker errors. \n"
    " Consider an alternate MPI implementation which does not embed -flto,\n"
    " Or reconfiguring CMake with -DMPI_CXX_COMPILE_OPTIONS=\"\" if linker errors occur.")
endif()
$ cmake ..
-- -----Configuring Project: flamegpu-----
-- CUDA Architectures: 86
-- RapidJSON found. Headers: /home/ptheywood/code/flamegpu/FLAMEGPU2/build-mpi/_deps/rapidjson-src/include
-- flamegpu version 2.0.0-rc.1+eed4987d
CMake Warning at src/CMakeLists.txt:576 (message):
MPI_CXX_COMPILE_OPTIONS contains '-flto' which is likely to result in linker errors.
Consider an alternate MPI implementation which does not embed -flto,
Or reconfiguring CMake with -DMPI_CXX_COMPILE_OPTIONS="" if linker errors occur.
This won't fix CI though.
I've been back through the changes since I last reviewed this, leaving comments as necessary. Will aim to do some more testing of this on a multi-node system (Bede) tomorrow if possible, and get my head around the potential stall situations, just to make sure I understand it and it's not a wider problem.
Just a few more things that need small tweaks that hadn't been addressed yet otherwise (making the tests faster etc).
From the todo list for this PR:
Should we expose world_size/rank to HostAPI? (I don't think it's necessary)
No, this is accessible by MPI for anyone that really wants / needs it
We should probably check this works from python too though prior to merge, and maybe add a test case for that.
Just a few more things that need small tweaks that hadn't been addressed yet otherwise (making the tests faster etc).
I think I've addressed all your points that came through my emails.
We should probably check this works from python too though prior to merge, and maybe add a test case for that.
I can probably get to that Friday.
Testing on Stanage using an interactive 2x A100 session:
- --oversubscribe works, and the count is correct. Unclear which ranks use which GPUs, however.
- mpirun -n 2: runs all tests twice independently, rather than running each test once with 2 ranks. This might be a Google Test limitation, so one non-Google-Test test per binary and a ctest run might be the solution.
- mpirun -n 1: runs, with several test failures.
Commit hash d529d6e81d7e6e4a7dfab56e4b235052dfc3700e
On a Stanage, using 1/2 of an a100 node (2 GPUs, 24 cores) interactively :
# Get an interactive session
srun --partition=gpu --qos=gpu --gres=gpu:a100:2 --mem=164G --cpus-per-task 24 --pty bash -i
Then in the interactive job (i.e. on an AMD cpu core, not the login intel CPU)
# Load Dependencies
module load CUDA/11.8.0 OpenMPI/4.1.4-GCC-11.3.0 GCC/11.3.0 CMake/3.24.3-GCCcore-11.3.0
# Build dir
mkdir -p build-mpi-cu118-gcc-113-ompi414
cd build-mpi-cu118-gcc-113-ompi414
# Configure
cmake .. -DCMAKE_CUDA_ARCHITECTURES="80;90" -DFLAMEGPU_ENABLE_MPI=ON -DFLAMEGPU_ENABLE_NVTX=ON -DFLAMEGPU_BUILD_TESTS=ON
# Compile
cmake --build . -j `nproc`
Running the ensemble example without explicitly using mpirun fails with an MPI error. This could be an MPI configuration thing, but ideally the binary should be usable without MPI, and just not use MPI if not requested (I'm not 100% sure this is possible for all MPI applications, given mpirun is non-standard, so detecting whether MPI is requested might not be possible all the time).
$ ./bin/Release/ensemble
[gpu04.pri.stanage.alces.network:33719] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[gpu04.pri.stanage.alces.network:33719] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
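As noted above there is no standard way to detect this, but a common (launcher-specific, non-portable) heuristic is to look for environment variables set by the launcher; a sketch, assuming OpenMPI or an MPICH/Hydra-style launcher:

#include <cstdlib>

// Guess whether this process was started by an MPI launcher. OpenMPI exports
// OMPI_COMM_WORLD_SIZE; MPICH/Hydra exports PMI_SIZE. Other launchers differ.
bool launched_via_mpi() {
    return std::getenv("OMPI_COMM_WORLD_SIZE") != nullptr
        || std::getenv("PMI_SIZE") != nullptr;
}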
Using mpirun (or mpiexec) with a single rank runs successfully, but the completed message is missing a run (or more, seen 95 as well)
$ mpirun -n 1 ./bin/Release/ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 441, calculated init 441
Ensemble result: 40144235200, calculated result 40144235200
Local MPI runner completed 99/100 runs.
Using 2 ranks (2 GPUs) errors with the Stanage configuration from an interactive job, as MPI only believes one slot is available. This might just be how I requested the job. However, using the suggested options it can be run (see below).
$ mpirun -n 2 ./bin/Release/ensemble
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
./bin/Release/ensemble
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
Using --oversubscribe works, and the MPI runner counts add up.
$ mpirun --oversubscribe -n 2 ./bin/Release/ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 213, calculated init 213
Ensemble result: 17859919488, calculated result 17859919488
Local MPI runner completed 45/100 runs.
Ensemble init: 237, calculated init 237
Ensemble result: 22384215712, calculated result 22384215712
Local MPI runner completed 55/100 runs.
Using 4 processes and --oversubscribe also works, even though there are only 2 GPUs.
$ time mpiexec --oversubscribe -n 4 ./bin/Release/ensemble
CUDAEnsemble completed 100 runs successfully!
Ensemble init: 126, calculated init 128
Ensemble result: 11977340560, calculated result 11977540560
Local MPI runner completed 29/100 runs.
Ensemble init: 70, calculated init 69
Ensemble result: 6409861632, calculated result 6409761632
Local MPI runner completed 16/100 runs.
Ensemble init: 141, calculated init 140
Ensemble result: 11424857856, calculated result 11424757856
Local MPI runner completed 28/100 runs.
Ensemble init: 113, calculated init 113
Ensemble result: 10432075152, calculated result 10432075152
Local MPI runner completed 27/100 runs.
real 0m4.540s
user 0m2.145s
sys 0m16.240s
The test suite does not behave as intended when running with mpirun; it just runs all the tests twice with one rank each, rather than running each test once using multiple processes. This might be a Google Test limitation, in which case we might need to make each MPI test its own binary and orchestrate them via ctest (we can use categories to make it easy to just run the MPI tests).
$ mpirun --oversubscribe -n 2 ./bin/Release/tests_mpi
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN ] TestMPIEnsemble.success
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN ] TestMPIEnsemble.success
[ OK ] TestMPIEnsemble.success (10767 ms)
[ RUN ] TestMPIEnsemble.success_verbose
[ OK ] TestMPIEnsemble.success (10772 ms)
[ RUN ] TestMPIEnsemble.success_verbose
[ OK ] TestMPIEnsemble.success_verbose (10029 ms)
[ RUN ] TestMPIEnsemble.error_off
[ OK ] TestMPIEnsemble.success_verbose (10029 ms)
[ RUN ] TestMPIEnsemble.error_off
[ OK ] TestMPIEnsemble.error_off (10030 ms)
[ RUN ] TestMPIEnsemble.error_slow
[ OK ] TestMPIEnsemble.error_off (10031 ms)
[ RUN ] TestMPIEnsemble.error_slow
[ OK ] TestMPIEnsemble.error_slow (10030 ms)
[ RUN ] TestMPIEnsemble.error_fast
[ OK ] TestMPIEnsemble.error_slow (10030 ms)
[ RUN ] TestMPIEnsemble.error_fast
[ OK ] TestMPIEnsemble.error_fast (6020 ms)
[----------] 5 tests from TestMPIEnsemble (46879 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (46879 ms total)
[ PASSED ] 5 tests.
[ OK ] TestMPIEnsemble.error_fast (6020 ms)
[----------] 5 tests from TestMPIEnsemble (46883 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (46883 ms total)
[ PASSED ] 5 tests.
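Regarding the ctest orchestration idea above, a rough CMake sketch; the test name, gtest filter and rank count are placeholders, while MPIEXEC_EXECUTABLE and MPIEXEC_NUMPROC_FLAG come from CMake's FindMPI:

# One ctest entry per MPI test case, each launched under mpirun/mpiexec.
add_test(NAME TestMPIEnsemble.success
         COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2
                 $<TARGET_FILE:tests_mpi> --gtest_filter=TestMPIEnsemble.success)
# Label it so "ctest -L mpi" runs only the MPI tests.
set_tests_properties(TestMPIEnsemble.success PROPERTIES LABELS "mpi")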
Using a single rank, tests fail:
$ mpirun -n 1 ./bin/Release/tests_mpi
Running main() from /users/ac1phey/code/flamegpu/FLAMEGPU2/tests/helpers/main.cu
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestMPIEnsemble
[ RUN ] TestMPIEnsemble.success
[ OK ] TestMPIEnsemble.success (5951 ms)
[ RUN ] TestMPIEnsemble.success_verbose
[ OK ] TestMPIEnsemble.success_verbose (5105 ms)
[ RUN ] TestMPIEnsemble.error_off
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:201: Failure
Expected equality of these values:
err_count
Which is: 0
1u
Which is: 1
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:204: Failure
Value of: errors.find("Warning: Run 10 failed on rank ") != std::string::npos
Actual: false
Expected: true
[ FAILED ] TestMPIEnsemble.error_off (5106 ms)
[ RUN ] TestMPIEnsemble.error_slow
/users/ac1phey/code/flamegpu/FLAMEGPU2/tests/test_cases/simulation/test_mpi_ensemble.cu:232: Failure
Value of: errors.find("Warning: Run 10 failed on rank ") != std::string::npos
Actual: false
Expected: true
[ FAILED ] TestMPIEnsemble.error_slow (5107 ms)
[ RUN ] TestMPIEnsemble.error_fast
[ FAILED ] TestMPIEnsemble.error_fast (5106 ms)
[----------] 5 tests from TestMPIEnsemble (26377 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (26377 ms total)
[ PASSED ] 2 tests.
[ FAILED ] 3 tests, listed below:
[ FAILED ] TestMPIEnsemble.error_off
[ FAILED ] TestMPIEnsemble.error_slow
[ FAILED ] TestMPIEnsemble.error_fast
3 FAILED TESTS
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[56666,1],0]
Exit code: 1
--------------------------------------------------------------------------
Re: incorrect progress. I think this is a consequence of the reporting logic:
const int progress = static_cast<int>(next_run) - static_cast<int>(TOTAL_RUNNERS * world_size);
With a single MPI rank it would never report higher than TOTAL_RUNS - TOTAL_RUNNERS, as it reports the ID of the newly assigned job minus the number of runners, but it jumps to the exit path as soon as next_run exceeds TOTAL_RUNS.
Regardless, I've changed it so the printf instead reports which job index has been assigned to which rank (and got rid of the \r).
I've reworked progress printing, fixed bugs that were causing temperamental test failures, added a few extra tests for better coverage, and added a single pyflamegpu-capable MPI test (i.e. one that is very limited, but won't break if run via MPI).
We've tested this on Bede again, using a few different MPI/GCC combos over 2 nodes, with both the tests_mpi and ensemble targets.
I'll try to get the other changes reviewed soon so we can merge this.
Code generally looks good now, still need to re-run things in a distributed setting etc.
One final idea for a change is to use MPI_Comm_split_type within the CUDA ensemble to auto-assign GPUs for multiple ranks in a single multi-GPU node, much like how this is handled in the existing tests to support single-rank testing (see the sketch below).
This is mainly so that systems where it is not possible (or not easy) to request 1 MPI rank per node can be supported. Using 1 rank per node would still (likely) give the best performance.
(This would also make for an interesting benchmark / would let some benchmarking scenarios be covered.)
I might do this myself (leaving the comment so I don't forget).
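A minimal sketch of the idea, assuming nothing about FLAMEGPU internals: MPI_Comm_split_type with MPI_COMM_TYPE_SHARED yields a per-node communicator, whose local rank can be used to pick a device.

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Split MPI_COMM_WORLD into one communicator per physical node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank, MPI_INFO_NULL, &node_comm);
    int local_rank = 0, local_size = 0;
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);

    // Give each node-local rank its own device; surplus ranks would idle.
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    if (local_rank < device_count) {
        cudaSetDevice(local_rank);
        printf("world rank %d -> node-local rank %d/%d -> GPU %d\n",
               world_rank, local_rank, local_size, local_rank);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}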
I've prototyped creating a new MPI communicator which will only involve MPI ranks that have been assigned a GPU, to prevent errors if users request more MPI ranks per node than GPUs are available, with the surplus ranks doing nothing.
E.g.
// local_rank is this rank's index within its node; world_rank its index in MPI_COMM_WORLD.
int max_participating_ranks_per_node = 3;  // pretend each node has 3 GPUs / 3 participating MPI ranks
int color = local_rank < max_participating_ranks_per_node ? 0 : 1;
MPI_Comm comm_participating;
if (MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comm_participating) != MPI_SUCCESS) {
    fprintf(stderr, "Error creating communicator\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
}
int participating_size, participating_rank;
MPI_Comm_size(comm_participating, &participating_size);
MPI_Comm_rank(comm_participating, &participating_rank);
Launching with 4 ranks on each of 2 nodes, pretending there are 3 GPUs each, then shows the correct comms.
main rank 0: recieving (6 - 1) messages?
messages from: 2, 1, 4, 5, 6,
MPI tests in debug builds are currently failing on mavericks due to captured output not matching what the test contains. This was the case prior to my device selection changes but hadn't been noticed (probably just hadn't been run in debug mode for a while).
More-ranks-than-GPUs ensemble example runs are working, but the test suite is not behaving (I've likely broken assumptions in the tests now I've removed the check on the number of MPI ranks / device setting within the test suite).
Also not yet implemented what to do when devices are specified.
Release test suite all sorted now.
@Robadob - if you could skim the commits I've added to make sure you're happy enough with it that'd be appreciated. I've not touched the user-facing interface at all (other than changing implicit behaviour, that will need a tweak to the docs api).
Still need to:
- Update conftest.py so it doesn't send telemetry from each rank, but that will be grim.
Otherwise it should be there now (I think I've got all my debug etc. removed).
Test suite is fixed; the Python test suite is crashing in test_logging.py (and maybe others) for an MPI build. Had missed 2 printfs too.
Doing a non-MPI build to figure out whether it's MPI-build specific before attempting further debugging (I have a feeling there's a breaking change in here to do with a method return type, which is probably the culprit).
Given the absence of a python_mpi tests sub suite, this might be a bit of effort. Might need to change conftest stuff for telemetry too.
5 pytest failures in the non-MPI build fail the Python test suite, and need resolving.
2 have been fixed elsewhere, so that just needs a rebase. Are the log-related ones an API break?
FAILED ../tests/python/codegen/test_codegen_integration.py::GPUTest::test_gpu_codegen_function_condition - AttributeError: '_SpecialForm' object has no attribute '__name__'
FAILED ../tests/python/codegen/test_codegen_integration.py::GPUTest::test_gpu_codegen_simulation - AttributeError: '_SpecialForm' object has no attribute '__name__'
FAILED ../tests/python/io/test_logging.py::LoggingTest::test_CUDAEnsembleSimulate - AttributeError: 'int' object has no attribute 'getStepLog'
FAILED ../tests/python/simulation/test_cuda_ensemble.py::TestCUDAEnsemble::test_setExitLog - AttributeError: 'int' object has no attribute 'getExitLog'
FAILED ../tests/python/simulation/test_cuda_ensemble.py::TestCUDAEnsemble::test_setStepLog - AttributeError: 'int' object has no attribute 'getStepLog'
Tests now fixed by skipping, exposing pyflamegpu.MPI to indicate whether MPI was enabled for the build or not.
Ran on bede using
Will remove the ternary and associated checks for it being nullptr'd tomorrow, then this should be good to merge I think (plus update the docs PR to reflect the change from requiring 1 rank per node to allowing up to 1 rank per GPU, or a warning will be given).
Tweaked the final suggestions, so I think this is good to go now.
I'm not sure Rob can review this PR though, so I will have to approve (or Paul)?
The implementation of MPI ensembles within this PR is designed for each CUDAEnsemble to have exclusive access to all the GPUs available to it (or those specified with the devices config). Ideally a user will launch 1 MPI worker per node; however, it could be 1 worker per GPU per node. It would be possible to use MPI shared-memory groups to identify workers on the same node and negotiate division of GPUs, and/or for some workers to become idle, however this has not been implemented.
Full notes of identified edge cases are in the below todo list.
- CUDAEnsemble RunLog, with those not handled empty of data (checking step counter > 0 is a hack; this presumably also affects failed runs under normal error modes)
- RunPlan (it was initially assumed that vector would be parallel with the RunPlanVector used as input). [Agreed with Paul 2023-07-26]
- Rank 0 gets all the logs, others get empty
- All runners get all logs
- Catch and handle local MPI runs inside CUDAEnsemble? (we want users to avoid using multiple MPI runners that have access to the same GPU)
- Should we expose world_size/rank to HostAPI? (I don't think it's necessary)
- CudaEnsemble::Config().mpi=false;
- Do we need to handle a race condition with the RTC cache?

Closes #1073 Closes #1114
Edit for visibility (by @ptheywood): it needs to be clear in the next release notes that this includes a breaking change to the return type of CUDAEnsemble::getLogs, from a std::vector to a std::map.
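A sketch of what that break means for calling code; the exact key/value types are assumptions, so check the API docs (ensemble here is a configured flamegpu::CUDAEnsemble that has already run):

// Previously: const std::vector<flamegpu::RunLog> &logs = ensemble.getLogs();
// Now the logs come back keyed by run plan index, so only the runs handled
// (or populated) on this rank need to be inspected.
const auto &logs = ensemble.getLogs();  // assumed: std::map<unsigned int, flamegpu::RunLog>
for (const auto &[plan_index, run_log] : logs) {
    // process run_log for the run plan at plan_index
}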