ptheywood opened this issue 3 years ago
Is there any reason we can't just have python and c++ in the same docker image? How much additional size is it really going to add?
I'm doing a benchmark to compare FLAME GPU 2 with mesa-frames (cc: @adamamer20), and have prepared a Dockerfile for it. I will turn it into a PR when I have the time.
This was generated by GPT-4o and then I fixed several missing dependencies.
# Use an official CUDA base image
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
# Install dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
python3 \
python3-pip \
libgl1-mesa-dev \
libglew-dev \
freeglut3-dev \
xorg-dev \
swig \
patchelf \
curl \
&& rm -rf /var/lib/apt/lists/*
# Set up Python environment
RUN python3 -m pip install --upgrade pip setuptools wheel build
# Clone the FLAMEGPU2 repository
RUN git clone https://github.com/FLAMEGPU/FLAMEGPU2.git /flamegpu2
# Set the working directory
WORKDIR /flamegpu2
# Checkout the desired branch (e.g., master or a specific version)
RUN git checkout master
# Create and build the project using CMake
RUN mkdir -p build && cd build && \
cmake .. -DCMAKE_BUILD_TYPE=Release -DFLAMEGPU_BUILD_PYTHON=ON && \
cmake --build . --target flamegpu boids_bruteforce -j 8
# Optional: Install Python bindings (if needed)
# RUN cd build && make install && pip3 install ./lib/python
# Set up entry point (if required)
ENTRYPOINT ["tail", "-f", "/dev/null"]
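A minimal sketch of building and running this image, assuming the NVIDIA Container Toolkit is installed on the host (the image tag is arbitrary):
$ docker build -t flamegpu2-bench .   # build from the Dockerfile above
$ docker run --rm --gpus all -it flamegpu2-bench /bin/bash   # interactive shell with GPU access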
One thing I am not sure about with the cmake command: is -DCMAKE_CUDA_ARCHITECTURES=61 necessary to optimize the result further? Am I missing a lot if I compile without a specific target architecture?
This is what I get on an NVIDIA V100. The result doesn't make sense to me: why is it almost the same for 1 million and 16 million agents?
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.543342,6158.32
0,512,262144,0.17,0.537335,25022.8
0,768,589824,0.17,0.537974,57078.9
0,1024,1048576,0.17,0.53916,101629
0,1280,1638400,0.17,0.541711,159002
0,1536,2359296,0.17,0.546862,230722
0,1792,3211264,0.17,0.549065,314442
0,2048,4194304,0.17,0.552074,409362
0,2304,5308416,0.17,0.56062,518971
0,2560,6553600,0.17,0.567626,640464
0,2816,7929856,0.17,0.570724,774681
0,3072,9437184,0.17,0.580403,922129
0,3328,11075584,0.17,0.57991,1.08285e+06
0,3584,12845056,0.17,0.588218,1.25571e+06
0,3840,14745600,0.17,0.588479,1.44259e+06
0,4096,16777216,0.17,0.59788,1.64187e+06
1,256,65536,0.17,0.553675,6294.4
1,512,262144,0.17,0.554435,25029.3
1,768,589824,0.17,0.555797,57294.6
1,1024,1048576,0.17,0.555232,101536
1,1280,1638400,0.17,0.557549,160115
1,1536,2359296,0.17,0.56029,231049
1,1792,3211264,0.17,0.560761,314853
1,2048,4194304,0.17,0.563404,410717
1,2304,5308416,0.17,0.567223,520563
1,2560,6553600,0.17,0.569116,641398
1,2816,7929856,0.17,0.57899,776075
1,3072,9437184,0.17,0.577088,925085
# Create and build the project using CMake
RUN mkdir -p build && cd build && \
cmake .. -DCMAKE_BUILD_TYPE=Release -DFLAMEGPU_BUILD_PYTHON=ON && \
cmake --build . --target flamegpu boids_bruteforce -j 8
For peak performance/benchmarking, 'seatbelts' (runtime error checking) should also be disabled via -DFLAMEGPU_SEATBELTS=OFF.
This is what I get on an NVIDIA V100. The result doesn't make sense to me: why is it almost the same for 1 million and 16 million agents?
@ptheywood can probably advise, but that table of results does seem sus at a glance, especially if it's a brute force model.
Sorry for not stating which benchmark I did. It was the Sugarscape IG to reproduce the paper result.
is -DCMAKE_CUDA_ARCHITECTURES=61 necessary to optimize the result further?
For V100, I think that should actually be 70; 61 is Pascal.
In the grand scheme of things, so long as it's less than or equal to (and compiles/runs) it should be fine, although newer is typically preferred. Similarly, you can also build for multiple architectures, and CUDA should pick the newest at runtime, although that inflates the binary size and compile time as all device code is duplicated for each architecture requested.
I've seen performance both get better and get worse when compiling for earlier CUDA architectures. All it does is enable/disable whether the compiler utilises certain architectural features. So if the code includes hyper-modern functions, it might not build if compiled for earlier architectures. But typical CUDA code is just going to see some statements compiled to slightly different instructions, which may be faster or may be slower (it depends on a lot of unknown variables).
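As an illustration (the architecture values here are just examples, not a recommendation), a multi-architecture configure of the Dockerfile's build step would look something like:
$ cmake .. -DCMAKE_BUILD_TYPE=Release -DFLAMEGPU_BUILD_PYTHON=ON -DCMAKE_CUDA_ARCHITECTURES="70;80"
$ cmake --build . --target flamegpu boids_bruteforce -j 8
Device code is duplicated for each entry in the list, and CUDA picks the newest compatible variant at runtime.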
I see, thank you for the informative architecture configuration.
Update: I ran cmake with -DCMAKE_CUDA_ARCHITECTURES=70 and -DFLAMEGPU_SEATBELTS=OFF, and I still got ~0.5s per step.
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.557127,6158.32
0,512,262144,0.17,0.550818,25022.8
0,768,589824,0.17,0.554487,57078.9
0,1024,1048576,0.17,0.555948,101629
My lshw -C display output:
description: 3D controller
product: GV100GL [Tesla V100 PCIe 16GB]
vendor: NVIDIA Corporation
Update: I ran cmake with -DCMAKE_CUDA_ARCHITECTURES=70 and -DFLAMEGPU_SEATBELTS=OFF, and I still got ~0.5s per step.
I rebuilt both the Docker image (so that FLAME GPU 2 is compiled with these flags) and the Sugarscape IG benchmark.
Currently the main FLAME GPU repo isn't really set up for install targets and then being found by CMake, which is what would make a generic flamegpu2 dockerfile useful (although it could be used to package pyflamegpu, as an alternative to installing from our pip wheelhouse).
I.e. the benchmark repos will all fetch their own version of flamegpu during configuration and build it as part of their own build.
This would need #260 to be worthwhile, although, as you've probably found, it is useful for installing dependencies.
From the dockerfile you've included above, it looks sensible enough other than the last two segments.
# Optional: Install Python bindings (if needed)
# RUN cd build && make install && pip3 install ./lib/python
There is no install target, and I think the pip install statement would need tweaking.
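A hedged sketch of what that tweak might look like: the pyflamegpu target follows from FLAMEGPU_BUILD_PYTHON=ON above, but the wheel name and location are assumptions, so the find is there to locate whatever the build actually produced:
$ cmake --build build --target pyflamegpu -j 8
$ find build -name "pyflamegpu*.whl" -exec python3 -m pip install {} \;   # wheel path/name is a guess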
# Set up entry point (if required)
ENTRYPOINT ["tail", "-f", "/dev/null"]
This entrypoint does nothing useful, it's just GPT-4o regurgitating things it doesn't understand.
-DCMAKE_CUDA_ARCHITECTURES generates CUDA code which is compatible with specific GPU architectures. If unspecified, FLAME GPU will build for all major architectures supported by the CUDA version; for CUDA 11.8 this would be 35;50;60;70;80;90 (Kepler to Hopper, although full Hopper support requires CUDA 12.0).
By specifying a single value, the compilation time and binary file size are reduced, but this restricts the GPUs which can run the code to the specified architecture and newer (and on newer architectures the first run will JIT-compile the embedded PTX). I.e. specifying 70 would allow Volta and newer GPUs to run, but features only present in Ampere and Hopper would not be used.
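If you're unsure which value matches a given GPU, recent drivers let nvidia-smi report the compute capability directly (the query field needs a reasonably new driver, and the mapping in the comment is just an example):
$ nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# e.g. "Tesla V100-PCIE-16GB, 7.0" corresponds to -DCMAKE_CUDA_ARCHITECTURES=70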
For the FLAMEGPU2-submodel-benchmark performance, I've done a fresh native build on our Titan V machine:
$ module load CUDA/11.8 # specific to the machine
$ cmake -B build-11.8 -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF
$ cmake --build build-11.8/ -j 8
$ cd build-11.8/
$ ./bin/Release/submodel-benchmark
I let this run for a few simulations from the performance_Scaling benchmark:
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.0933171,6158.32
0,512,262144,0.17,0.0923017,25022.8
0,768,589824,0.17,0.0933319,57078.9
0,1024,1048576,0.17,0.0950031,101629
0,1280,1638400,0.17,0.095181,159002
0,1536,2359296,0.17,0.0986446,230722
0,1792,3211264,0.17,0.099533,314442
and also re-ran the existing binary on our HPC machine with V100s, which was compiled using CUDA 11.0 and GCC 9.
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.000485079,6158.32
0,512,262144,0.17,0.00095145,25022.8
0,768,589824,0.17,0.00153795,57078.9
0,1024,1048576,0.17,0.00252234,101629
0,1280,1638400,0.17,0.00358018,159002
0,1536,2359296,0.17,0.0053722,230722
0,1792,3211264,0.17,0.00701561,314442
0,2048,4194304,0.17,0.00886571,409362
and then with a clean build, which shows similar performance:
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.000434862,6158.32
0,512,262144,0.17,0.000939674,25022.8
0,768,589824,0.17,0.00153737,57078.9
0,1024,1048576,0.17,0.00262005,101629
0,1280,1638400,0.17,0.00359381,159002
0,1536,2359296,0.17,0.0053471,230722
The difference between our V100 and Titan Vs is much larger than I'd expected, although a chunk of that may be down to the slightly different drivers and the presence of an X server on our Titan V machine.
Some of it could be due to power state, but I'd have only expected that to be a penalty for the first simulation at most.
I've tweaked the benchmark repo to only run a single simulation before completing, and enabled NVTX in flamegpu via -DFLAMEGPU_ENABLE_NVTX=ON at configure time. As this is the first sim it has a very small population, so the amount of GPU time will be relatively low.
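Roughly, that profiling setup would look something like the following (a sketch; the build directory and report names are illustrative):
$ cmake -B build-nvtx -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF -DFLAMEGPU_ENABLE_NVTX=ON
$ cmake --build build-nvtx -j 8
$ nsys profile -o submodel-benchmark-report ./build-nvtx/bin/Release/submodel-benchmark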
For the V100 this produces a sensible, very short timeline, with most simulation step nvtx ranges taking ~600us, which lines up with the reported time from this run (nsys does add some overhead, and the first few steps are slower due to how the model behaves):
0,256,65536,0.17,0.000696624,6158.32
However my Titan V machine took ~120ms per step, with most of the time spent not doing any GPU work but being blocked by a system entry. The actual portion of that 120ms where the GPU was doing something was ~1.4ms.
The reported time for the Titan V run was the 120ms, which lines up with the time seen during profiling (and without profiling was ~99ms).
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.125717,6158.32
The above screenshot shows the timeline for the final step of the simulation, with vastly differing timelines shown between the V100 at the top (~640us) and Titan V below (~120ms).
This is something we need to dig into more, to understand why this is happening on our Titan V machines and to see if it is impacting any of our other non-HPC machines.
It also might not be the same thing that is impacting your machine.
After noticing that the ensemble benchmark was still using FLAME GPU 2.0.0-rc, rather than the much more recent 2.0.0-rc.1 or current master, I thought I'd see if this was caused by a bug we'd previously fixed but forgotten about / that appeared unrelated.
I did this by changing the appropriate line of CMakeLists.txt from
set(FLAMEGPU_VERSION "v2.0.0-rc" CACHE STRING "FLAMEGPU/FLAMEGPU2 git branch or tag to use")
to
set(FLAMEGPU_VERSION "v2.0.0-rc.1" CACHE STRING "FLAMEGPU/FLAMEGPU2 git branch or tag to use")
and then configuring a fresh cmake build, using the same CUDA 11.0 and GCC 9 as before on our Titan V machine. This has reduced the nvtx trace to show the steps taking ~580us rather than ~125ms.
And without profiling it now reports much more sensible timings for our Titan V machine:
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.0004592,6157.67
0,512,262144,0.17,0.00103467,25034.7
0,768,589824,0.17,0.0018241,57058.4
0,1024,1048576,0.17,0.00307118,101641
0,1280,1638400,0.17,0.0044578,158976
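For reference, the version bump and rebuild amount to roughly the following (a sketch; the sed one-liner is just one way to edit the set(FLAMEGPU_VERSION ...) line, and the build directory name is illustrative):
$ sed -i 's/v2.0.0-rc"/v2.0.0-rc.1"/' CMakeLists.txt
$ cmake -B build-rc1 -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF
$ cmake --build build-rc1 -j 8
$ ./build-rc1/bin/Release/submodel-benchmark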
Having looked at the changelog, I've narrowed down the cause to a bug in our telemetry (https://github.com/FLAMEGPU/FLAMEGPU2/issues/1079), which was submitting a telemetry packet every time a submodel finished. I.e. each step was making a network request, which takes roughly the same duration each step, hence the lack of apparent scaling when the step duration is negligible compared to a network request.
This bug has been fixed in the main FLAMEGPU 2 repository, but not in the standalone benchmark repo, which we haven't updated.
For the results in the paper generated with rc0, I disabled the telemetry on our HPC system when running the benchmarks by configuring with -DFLAMEGPU_SHARE_USAGE_STATISTICS=OFF, so they didn't exhibit this problem.
@rht could you try again, doing one of the following, to see if it resolves the issue for you as well:
1. Run with the environment variable FLAMEGPU_SHARE_USAGE_STATISTICS set to OFF, i.e. FLAMEGPU_SHARE_USAGE_STATISTICS=OFF ./bin/Release/submodel-benchmark
2. Reconfigure with -DFLAMEGPU_SHARE_USAGE_STATISTICS=OFF, recompile and rerun (see the sketch after this list).
3. Update FLAMEGPU2-submodel-benchmark/CMakeLists.txt to 2.0.0-rc.1 as above, reconfigure (maybe with an entirely fresh build directory) and re-run.
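For reference, the reconfigure route (option 2) would look roughly like this; the build directory name is illustrative and the other flags just mirror the earlier benchmarking configuration:
$ cmake -B build-no-telemetry -DCMAKE_CUDA_ARCHITECTURES=70 -DFLAMEGPU_SEATBELTS=OFF -DFLAMEGPU_SHARE_USAGE_STATISTICS=OFF
$ cmake --build build-no-telemetry -j 8
$ ./build-no-telemetry/bin/Release/submodel-benchmark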
Thank you for finding the cause!
Run with the environment variable FLAMEGPU_SHARE_USAGE_STATISTICS set to OFF i.e. FLAMEGPU_SHARE_USAGE_STATISTICS=OFF ./bin/Release/submodel-benchmark
This one works! I see; I assume any of the 3 has the same effect, with the last one (2.0.0-rc.1) disabling the statistics by default?
It's almost the same as the V100 result on your machine.
repetition,grid_width,pop_size,p_occupation,s_step_mean,pop_count_mean
0,256,65536,0.17,0.000483422,6158.32
0,512,262144,0.17,0.000958801,25022.8
0,768,589824,0.17,0.00177816,57078.9
0,1024,1048576,0.17,0.00265805,101629
0,1280,1638400,0.17,0.0037753,159002
0,1536,2359296,0.17,0.00545013,230722
0,1792,3211264,0.17,0.00723409,314442
0,2048,4194304,0.17,0.00905549,409362
0,2304,5308416,0.17,0.0116754,518971
0,2560,6553600,0.17,0.0145468,640464
For context, @adamamer20 is trying to do fast vectorized ABM using pandas/Polars: https://github.com/adamamer20/mesa-frames/pull/71. It's not yet using the GPU, but GPU-based dataframes are in the works.
This one works! I see; I assume any of the 3 has the same effect, with the last one (2.0.0-rc.1) disabling the statistics by default?
The environment variable prevents executables with telemetry enabled from submitting telemetry at runtime.
The CMake configuration option prevents telemetry from being embedded in the binary at all (so you don't need to remember to set the environment variable).
The update to 2.0.0-rc.1 fixes a bug which caused a telemetry packet to be emitted for every step of the submodel benchmark (when a simulation completes it submits a telemetry packet, but that included submodels, which are run many times by the parent model, i.e. 100 times more than intended in the submodel benchmark model, which runs for 100 steps for the performance test).
We probably want to provide docker image(s) as one option for running flamegpu.
Based on other projects, we probably want to provide:
- docker/cuda/Dockerfile (or something to identify it as the C++ version?)
- docker/python3/Dockerfile
We will have to base these on NVIDIA dockerfiles to comply with the redistribution licence of libcuda.so.
It might also be worth separating images for using FLAME GPU from images for modifying FLAME GPU, i.e. provide -dev images which include all source and build artifacts, and other images which just contain the CUDA/C++ static lib and includes, and/or a docker image with a python wheel already installed.
There will likely be some limitations for visualisation via docker, i.e. the nvidia docker container runtime suggests that GLX is not available, and EGL must be used instead (source).
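A rough sketch of how such images might be built and smoke-tested; the paths follow the layout suggested above, while the tags and the test command are hypothetical rather than a settled design:
$ docker build -f docker/cuda/Dockerfile -t flamegpu2:cuda .
$ docker build -f docker/python3/Dockerfile -t flamegpu2:python3 .
# --gpus all requires the NVIDIA Container Toolkit on the host
$ docker run --rm --gpus all flamegpu2:python3 python3 -c "import pyflamegpu"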