vaiana opened 1 year ago
CPU comparisons are available in this private repo: https://github.com/agencyenterprise/ndk-research/blob/main/experiments/184354911-resources-benchmark-for-scenarios/184354911-resource_utilization_benchmark.ipynb
That data was used to estimate time/memory requirements for NDK users.
We might look into GPU comparisons as well, though those would run into memory limits.
As a preliminary CPU/GPU comparison, I ran `docs/examples/plot_scenarios.py`
on an AWS EC2 p3.2xlarge instance, which has an NVIDIA V100 GPU and 8 vCPUs. A more thorough investigation is in order, but the generated logs actually provide some good intuition about what kinds of speed-ups can be expected.
Stride's computations appear to take 3 major steps (the rows of the table below):

1. Generate the `acoustic_iso_state` operator
2. JIT-compile the generated C++ file
3. Run the operator
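For context, those three steps map onto Devito's `Operator` lifecycle. Below is a minimal toy Devito sketch (not NDK's actual `acoustic_iso_state` operator) just to show where the generate/compile/run split happens:

```python
from devito import Grid, TimeFunction, Eq, Operator

# Toy stencil on a small grid; NDK's acoustic_iso_state operator is far
# more involved, this only illustrates the three-step lifecycle.
grid = Grid(shape=(101, 81))
u = TimeFunction(name="u", grid=grid, time_order=2, space_order=2)

# Step 1: constructing the Operator triggers symbolic lowering and
# source-code generation ("Generate ... operator" in the logs).
op = Operator(Eq(u.forward, 2 * u - u.backward + 0.1 * u.laplace))

# Steps 2 and 3: the first apply() JIT-compiles the generated source
# file, then runs it; later applies reuse the compiled binary.
op.apply(time_M=100)
```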
For a given scenario, the times were consistent (±5%) across runs. For the CPU setup, I used Devito's automatic thread-count selection, which always came out to 4 threads.
| Scenario [grid shape] | Step | CPU (4 threads) | GPU |
|---|---|---|---|
| scenario-0-v0 [101, 81] | Generate `acoustic_iso_state` operator | 10.03s | 10.39s |
| | JIT-compile C++ file | 3.64s | 4.3s |
| | Run operator [11 GFlops] | 18 GFlops/s | 18 GFlops/s |
| scenario-1-2d-v0 [241, 141] | Generate `acoustic_iso_state` operator | 9.76s | 10.8s |
| | JIT-compile C++ file | 3.24s | 4.38s |
| | Run operator [42 GFlops] | 27 GFlops/s | 48 GFlops/s |
| scenario-2-2d-v0 [451, 351] | Generate `acoustic_iso_state` operator | 9.82s | 10.8s |
| | JIT-compile C++ file | 3.25s | 4.27s |
| | Run operator [202 GFlops] | 35 GFlops/s | 109 GFlops/s |
| scenario-1-3d-v0 [241, 141, 141] | Generate `acoustic_iso_state` operator | 23.3s | 23.66s |
| | JIT-compile C++ file | 6.9s | 8.53s |
| | Run operator [16 TFlops] | 21 GFlops/s | 610 GFlops/s |
From the table, we can see that:

- Compiling for CPU with `gcc` was ~1 second faster than compiling for GPU with `pgc++`. Compiling takes a similar amount of time across the different 2-D scenarios.
- For `scenario-2-2d-v0`, the GPU provided a ~3x speed-up (109 vs. 35 GFlops/s) in the final step: running the operator.

Raw logs:
- CPU: `env -u PLATFORM python docs/examples/plot_scenarios.py | tee cpu_log.txt` (attached: cpu_log.txt)
- GPU: `PLATFORM=nvidia-acc python docs/examples/plot_scenarios.py | tee gpu_log.txt` (attached: gpu_log.txt)
Note: I tried running the scenarios with different `points_per_period` settings to change the time resolution. This changed the number of GFlops for the Run operator step linearly, but did not seem to have any effect on the GFlops/s throughput. It seems the GPU only improves parallelization over the space dimensions.
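A rough sketch of that sweep, for reference. The `make_scenario` and `simulate_steady_state` names and the `points_per_period` keyword are assumptions about the NDK API; adjust to whatever the installed version actually exposes:

```python
import time

import neurotechdevkit as ndk

# Hypothetical sweep: double the time resolution and see whether the
# wall time of the simulation scales with the extra flops.
for ppp in (12, 24, 48):
    # make_scenario and points_per_period are assumed entry points;
    # check the NDK docs for the exact names.
    scenario = ndk.make_scenario("scenario-1-2d-v0")
    start = time.perf_counter()
    scenario.simulate_steady_state(points_per_period=ppp)
    elapsed = time.perf_counter() - start
    print(f"points_per_period={ppp}: {elapsed:.1f}s")
```

Comparing the wall times against the GFlops reported in the Devito logs for each setting would confirm whether throughput really stays flat.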
**Describe the new feature or enhancement**

It is possible to run NDK on a GPU, but it's not clear how much speed-up we get (if any). GPU instances cost more, so it would be nice to know whether the extra cost is worthwhile.

**Describe your proposed implementation**
Add a `benchmark` directory, possibly in `tests/benchmark`, with a script to time the execution of the scenarios under different parameters. The script should be platform/environment agnostic. We could then compile the benchmark statistics across a few different platforms (memory, GPU, CPUs) to give users a sense of the run time. This would also improve the time estimate printed during the simulation.
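A minimal sketch of what such a script could look like (assuming the same hypothetical `ndk.make_scenario` / `simulate_steady_state` entry points as above; the scenario names are the ones used by `plot_scenarios.py`):

```python
import json
import platform
import time

import neurotechdevkit as ndk

# Scenarios to benchmark; grid shapes grow so both small and large
# problems are covered.
SCENARIOS = ["scenario-0-v0", "scenario-1-2d-v0", "scenario-2-2d-v0"]


def run_benchmark():
    results = []
    for name in SCENARIOS:
        scenario = ndk.make_scenario(name)  # assumed NDK entry point
        start = time.perf_counter()
        scenario.simulate_steady_state()  # assumed simulation call
        results.append({
            "scenario": name,
            "seconds": round(time.perf_counter() - start, 2),
        })
    # Record enough host info to compare runs across platforms.
    return {"host": platform.platform(), "results": results}


if __name__ == "__main__":
    print(json.dumps(run_benchmark(), indent=2))
```

Keeping CPU/GPU selection outside the script via the `PLATFORM` environment variable, as in the raw-log commands above, would keep the script itself platform agnostic.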
**Additional comments**

In very simple testing I found no speed-up using the GPU with the default settings of `scenario-1-2d-v0`; both runs took about 10s. There may be a much larger speed-up for big simulations (3-D?), but this is the type of thing that would be good to know ahead of time.