agencyenterprise / neurotechdevkit

Neurotech Development Kit (NDK)
https://agencyenterprise.github.io/neurotechdevkit/
Apache License 2.0

Add benchmarks for CPU/GPU #114

Open · vaiana opened this issue 1 year ago

vaiana commented 1 year ago

Describe the new feature or enhancement

It is possible to run NDK on a GPU, but it's not clear how much speed-up we get (if any). GPU instances cost more, so it would be nice to know whether the extra cost is worthwhile.

Describe your proposed implementation

Add a benchmark directory, possibly under tests/benchmark, with a script that times the execution of the scenarios under different parameters. The script should be platform/environment agnostic. We could then compile the benchmark statistics across a few different platforms (memory, GPU, CPU count) to give users a sense of the expected run time. This would also improve the time estimate printed during the simulation. A rough sketch of such a script is included below.
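For illustration, a minimal sketch of what such a script could look like. It assumes NDK's `ndk.make()` / `simulate_steady_state()` entry points and the scenario IDs mentioned in this issue; the exact API may differ between NDK versions.

```python
# tests/benchmark/run_benchmarks.py (sketch only; the NDK calls below are
# assumptions and may need adjusting to the installed version).
# Times each scenario end-to-end and writes one row per run to a CSV so that
# results from different machines/platforms can be compared.
import csv
import platform
import time

import neurotechdevkit as ndk

SCENARIOS = ["scenario-0-v0", "scenario-1-2d-v0", "scenario-2-2d-v0"]


def main() -> None:
    with open("benchmark_results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["scenario", "machine", "wall_clock_s"])
        for scenario_id in SCENARIOS:
            scenario = ndk.make(scenario_id)
            start = time.perf_counter()
            scenario.simulate_steady_state()
            elapsed = time.perf_counter() - start
            writer.writerow([scenario_id, platform.platform(), f"{elapsed:.2f}"])
            print(f"{scenario_id}: {elapsed:.2f} s")


if __name__ == "__main__":
    main()
```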

Additional comments

In very simple testing I found no speed-up using the GPU for the default settings of scenario-1-2d-v0; both runs took about 10 s. There may be a much larger speed-up for big simulations (3D?), but this is the type of thing that would be good to know ahead of time.

charlesbmi commented 1 year ago

CPU comparisons are available in this private repo: https://github.com/agencyenterprise/ndk-research/blob/main/experiments/184354911-resources-benchmark-for-scenarios/184354911-resource_utilization_benchmark.ipynb

That data was used to estimate time/memory requirements for NDK users.

I might look into the GPU comparisons; those would run into memory limits, though.

charlesbmi commented 1 year ago

As a preliminary CPU/GPU comparison, I ran docs/examples/plot_scenarios.py on an AWS EC2 p3.2xlarge instance, which has an NVIDIA V100 GPU and 8 vCPUs. A more thorough investigation is in order, but the generated logs actually provide some good intuition about what kinds of speed-ups can be expected.

Stride's computations appear to take three major steps:

  1. Generating the C++ code for the given scenario.
  2. Compiling the generated C++ code (this step can be cached).
  3. Running the compiled operator.

For a given scenario, the times seemed consistent (±5%) across runs. For the CPU setup, I used Devito's automatic selection of the number of CPU threads, which always came out to 4 threads.

| Scenario [grid shape] | Step | CPU (4 threads) | GPU (V100) |
| --- | --- | --- | --- |
| scenario-0-v0 [101, 81] | Generate acoustic_iso_state operator | 10.03 s | 10.39 s |
| | JIT-compile C++ file | 3.64 s | 4.3 s |
| | Run operator [11 GFlops] | 18 GFlops/s | 18 GFlops/s |
| scenario-1-2d-v0 [241, 141] | Generate acoustic_iso_state operator | 9.76 s | 10.8 s |
| | JIT-compile C++ file | 3.24 s | 4.38 s |
| | Run operator [42 GFlops] | 27 GFlops/s | 48 GFlops/s |
| scenario-2-2d-v0 [451, 351] | Generate acoustic_iso_state operator | 9.82 s | 10.8 s |
| | JIT-compile C++ file | 3.25 s | 4.27 s |
| | Run operator [202 GFlops] | 35 GFlops/s | 109 GFlops/s |
| scenario-1-3d-v0 [241, 141, 141] | Generate acoustic_iso_state operator | 23.3 s | 23.66 s |
| | JIT-compile C++ file | 6.9 s | 8.53 s |
| | Run operator [16 TFlops] | 21 GFlops/s | 610 GFlops/s |

(Bracketed values in the Run operator rows are the total floating-point work for that step; the Generate and JIT-compile rows are wall-clock times.)

From the table, we can see that:

- Operator generation and JIT compilation take roughly the same time on CPU and GPU; they act as a fixed overhead of about 13-15 s for the 2D scenarios and about 30 s for the 3D scenario.
- For the smallest scenario (scenario-0-v0), the GPU provides no throughput advantage.
- The GPU advantage grows with grid size: roughly 1.8x for scenario-1-2d-v0, 3x for scenario-2-2d-v0, and 29x for scenario-1-3d-v0.
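Translating the table's throughput figures into rough wall-clock estimates for the Run operator step alone (derived only from the GFlops totals and GFlops/s values above):

```python
# Rough wall-clock estimates for the "Run operator" step, derived from the
# GFlops totals and GFlops/s throughputs reported in the table above.
workload_gflops = {
    "scenario-0-v0": 11,
    "scenario-1-2d-v0": 42,
    "scenario-2-2d-v0": 202,
    "scenario-1-3d-v0": 16_000,  # 16 TFlops
}
throughput_gflops_per_s = {  # (CPU, GPU)
    "scenario-0-v0": (18, 18),
    "scenario-1-2d-v0": (27, 48),
    "scenario-2-2d-v0": (35, 109),
    "scenario-1-3d-v0": (21, 610),
}
for name, work in workload_gflops.items():
    cpu, gpu = throughput_gflops_per_s[name]
    print(f"{name}: CPU ~{work / cpu:.1f} s, GPU ~{work / gpu:.1f} s")
```

For scenario-1-3d-v0 this works out to roughly 760 s on the CPU versus about 26 s on the GPU, so the GPU only pays off once the run step dominates the ~30 s of generation/compilation overhead.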

Raw logs:

```shell
env -u PLATFORM python docs/examples/plot_scenarios.py | tee cpu_log.txt
```

(attached: cpu_log.txt)

```shell
PLATFORM=nvidia-acc python docs/examples/plot_scenarios.py | tee gpu_log.txt
```

(attached: gpu_log.txt)
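For convenience, a minimal sketch that runs both configurations from one script and records total wall-clock time; the PLATFORM handling mirrors the two commands above, and everything else is standard-library Python:

```python
# Minimal sketch: time docs/examples/plot_scenarios.py end-to-end for the CPU
# and GPU configurations used above. PLATFORM handling mirrors those commands.
import os
import subprocess
import time

CONFIGS = {
    "cpu": None,          # PLATFORM unset, as in `env -u PLATFORM ...`
    "gpu": "nvidia-acc",  # as in `PLATFORM=nvidia-acc ...`
}

for label, platform_value in CONFIGS.items():
    env = dict(os.environ)
    env.pop("PLATFORM", None)
    if platform_value is not None:
        env["PLATFORM"] = platform_value
    start = time.perf_counter()
    subprocess.run(
        ["python", "docs/examples/plot_scenarios.py"],
        env=env,
        check=True,
    )
    print(f"{label}: {time.perf_counter() - start:.1f} s total wall-clock")
```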

charlesbmi commented 1 year ago

Note: I tried running the scenarios with different points_per_period settings to change the time resolution. This linearly changed the number of GFlops for the Run operator but did not seem to have any effect on the GFlops/s throughput. It seems the GPU only improves parallelization over the space dimensions.
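For reference, a sketch of the kind of sweep described above; `ndk.make()` and the `points_per_period` keyword to `simulate_steady_state()` are assumptions about the NDK API and may need adjusting for the installed version:

```python
# Sketch of a points_per_period sweep (API names are assumptions; adjust to
# the installed NDK version). Times each steady-state simulation end-to-end.
import time

import neurotechdevkit as ndk

for ppp in (12, 24, 48):
    scenario = ndk.make("scenario-1-2d-v0")  # scenario id taken from this issue
    start = time.perf_counter()
    scenario.simulate_steady_state(points_per_period=ppp)
    print(f"points_per_period={ppp}: {time.perf_counter() - start:.1f} s")
```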