agencyenterprise / neurotechdevkit

Neurotech Development Kit (NDK)
https://agencyenterprise.github.io/neurotechdevkit/
Apache License 2.0

Add benchmarks for CPU/GPU #114

Open · vaiana opened this issue 1 year ago

vaiana commented 1 year ago

Describe the new feature or enhancement

It is possible to run NDK on a GPU, but it's not clear how much speed-up we get (if any). GPU instances cost more, so it would be nice to know whether the extra cost is worthwhile.

Describe your proposed implementation

Add a benchmark directory, possibly under tests/benchmark, with a script that times the execution of the scenarios under different parameters. The script should be platform/environment agnostic. We could then compile the benchmark statistics across a few different platforms (memory, GPU, CPU count) to give users a sense of the expected run time. This would also improve the time estimate printed during the simulation. A rough sketch of such a script is included below.
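For illustration, a minimal sketch of what such a script could look like. It assumes NDK's `ndk.make()` / `simulate_steady_state()` entry points and the scenario IDs mentioned in this issue; the exact API may differ between NDK versions.

```python
# tests/benchmark/run_benchmarks.py (sketch only; the NDK calls below are
# assumptions and may need adjusting to the installed version).
# Times each scenario end-to-end and writes one row per run to a CSV so that
# results from different machines/platforms can be compared.
import csv
import platform
import time

import neurotechdevkit as ndk

SCENARIOS = ["scenario-0-v0", "scenario-1-2d-v0", "scenario-2-2d-v0"]


def main() -> None:
    with open("benchmark_results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["scenario", "machine", "wall_clock_s"])
        for scenario_id in SCENARIOS:
            scenario = ndk.make(scenario_id)
            start = time.perf_counter()
            scenario.simulate_steady_state()
            elapsed = time.perf_counter() - start
            writer.writerow([scenario_id, platform.platform(), f"{elapsed:.2f}"])
            print(f"{scenario_id}: {elapsed:.2f} s")


if __name__ == "__main__":
    main()
```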

Additional comments

In very simple testing I found no speed-up using the GPU for the default settings of scenario-1-2d-v0; both runs took about 10 s. There may be a much larger speed-up for big simulations (3D?), but this is the type of thing that would be good to know ahead of time.

charlesbmi commented 1 year ago

CPU comparisons are available in this private repo: https://github.com/agencyenterprise/ndk-research/blob/main/experiments/184354911-resources-benchmark-for-scenarios/184354911-resource_utilization_benchmark.ipynb

That data was used to estimate time/memory requirements for NDK users.

I might look into the GPU comparisons; those would run into memory limits, though.

charlesbmi commented 1 year ago

As a preliminary CPU/GPU comparison, I ran docs/examples/plot_scenarios.py on an AWS EC2 p3.2xlarge instance, which has an NVIDIA V100 GPU and 8 vCPUs. A more thorough investigation is in order, but the generated logs actually provide some good intuition about what kinds of speed-ups can be expected.

Stride's computations appear to take three major steps:

  1. Generating the C++ code for the given scenario.
  2. Compiling the generated C++ code (this step can be cached).
  3. Running the compiled operator.

For a given scenario, the times seemed consistent (±5%) across runs. For the CPU setup, I used Devito's automatic selection of the number of CPU threads, which always came out to 4 threads.

| Scenario [grid shape] | Step | CPU (4 threads) | GPU (V100) |
| --- | --- | --- | --- |
| scenario-0-v0 [101, 81] | Generate acoustic_iso_state operator | 10.03 s | 10.39 s |
| | JIT-compile C++ file | 3.64 s | 4.3 s |
| | Run operator [11 GFlops] | 18 GFlops/s | 18 GFlops/s |
| scenario-1-2d-v0 [241, 141] | Generate acoustic_iso_state operator | 9.76 s | 10.8 s |
| | JIT-compile C++ file | 3.24 s | 4.38 s |
| | Run operator [42 GFlops] | 27 GFlops/s | 48 GFlops/s |
| scenario-2-2d-v0 [451, 351] | Generate acoustic_iso_state operator | 9.82 s | 10.8 s |
| | JIT-compile C++ file | 3.25 s | 4.27 s |
| | Run operator [202 GFlops] | 35 GFlops/s | 109 GFlops/s |
| scenario-1-3d-v0 [241, 141, 141] | Generate acoustic_iso_state operator | 23.3 s | 23.66 s |
| | JIT-compile C++ file | 6.9 s | 8.53 s |
| | Run operator [16 TFlops] | 21 GFlops/s | 610 GFlops/s |

(Bracketed values in the Run operator rows are the total floating-point work for that step; the Generate and JIT-compile rows are wall-clock times.)

From the table, we can see that:

- Operator generation and JIT compilation take roughly the same time on CPU and GPU; they act as a fixed overhead of about 13-15 s for the 2D scenarios and about 30 s for the 3D scenario.
- For the smallest scenario (scenario-0-v0), the GPU provides no throughput advantage.
- The GPU advantage grows with grid size: roughly 1.8x for scenario-1-2d-v0, 3x for scenario-2-2d-v0, and 29x for scenario-1-3d-v0.
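Translating the table's throughput figures into rough wall-clock estimates for the Run operator step alone (derived only from the GFlops totals and GFlops/s values above):

```python
# Rough wall-clock estimates for the "Run operator" step, derived from the
# GFlops totals and GFlops/s throughputs reported in the table above.
workload_gflops = {
    "scenario-0-v0": 11,
    "scenario-1-2d-v0": 42,
    "scenario-2-2d-v0": 202,
    "scenario-1-3d-v0": 16_000,  # 16 TFlops
}
throughput_gflops_per_s = {  # (CPU, GPU)
    "scenario-0-v0": (18, 18),
    "scenario-1-2d-v0": (27, 48),
    "scenario-2-2d-v0": (35, 109),
    "scenario-1-3d-v0": (21, 610),
}
for name, work in workload_gflops.items():
    cpu, gpu = throughput_gflops_per_s[name]
    print(f"{name}: CPU ~{work / cpu:.1f} s, GPU ~{work / gpu:.1f} s")
```

For scenario-1-3d-v0 this works out to roughly 760 s on the CPU versus about 26 s on the GPU, so the GPU only pays off once the run step dominates the ~30 s of generation/compilation overhead.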

Raw logs:

```shell
env -u PLATFORM python docs/examples/plot_scenarios.py | tee cpu_log.txt
```

(attached: cpu_log.txt)

```shell
PLATFORM=nvidia-acc python docs/examples/plot_scenarios.py | tee gpu_log.txt
```

(attached: gpu_log.txt)
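For convenience, a minimal sketch that runs both configurations from one script and records total wall-clock time; the PLATFORM handling mirrors the two commands above, and everything else is standard-library Python:

```python
# Minimal sketch: time docs/examples/plot_scenarios.py end-to-end for the CPU
# and GPU configurations used above. PLATFORM handling mirrors those commands.
import os
import subprocess
import time

CONFIGS = {
    "cpu": None,          # PLATFORM unset, as in `env -u PLATFORM ...`
    "gpu": "nvidia-acc",  # as in `PLATFORM=nvidia-acc ...`
}

for label, platform_value in CONFIGS.items():
    env = dict(os.environ)
    env.pop("PLATFORM", None)
    if platform_value is not None:
        env["PLATFORM"] = platform_value
    start = time.perf_counter()
    subprocess.run(
        ["python", "docs/examples/plot_scenarios.py"],
        env=env,
        check=True,
    )
    print(f"{label}: {time.perf_counter() - start:.1f} s total wall-clock")
```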

charlesbmi commented 1 year ago

Note: I tried running the scenarios with different points_per_period settings to change the time resolution. This linearly changed the number of GFlops for the Run operator but did not seem to have any effect on the GFlops/s throughput. It seems the GPU only improves parallelization over the space dimensions.
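For reference, a sketch of the kind of sweep described above; `ndk.make()` and the `points_per_period` keyword to `simulate_steady_state()` are assumptions about the NDK API and may need adjusting for the installed version:

```python
# Sketch of a points_per_period sweep (API names are assumptions; adjust to
# the installed NDK version). Times each steady-state simulation end-to-end.
import time

import neurotechdevkit as ndk

for ppp in (12, 24, 48):
    scenario = ndk.make("scenario-1-2d-v0")  # scenario id taken from this issue
    start = time.perf_counter()
    scenario.simulate_steady_state(points_per_period=ppp)
    print(f"points_per_period={ppp}: {time.perf_counter() - start:.1f} s")
```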