SciML / PSOGPU.jl

GPU accelerated Particle Swarm Optimization
MIT License

[WIP] Add benchmark scripts #40

Closed · utkarsh530 closed this 7 months ago

utkarsh530 commented 7 months ago

(attachment: benchmark)

Summary: The async variant performs best, since it can pack more work onto the GPU. However, because it performs no synchronized global-best updates, its loss reduction is smaller than the synchronized version's and depends strongly on the problem. (See the sketch below for the distinction.)
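
For illustration, here is a minimal CPU-side sketch of that distinction (the 1-D search space, names, and coefficients are assumptions, not PSOGPU's GPU kernels): the synchronized variant ends every iteration with a global reduction that refreshes the shared best, while the async variant skips that barrier and lets particles keep moving against a possibly stale global best.

# Minimal sketch of sync vs. async PSO updates (illustrative only).
function pso_step!(x, v, pbest, gbest, f; sync = true,
                   w = 0.7f0, c1 = 1.5f0, c2 = 1.5f0)
    for i in eachindex(x)
        r1, r2 = rand(Float32), rand(Float32)
        v[i] = w * v[i] + c1 * r1 * (pbest[i] - x[i]) + c2 * r2 * (gbest[] - x[i])
        x[i] += v[i]
        f(x[i]) < f(pbest[i]) && (pbest[i] = x[i])
    end
    if sync
        # synchronized variant: one global reduction (a barrier) per iteration
        gbest[] = pbest[argmin(map(f, pbest))]
    end
    # async variant: skip the reduction, so more work can be packed per launch,
    # but particles see a stale gbest and the loss may drop more slowly
    return gbest[]
end

f(x) = (x - 3f0)^2                                   # toy objective
x = randn(Float32, 64); v = zeros(Float32, 64); pbest = copy(x)
gbest = Ref(pbest[argmin(map(f, pbest))])
foreach(_ -> pso_step!(x, v, pbest, gbest, f; sync = true), 1:100)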

utkarsh530 commented 7 months ago

Speed-up: up to 20x (Sync)
Speed-up: up to 250x (Async)

utkarsh530 commented 7 months ago

Benchmarking the hybrid variant is the tricky one, since it involves several memory copies that need to be isolated from the timed region.
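
For reference, one common way to keep such copies out of the timed region with BenchmarkTools is a setup clause plus CUDA.@sync; the array and reduction below are placeholders, not the actual hybrid benchmark.

using BenchmarkTools, CUDA

h_data = rand(Float32, 10_000)   # placeholder host data

# The host -> device copy happens in `setup`, outside the timed region;
# CUDA.@sync makes sure the GPU work itself is what gets measured.
@benchmark CUDA.@sync(sum(abs2, d_data)) setup = (d_data = CuArray($h_data)) evals = 1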

utkarsh530 commented 7 months ago

Neural ODE:

Adam:


julia> @benchmark Optimization.solve(optprob, ADAM(0.05), maxiters = 100)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.545 s …   1.674 s  ┊ GC (min … max): 5.30% … 12.02%
 Time  (median):     1.604 s              ┊ GC (median):    8.22%
 Time  (mean ± σ):   1.607 s ± 60.158 ms  ┊ GC (mean ± σ):  8.51% ±  3.05%

  █         █                              █              █  
  █▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.54 s         Histogram: frequency by time        1.67 s <

 Memory estimate: 353.97 MiB, allocs estimate: 4859457.

julia> @show res.objective
res.objective = 33.002594f0
33.002594f0
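
For context, a hedged sketch of how an optprob like this is typically assembled for a Neural ODE loss; the layer sizes, data, and solver choices below are assumptions, and only the Optimization.solve(optprob, ADAM(0.05), maxiters = 100) call above comes from the benchmark.

using Optimization, OptimizationOptimisers, OrdinaryDiffEq, SciMLSensitivity
using Lux, ComponentArrays, Random, Zygote

rng = Random.default_rng()
nn = Lux.Chain(Lux.Dense(2 => 16, tanh), Lux.Dense(16 => 2))   # assumed architecture
ps, st = Lux.setup(rng, nn)
p0 = ComponentArray(ps)

u0 = Float32[2.0, 0.0]
tspan = (0.0f0, 1.5f0)
tsteps = range(tspan[1], tspan[2]; length = 30)
target = rand(Float32, 2, length(tsteps))                      # placeholder data

dudt(u, p, t) = first(nn(u, p, st))
prob_nn = ODEProblem(dudt, u0, tspan, p0)

function loss(p, _)
    sol = solve(prob_nn, Tsit5(); p = p, saveat = tsteps)
    return sum(abs2, Array(sol) .- target)
end

optf = OptimizationFunction(loss, Optimization.AutoZygote())
optprob = OptimizationProblem(optf, p0)
# res = Optimization.solve(optprob, ADAM(0.05), maxiters = 100)   # as benchmarked above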

PSO:


julia> @benchmark PSOGPU.parameter_estim_ode!($prob_nn, $(deepcopy(solver_cache)), $lb, $ub; saveat = tsteps, dt = 0.1f0, prob_func = prob_func, maxiters = 100)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  492.334 ms … 655.447 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     545.838 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   567.069 ms ±  49.577 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █             █  █ ██             █      ██                 █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  492 ms           Histogram: frequency by time          655 ms <

 Memory estimate: 2.20 MiB, allocs estimate: 43106.

julia> @show gsol.cost
gsol.cost = 0.6644439f0
0.6644439f0
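
The prob_func and bounds used above are not shown in the thread; as an assumption (in the spirit of SciML's ensemble interface, and not taken from PSOGPU's docs), they could be a remake-based mapping from a candidate parameter set to a new problem plus box constraints on the network parameters. parameter_estim_ode!'s exact expected signature may differ.

using SciMLBase

# Hypothetical prob_func: swap each particle's candidate parameters into the problem.
prob_func(prob, p) = remake(prob; p = p)

# Hypothetical search bounds over the flattened network parameters
# (p0 as in the sketch above).
lb = fill(-10.0f0, length(p0))
ub = fill(10.0f0, length(p0))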

(attachment: Neural_ODE)

utkarsh530 commented 7 months ago
julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.1
NVIDIA driver 530.30.2

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+530.30.2

Julia packages: 
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

7 devices:
  0: Tesla V100-PCIE-16GB (sm_70, 12.212 GiB / 16.000 GiB available)
  1: Tesla V100S-PCIE-32GB (sm_70, 31.405 GiB / 32.000 GiB available)
  2: Tesla V100S-PCIE-32GB (sm_70, 21.589 GiB / 32.000 GiB available)
  3: Tesla P100-PCIE-16GB (sm_60, 15.893 GiB / 16.000 GiB available)
  4: Tesla P100-PCIE-16GB (sm_60, 8.666 GiB / 16.000 GiB available)
  5: NVIDIA GeForce GTX 1080 Ti (sm_61, 10.848 GiB / 11.000 GiB available)
  6: NVIDIA GeForce GTX 1080 Ti (sm_61, 3.770 GiB / 11.000 GiB available)

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
  Threads: 21 on 12 virtual cores
Environment:
  LD_LIBRARY_PATH = 
  JULIA_EDITOR = code

julia> device()
CuDevice(2): Tesla V100S-PCIE-32GB