SciML / PSOGPU.jl

GPU-accelerated Particle Swarm Optimization

Start towards caching and perf optimizations #45

Closed · utkarsh530 closed this 5 months ago

utkarsh530 commented 6 months ago


utkarsh530 commented 6 months ago

Before:

julia> @benchmark sol = solve(prob,
           ParallelSyncPSOKernel(1000, backend = CUDA.CUDABackend()),
           maxiters = 500)
BenchmarkTools.Trial: 87 samples with 1 evaluation.
 Range (min … max):  49.353 ms … 187.970 ms  ┊ GC (min … max): 0.00% … 38.72%
 Time  (median):     53.320 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.881 ms ±  23.307 ms  ┊ GC (mean ± σ):  3.74% ±  6.52%

  ▇█▃                                                           
  ███▇▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃ ▁
  49.4 ms         Histogram: frequency by time          179 ms <

 Memory estimate: 3.00 MiB, allocs estimate: 71083.

After:

julia> @benchmark solve(prob,
           ParallelSyncPSOKernel(1000, backend = CUDA.CUDABackend()),
           maxiters = 500)
BenchmarkTools.Trial: 132 samples with 1 evaluation.
 Range (min … max):  34.288 ms … 166.877 ms  ┊ GC (min … max): 0.00% … 22.80%
 Time  (median):     34.847 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.072 ms ±  19.359 ms  ┊ GC (mean ± σ):  2.52% ±  3.89%

  █                                                             
  █▆▄▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄ ▄
  34.3 ms       Histogram: log(frequency) by time       166 ms <

 Memory estimate: 1.71 MiB, allocs estimate: 41525.
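
For reference, the thread never shows how prob was constructed. A minimal setup along these lines reproduces the call pattern; the 2-D Rosenbrock objective, initial point, parameters, and bounds below are assumptions for illustration, not taken from the PR:

using PSOGPU, StaticArrays, CUDA, BenchmarkTools

# Hypothetical objective: the classic 2-D Rosenbrock function (assumed).
rosenbrock(x, p) = (p[1] - x[1])^2 + p[2] * (x[2] - x[1]^2)^2

x0 = @SArray [0.0f0, 0.0f0]      # initial guess (assumed)
p  = @SArray [1.0f0, 100.0f0]    # objective parameters (assumed)
lb = @SArray [-1.0f0, -1.0f0]    # lower bounds (assumed)
ub = @SArray [1.0f0, 1.0f0]      # upper bounds (assumed)

# OptimizationProblem is the SciMLBase/Optimization.jl problem type.
prob = OptimizationProblem(rosenbrock, x0, p; lb = lb, ub = ub)

# Same call pattern as the benchmarks above: 1000 particles, 500 iterations.
@benchmark solve($prob,
    ParallelSyncPSOKernel(1000, backend = CUDA.CUDABackend()),
    maxiters = 500)
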
utkarsh530 commented 6 months ago

https://github.com/SciML/QuasiMonteCarlo.jl/issues/115

Generating samples with QuasiMonteCarlo.jl always allocates. Benchmarking without the QMC sampling path (i.e., with lb = nothing; ub = nothing):
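Concretely, "without QMC" here means constructing the problem with no bounds, so the bound-based initial sampling is skipped. A sketch, reusing the assumed objective from the earlier setup:

# With lb/ub left as `nothing`, the QMC-based particle initialization
# referenced above is skipped, removing that allocation source.
prob = OptimizationProblem(rosenbrock, x0, p)   # no lb/ub keywords

@benchmark solve($prob, ParallelPSOKernel(1000, backend = CUDA.CUDABackend()))
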

Before:

julia> @benchmark sol = solve(prob, ParallelPSOKernel(1000, backend = CUDA.CUDABackend()))
BenchmarkTools.Trial: 274 samples with 1 evaluation.
 Range (min … max):  17.822 ms … 34.874 ms  ┊ GC (min … max): 0.00% … 45.44%
 Time  (median):     17.937 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.247 ms ±  2.227 ms  ┊ GC (mean ± σ):  1.56% ±  6.04%

  █                                                            
  █▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅
  17.8 ms      Histogram: log(frequency) by time      34.4 ms <

 Memory estimate: 1.63 MiB, allocs estimate: 35643.

After:

julia> @benchmark sol = solve(prob, ParallelPSOKernel(1000, backend = CUDA.CUDABackend()))
BenchmarkTools.Trial: 1376 samples with 1 evaluation.
 Range (min … max):  3.425 ms … 31.323 ms  ┊ GC (min … max): 0.00% … 77.35%
 Time  (median):     3.521 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.612 ms ±  1.283 ms  ┊ GC (mean ± σ):  1.68% ±  4.15%

                     ▃█▇▄▃▂▁ ▂ ▁                              
  ▂▂▂▂▂▂▂▂▂▁▂▃▂▃▃▃▃▅▇███████████▇▇▇▅▆▆▅▆▅▅▄▄▄▄▃▃▂▃▂▂▂▂▂▁▁▁▁▂ ▄
  3.43 ms        Histogram: frequency by time        3.64 ms <

 Memory estimate: 265.19 KiB, allocs estimate: 3106.

Note: this only reduces the time spent in the solve call's initialization (host-side setup and allocations); the GPU solve time itself is unchanged.
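
One way to check that split directly (a sketch; assumes a recent CUDA.jl where the integrated profiler is available) is to compare host and device activity:

# Host vs. device breakdown: kernel time should be essentially unchanged
# by this PR, while host-side time and allocations drop.
CUDA.@profile solve(prob,
    ParallelSyncPSOKernel(1000, backend = CUDA.CUDABackend()),
    maxiters = 500)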