SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

Enhancement and Fixes in GPU Tsit5 #170

Closed · utkarsh530 closed this 2 years ago

utkarsh530 commented 2 years ago

Hi,

Regarding the previous discussion on speed-up issues with the DiffEqGPU API and the new GPU solvers, I set out to bridge the performance gap (46 μs vs. 10.5 ms 😱) between raw GPU solves and solves through the Ensemble DiffEqGPU API. A small commit improved the speed by ~2.4 ms (the hcat of u0 and p was only needed by the previous solving techniques). Pros: it is now faster than both EnsembleGPUArray and EnsembleCPUArray.
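
For context, monteprob and dt are not defined anywhere in this thread. Here is a minimal sketch of a compatible setup, following the standard DiffEqGPU Lorenz ensemble example (the dynamics and the parameter perturbation are assumptions for illustration, not taken from this PR):

using DiffEqGPU, OrdinaryDiffEq, StaticArrays

# Out-of-place ODE on static arrays: GPU-kernel solvers like GPUTsit5
# require isbits (stack-allocatable) problem types.
function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)

# Each trajectory re-solves the ODE with randomly scaled parameters.
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
dt = 0.1f0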

Before:

julia> @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                              adaptive = false, dt = dt)

BenchmarkTools.Trial: 386 samples with 1 evaluation.
 Range (min … max):   8.805 ms … 40.638 ms  ┊ GC (min … max): 0.00% … 23.29%
 Time  (median):     10.538 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.962 ms ±  5.690 ms  ┊ GC (mean ± σ):  8.49% ± 10.88%

  ▄▂▁▄█▄▂ ▁
  █████████▇▇▇▇▄▇▇▇▄▄▄▆▁▅▁▁▁▁▅▄▁▅▄▁▁▁▁▁▄▆▅▁▄▅▆▇██▅▅▆▁▁▄▁▁▄▅▁▅ ▆
  8.81 ms      Histogram: log(frequency) by time      30.3 ms <

 Memory estimate: 13.86 MiB, allocs estimate: 88075.

After:

julia> @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                              adaptive = false, dt = dt)
BenchmarkTools.Trial: 497 samples with 1 evaluation.
 Range (min … max):   7.095 ms … 74.585 ms  ┊ GC (min … max): 0.00% … 24.81%
 Time  (median):      8.108 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.051 ms ±  7.866 ms  ┊ GC (mean ± σ):  8.66% ±  9.01%

  ▂█▅ ▁
  ███▄█▅▁▁▁▄▁▁▁▁▁▁▁▄▁▄▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▄▁▄▅▁▇▇ ▆
  7.09 ms      Histogram: log(frequency) by time      43.3 ms <

 Memory estimate: 5.88 MiB, allocs estimate: 76811.

With some profiling and allocation tracking (attached in the next comment), I was able to track down and reduce these issues.

@ChrisRackauckas, have a look.

utkarsh530 commented 2 years ago
[Screenshot: allocation-tracking output, 2022-07-28]

I am not able to upload the .mem file here; I can share it on Slack, maybe.

utkarsh530 commented 2 years ago

Do we need these assertion checks (https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L311-L312)? They cause allocations.

[Screenshot: allocation profile, 2022-07-28]

ChrisRackauckas commented 2 years ago

Why does that allocate? Is there a way using Fix2 to make it not allocate?

utkarsh530 commented 2 years ago

Made a fix with Fix2. It is now down to 2 allocations.
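
For anyone following along, the closure-versus-Fix2 distinction looks roughly like this generic illustration (not the exact DiffEqGPU code; the actual checks are at the link above):

xs = rand(100)
cutoff = 0.5

# Closure form: the anonymous function captures `cutoff`, and the
# closure object can end up heap-allocated in some contexts.
n1 = count(x -> x <= cutoff, xs)

# Base.Fix2(f, y) is a concrete callable struct behaving like
# x -> f(x, y); it gives the compiler a fully typed predicate and
# avoids creating a closure.
n2 = count(Base.Fix2(<=, cutoff), xs)

@assert n1 == n2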

Some speed-ups as well:

Before:

julia> @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                              adaptive = false, dt = dt)
BenchmarkTools.Trial: 497 samples with 1 evaluation.
 Range (min … max):   7.095 ms … 74.585 ms  ┊ GC (min … max): 0.00% … 24.81%
 Time  (median):      8.108 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.051 ms ±  7.866 ms  ┊ GC (mean ± σ):  8.66% ±  9.01%

  ▂█▅ ▁
  ███▄█▅▁▁▁▄▁▁▁▁▁▁▁▄▁▄▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▄▁▄▅▁▇▇ ▆
  7.09 ms      Histogram: log(frequency) by time      43.3 ms <

 Memory estimate: 5.88 MiB, allocs estimate: 76811.

After:

julia> @benchmark solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                   adaptive = false, dt = 0.1f0)
BenchmarkTools.Trial: 3789 samples with 1 evaluation.
 Range (min … max):  802.894 μs … 44.251 ms  ┊ GC (min … max):  0.00% … 61.15%
 Time  (median):     875.493 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.311 ms ±  2.630 ms  ┊ GC (mean ± σ):  26.85% ± 14.05%

  █▄▂▁
  ████▅▄▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▇▆▄▅▁▄ █
  803 μs        Histogram: log(frequency) by time      12.7 ms <

 Memory estimate: 3.86 MiB, allocs estimate: 34400.

We are now 43x faster than EnsembleGPUArray and 14x faster than EnsembleCPUArray. Moreover, the current implementation supports ensemble parallelism across u0 as well, not just p.
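
Concretely, u0 can now be varied per trajectory through the usual EnsembleProblem machinery; a hedged sketch (the perturbation is made up for illustration, and prob/u0 are from the setup sketch near the top of the thread):

# Vary the initial condition per trajectory instead of (or alongside) p;
# `remake` rebuilds the problem with the replaced field.
prob_func = (prob, i, repeat) ->
    remake(prob, u0 = u0 .+ 0.01f0 .* (@SVector rand(Float32, 3)))
monteprob_u0 = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)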

utkarsh530 commented 2 years ago

It might be good to know the results with 10,000 trajectories:

julia> @btime sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(0.0),
                          trajectories = 10000,
                          adaptive = true, dt = 0.1f0, save_everystep = false)
  5.381 ms (138023 allocations: 8.86 MiB)

julia> @btime sol = solve(monteprob, Tsit5(), EnsembleCPUArray(),
                          trajectories = 10000,
                          adaptive = true, dt = 0.1f0, save_everystep = false)
  719.027 ms (344424 allocations: 1.14 GiB)

julia> @btime sol = solve(monteprob, Tsit5(), EnsembleGPUArray(0.0),
                          trajectories = 10000,
                          adaptive = true, dt = 0.1f0, save_everystep = false)
  166.850 ms (341201 allocations: 1.14 GiB)

Speed-ups from the timings above: ~134x over EnsembleCPUArray and ~31x over EnsembleGPUArray.

Although I am not sure how practical it is to use 10,000 trajectories 😅.

ChrisRackauckas commented 2 years ago

And what's the overhead vs the most direct form, and the current reason for the overhead?

utkarsh530 commented 2 years ago

> And what's the overhead vs. the most direct form, and the current reason for the overhead?

The major overhead is converting back to CPU arrays and then building the solution (~60-70%). The probs creation contributes as well (~20%).
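
Schematically, the expensive post-processing looks something like this (a hedged reconstruction, not the actual DiffEqGPU source; the names are illustrative):

# `ts` and `us` come back from the GPU kernel as CuArrays with one
# column per trajectory. The device-to-host copies plus building a
# per-trajectory solution object dominate the cost.
ts_cpu = Array(ts)    # device -> host copy
us_cpu = Array(us)    # device -> host copy
sols = [(t = ts_cpu[:, i], u = us_cpu[:, i]) for i in 1:size(us_cpu, 2)]

In the actual code each entry is a full solution object rather than a named tuple, which adds further allocations.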

[Screenshot: profiling output, 2022-08-02]

utkarsh530 commented 2 years ago

Is it okay to merge?

ChrisRackauckas commented 2 years ago

Answer the remaining question.

utkarsh530 commented 2 years ago

Cross-posted from slack:

> And what’s the overhead vs. the most direct form, and the current reason for the overhead?

The reason for this overhead (converting to CPU arrays) is to let users access something like sol[i].u[j], where i, j are indexes. Without the conversion, that would cause scalar indexing on ts and us, which are CuArrays. I wanted to ask: should we leave it to the user to convert to CPU arrays? I missed discussing that. The probs creation seems to be necessary, but maybe it could be pulled out of DiffEqGPU? Currently it is done to adhere to DiffEqGPU's way of handling things. This did not show up in the previous benchmarks because ps was being built separately and passed to vectorized_solve. And sorry for any sloppiness that may have caused discomfort.
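
The scalar-indexing problem being referred to, as a generic CUDA.jl illustration (not code from this PR):

using CUDA

us = cu(rand(Float32, 4, 3))    # data lives on the device

CUDA.allowscalar(false)         # the default in non-interactive code
# us[2, 1]                      # scalar getindex: one tiny device read
                                # per element; throws an error here

CUDA.@allowscalar us[2, 1]      # explicit opt-in for a one-off read
Array(us)[2, 1]                 # or copy the whole array to the host first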

ChrisRackauckas commented 2 years ago

No.

What does cu(probs) actually do? What dispatch is that hitting?