SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

Enhancement and Fixes in GPU Tsit5 #170

Closed · utkarsh530 closed this 2 years ago

utkarsh530 commented 2 years ago

Hi,

Regarding the previous discussion on speed-up issues with the DiffEqGPU API and the new GPU solvers, I set out to bridge the performance gap (46 μs vs. 10.5 ms 😱) between raw GPU solves and solves through the Ensemble DiffEqGPU API. A small commit improved the speed by ~2.4 ms (the hcat of u0 and p was only needed by the previous solving techniques). Pros: it is now faster than both EnsembleGPUArray and EnsembleCPUArray.
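
For context, monteprob and dt are not defined anywhere in this thread. Here is a minimal sketch of a compatible setup, following the standard DiffEqGPU Lorenz ensemble example (the dynamics and the parameter perturbation are assumptions for illustration, not taken from this PR):

using DiffEqGPU, OrdinaryDiffEq, StaticArrays

# Out-of-place ODE on static arrays: GPU-kernel solvers like GPUTsit5
# require isbits (stack-allocatable) problem types.
function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)

# Each trajectory re-solves the ODE with randomly scaled parameters.
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
dt = 0.1f0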

Before:

julia> @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                              adaptive = false, dt = dt)

BenchmarkTools.Trial: 386 samples with 1 evaluation.
 Range (min … max):   8.805 ms … 40.638 ms  ┊ GC (min … max): 0.00% … 23.29%
 Time  (median):     10.538 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.962 ms ±  5.690 ms  ┊ GC (mean ± σ):  8.49% ± 10.88%

  ▄▂▁▄█▄▂ ▁
  █████████▇▇▇▇▄▇▇▇▄▄▄▆▁▅▁▁▁▁▅▄▁▅▄▁▁▁▁▁▄▆▅▁▄▅▆▇██▅▅▆▁▁▄▁▁▄▅▁▅ ▆
  8.81 ms      Histogram: log(frequency) by time      30.3 ms <

 Memory estimate: 13.86 MiB, allocs estimate: 88075.

After:

julia> @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                              adaptive = false, dt = dt)
BenchmarkTools.Trial: 497 samples with 1 evaluation.
 Range (min … max):   7.095 ms … 74.585 ms  ┊ GC (min … max): 0.00% … 24.81%
 Time  (median):      8.108 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.051 ms ±  7.866 ms  ┊ GC (mean ± σ):  8.66% ±  9.01%

  ▂█▅ ▁
  ███▄█▅▁▁▁▄▁▁▁▁▁▁▁▄▁▄▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▄▁▄▅▁▇▇ ▆
  7.09 ms      Histogram: log(frequency) by time      43.3 ms <

 Memory estimate: 5.88 MiB, allocs estimate: 76811.

With some profiling and allocation tracking (attached in the next comment), I was able to track down and reduce these issues.

@ChrisRackauckas, have a look.

utkarsh530 commented 2 years ago
[Screenshot: allocation-tracking output, 2022-07-28]

I am not able to upload the .mem file here; I can share it on Slack, maybe.

utkarsh530 commented 2 years ago

Do we need these assertion checks (https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L311-L312)? They cause allocations.

[Screenshot: allocation profile, 2022-07-28]

ChrisRackauckas commented 2 years ago

Why does that allocate? Is there a way using Fix2 to make it not allocate?

utkarsh530 commented 2 years ago

Made a fix with Fix2. It is now down to 2 allocations.
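
For anyone following along, the closure-versus-Fix2 distinction looks roughly like this generic illustration (not the exact DiffEqGPU code; the actual checks are at the link above):

xs = rand(100)
cutoff = 0.5

# Closure form: the anonymous function captures `cutoff`, and the
# closure object can end up heap-allocated in some contexts.
n1 = count(x -> x <= cutoff, xs)

# Base.Fix2(f, y) is a concrete callable struct behaving like
# x -> f(x, y); it gives the compiler a fully typed predicate and
# avoids creating a closure.
n2 = count(Base.Fix2(<=, cutoff), xs)

@assert n1 == n2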

Some speed-ups as well:

Before:

julia> @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                              adaptive = false, dt = dt)
BenchmarkTools.Trial: 497 samples with 1 evaluation.
 Range (min … max):   7.095 ms … 74.585 ms  ┊ GC (min … max): 0.00% … 24.81%
 Time  (median):      8.108 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.051 ms ±  7.866 ms  ┊ GC (mean ± σ):  8.66% ±  9.01%

  ▂█▅ ▁
  ███▄█▅▁▁▁▄▁▁▁▁▁▁▁▄▁▄▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▄▁▄▅▁▇▇ ▆
  7.09 ms      Histogram: log(frequency) by time      43.3 ms <

 Memory estimate: 5.88 MiB, allocs estimate: 76811.

After:

julia> @benchmark solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
                   adaptive = false, dt = 0.1f0)
BenchmarkTools.Trial: 3789 samples with 1 evaluation.
 Range (min … max):  802.894 μs … 44.251 ms  ┊ GC (min … max):  0.00% … 61.15%
 Time  (median):     875.493 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.311 ms ±  2.630 ms  ┊ GC (mean ± σ):  26.85% ± 14.05%

  █▄▂▁
  ████▅▄▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▇▆▄▅▁▄ █
  803 μs        Histogram: log(frequency) by time      12.7 ms <

 Memory estimate: 3.86 MiB, allocs estimate: 34400.

We are now 43x faster than EnsembleGPUArray and 14x faster than EnsembleCPUArray. Moreover, the current implementation supports ensemble parallelism across u0 as well, not just p.
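
Concretely, u0 can now be varied per trajectory through the usual EnsembleProblem machinery; a hedged sketch (the perturbation is made up for illustration, and prob/u0 are from the setup sketch near the top of the thread):

# Vary the initial condition per trajectory instead of (or alongside) p;
# `remake` rebuilds the problem with the replaced field.
prob_func = (prob, i, repeat) ->
    remake(prob, u0 = u0 .+ 0.01f0 .* (@SVector rand(Float32, 3)))
monteprob_u0 = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)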

utkarsh530 commented 2 years ago

It might be good to know the results with 10,000 trajectories:

julia> @btime sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(0.0),
                          trajectories = 10000,
                          adaptive = true, dt = 0.1f0, save_everystep = false)
  5.381 ms (138023 allocations: 8.86 MiB)

julia> @btime sol = solve(monteprob, Tsit5(), EnsembleCPUArray(),
                          trajectories = 10000,
                          adaptive = true, dt = 0.1f0, save_everystep = false)
  719.027 ms (344424 allocations: 1.14 GiB)

julia> @btime sol = solve(monteprob, Tsit5(), EnsembleGPUArray(0.0),
                          trajectories = 10000,
                          adaptive = true, dt = 0.1f0, save_everystep = false)
  166.850 ms (341201 allocations: 1.14 GiB)

Speed-ups from the timings above: ~134x over EnsembleCPUArray and ~31x over EnsembleGPUArray.

Although I am not sure how practical it is to use 10,000 trajectories 😅.

ChrisRackauckas commented 2 years ago

And what's the overhead vs the most direct form, and the current reason for the overhead?

utkarsh530 commented 2 years ago

> And what's the overhead vs. the most direct form, and the current reason for the overhead?

The major overhead is converting back to CPU arrays and then building the solution (~60-70%). The probs creation contributes as well (~20%).
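
Schematically, the expensive post-processing looks something like this (a hedged reconstruction, not the actual DiffEqGPU source; the names are illustrative):

# `ts` and `us` come back from the GPU kernel as CuArrays with one
# column per trajectory. The device-to-host copies plus building a
# per-trajectory solution object dominate the cost.
ts_cpu = Array(ts)    # device -> host copy
us_cpu = Array(us)    # device -> host copy
sols = [(t = ts_cpu[:, i], u = us_cpu[:, i]) for i in 1:size(us_cpu, 2)]

In the actual code each entry is a full solution object rather than a named tuple, which adds further allocations.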

[Screenshot: profiling output, 2022-08-02]

utkarsh530 commented 2 years ago

Is it okay to merge?

ChrisRackauckas commented 2 years ago

Answer the remaining question.

utkarsh530 commented 2 years ago

Cross-posted from slack:

> And what’s the overhead vs. the most direct form, and the current reason for the overhead?

The reason for this overhead (converting to CPU arrays) is to let users access something like sol[i].u[j], where i, j are indexes. Without the conversion, that would cause scalar indexing on ts and us, which are CuArrays. I wanted to ask: should we leave it to the user to convert to CPU arrays? I missed discussing that. The probs creation seems to be necessary, but maybe it could be pulled out of DiffEqGPU? Currently it is done to adhere to DiffEqGPU's way of handling things. This did not show up in the previous benchmarks because ps was being built separately and passed to vectorized_solve. And sorry for any sloppiness that may have caused discomfort.
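
The scalar-indexing problem being referred to, as a generic CUDA.jl illustration (not code from this PR):

using CUDA

us = cu(rand(Float32, 4, 3))    # data lives on the device

CUDA.allowscalar(false)         # the default in non-interactive code
# us[2, 1]                      # scalar getindex: one tiny device read
                                # per element; throws an error here

CUDA.@allowscalar us[2, 1]      # explicit opt-in for a one-off read
Array(us)[2, 1]                 # or copy the whole array to the host first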

ChrisRackauckas commented 2 years ago

No.

What does cu(probs) actually do? What dispatch is that hitting?