[Closed] utkarsh530 closed this 2 years ago
I am not able to upload the .mem file here; I can share it on Slack, maybe.
Do we need these assertion checks here: https://github.com/SciML/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L311-L312? They cause allocations.

Why does that allocate? Is there a way to make it not allocate using Fix2?
Made a fix with Fix2. It is now reduced to 2 allocations.
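For context, here is a minimal CPU-side sketch (plain Julia, not the actual DiffEqGPU code) of the idea behind the fix: an anonymous predicate that captures a variable builds an ad-hoc closure, while `Base.Fix2` is a concretely typed callable struct that partially applies the second argument of a two-argument function.

```julia
# Sketch only: Base.Fix2(f, y) behaves like x -> f(x, y), but as a
# concretely typed struct rather than a capturing closure.
lo = 0.0f0
f_closure = x -> x > lo          # closure capturing `lo`
f_fix2    = Base.Fix2(>, lo)     # callable struct equivalent to x -> x > lo

xs = Float32[0.5, 1.5, 2.5]
@assert all(f_fix2, xs) == all(f_closure, xs)
```

Both predicates agree on every input; the `Fix2` form is friendlier to the compiler in hot paths where closure capture can show up as allocations.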
Some speed-ups as well:
Before:
```julia
julia> @benchmark sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
           adaptive = false, dt = dt)
BenchmarkTools.Trial: 497 samples with 1 evaluation.
 Range (min … max):  7.095 ms … 74.585 ms  ┊ GC (min … max): 0.00% … 24.81%
 Time  (median):     8.108 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.051 ms ± 7.866 ms  ┊ GC (mean ± σ):  8.66% ± 9.01%

  7.09 ms        Histogram: log(frequency) by time        43.3 ms <

 Memory estimate: 5.88 MiB, allocs estimate: 76811.
```
After:
```julia
julia> @benchmark solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(), trajectories = 1000,
           adaptive = false, dt = 0.1f0)
BenchmarkTools.Trial: 3789 samples with 1 evaluation.
 Range (min … max):  802.894 μs … 44.251 ms  ┊ GC (min … max): 0.00% … 61.15%
 Time  (median):     875.493 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.311 ms ± 2.630 ms    ┊ GC (mean ± σ):  26.85% ± 14.05%

  803 μs         Histogram: log(frequency) by time        12.7 ms <

 Memory estimate: 3.86 MiB, allocs estimate: 34400.
```
We are now 43x faster than EnsembleGPUArray and 14x faster than EnsembleCPUArray. Moreover, the current implementation now supports parameter parallelism across u0 as well, instead of only p.
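To make the u0 parallelism concrete, here is an illustrative plain-Julia sketch (not the DiffEqGPU API; all names are made up for illustration): each ensemble member carries its own initial condition in addition to its own parameter set, and a batched step updates all of them.

```julia
# Illustrative only: per-trajectory u0 *and* p, batched over the ensemble.
n = 4
u0s = [Float32[1 + i, 0] for i in 1:n]           # per-trajectory initial states
ps  = [Float32[10, 28 + i, 8 / 3] for i in 1:n]  # per-trajectory parameters

# One explicit Euler step of du/dt = -p[1] * u for every ensemble member:
euler_step(u, p, dt) = u .- p[1] .* u .* dt
us1 = [euler_step(u0s[i], ps[i], 0.01f0) for i in 1:n]
```

The point is only that the solve is indexed by trajectory in both `u0s` and `ps`, which is what "parameter parallelism across u0 as well" buys over varying p alone.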
It might be good to know the results with 10000 trajectories:
```julia
julia> @btime sol = solve(monteprob, GPUTsit5(), EnsembleGPUAutonomous(0.0),
           trajectories = 10000,
           adaptive = true, dt = 0.1f0, save_everystep = false)
  5.381 ms (138023 allocations: 8.86 MiB)

julia> @btime sol = solve(monteprob, Tsit5(), EnsembleCPUArray(),
           trajectories = 10000,
           adaptive = true, dt = 0.1f0, save_everystep = false)
  719.027 ms (344424 allocations: 1.14 GiB)

julia> @btime sol = solve(monteprob, Tsit5(), EnsembleGPUArray(0.0),
           trajectories = 10000,
           adaptive = true, dt = 0.1f0, save_everystep = false)
  166.850 ms (341201 allocations: 1.14 GiB)
```
Speed-ups: CPUArray: 135x, GPUArray: 31.4x.
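As a quick sanity check, the ratios of the `@btime` medians printed above come out slightly lower than the quoted 135x and 31.4x (which were presumably taken from a different run):

```julia
# Ratios of the @btime timings reported above (all in ms).
gpu_autonomous = 5.381
cpu_array      = 719.027
gpu_array      = 166.850

cpu_speedup = round(cpu_array / gpu_autonomous, digits = 1)  # vs EnsembleCPUArray
gpu_speedup = round(gpu_array / gpu_autonomous, digits = 1)  # vs EnsembleGPUArray
```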
Although I am not sure how practical it is to use 10,000 trajectories.
And what's the overhead vs the most direct form, and the current reason for the overhead?
> And what's the overhead vs. the most direct form, and the current reason for the overhead?
The major overhead is converting back to CPU arrays and finally building the solution (~60-70%), and the probs creation as well (~20%).
Is it okay to merge?
Answer the remaining question.
Cross-posted from Slack:
> And what's the overhead vs. the most direct form, and the current reason for the overhead?
The reason for this overhead (converting to CPU arrays) is to let users access something like sol[i].u[j], where i, j are some indices. Otherwise, that would cause scalar indexing on ts and us, which are CuArrays. I wanted to ask whether we should leave it to the user to convert to CPU arrays; I missed discussing that. The probs creation seems to be necessary, but maybe it could be pulled out of DiffEqGPU? Currently, it was done to adhere to DiffEqGPU's way of handling it. This did not show up in the previous benchmarks because ps was being built separately and passed to vectorized_solve. And sorry for any sloppiness that may have caused discomfort.
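The scalar-indexing cost driving this trade-off can be sketched without a GPU. Below is a CUDA-free mock (all names hypothetical, not the DiffEqGPU or CUDA.jl API) where a "device array" counts scalar reads: indexing it element by element triggers one transfer per element, while converting once to a host array is a single bulk copy.

```julia
# Mock illustration of why ts/us are converted to CPU arrays up front.
mutable struct MockDeviceArray{T}
    data::Vector{T}
    scalar_reads::Int          # stands in for per-element device->host transfers
end
Base.getindex(a::MockDeviceArray, i::Int) = (a.scalar_reads += 1; a.data[i])
to_host(a::MockDeviceArray) = copy(a.data)   # one bulk "transfer"

us = MockDeviceArray(collect(1.0f0:5.0f0), 0)

# Scalar indexing: one counted "transfer" per element accessed.
vals = [us[i] for i in 1:5]
@assert us.scalar_reads == 5

# Bulk conversion first: indexing the host copy costs no further transfers.
host = to_host(us)
@assert host[3] == 3.0f0
@assert us.scalar_reads == 5   # unchanged
```

This is why CUDA.jl disallows scalar indexing on device arrays by default; converting once before building the solution trades one copy for unrestricted `sol[i].u[j]` access.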
no
What does cu(probs) actually do? What dispatch is that hitting?
Hi,

With regard to the previous discussion on speed-up issues with the DiffEqGPU API and the new GPU solvers, I set out to bridge the performance gap (46 μs to 10.5 ms) between raw GPU solutions and using the Ensemble DiffEqGPU API. A small commit fix improved the speed by ~2.4 ms. (The u0 and p hcat was only needed by the previous solving techniques.)

Pros: it is now faster than EnsembleGPUArray and EnsembleCPUArray.

Before:

After:
With some profiling and tracking allocations (attached in the next comment), I was able to track down these sources of allocations:

- u0 and p build from EnsembleGPUAutonomous
- u0 and p in EnsembleGPUArray
- @assert necessity
- us arrays to CPU arrays

@ChrisRackauckas, have a look.