Closed: utkarsh530 closed this issue 1 year ago.
Possible workaround: leave it to the user to convert to CPU arrays if they need to index the solution.
We can make that an option (with a `Val` type). But we can also
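A minimal sketch of what such a `Val`-typed switch might look like — the function and keyword names here are hypothetical, not an existing DiffEqGPU API:

```julia
# Hypothetical sketch: dispatch on a `Val` type to decide whether the
# solution arrays are copied back to the host after solving.
# `finalize_solution` is an assumed name for illustration only.
function finalize_solution(ts, us, ::Val{true})
    # Copy device arrays back to the CPU so the user can index cheaply.
    return Array(ts), Array(us)
end

function finalize_solution(ts, us, ::Val{false})
    # Leave them as CuArrays; indexing is then the user's responsibility.
    return ts, us
end
```

Because the flag is encoded in the type, the branch is resolved at compile time rather than at runtime, which keeps the hot path free of a conditional.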
The `probs` creation within `DiffEqGPU` seems to be necessary, but maybe it could be pulled out of `DiffEqGPU`? Currently, it was done to adhere to the `DiffEqGPU` way of handling things. This was not showing up in the previous benchmarks because `ps` was being built separately and passed to `vectorized_solve`.
I think for that we can have a documented lower-level API for people who really want to pull as much speed out as possible. On that note, we should make some real docs.
Sounds good to me. I will start writing some documentation for it. I can help set up a docs page for it, something aligned with SciMLDocs.
https://github.com/SciML/DiffEqGPU.jl/pull/170

The latest profile, while solving with `EnsembleGPUKernel`, raises some questions. Some overheads are discussed here for potential improvements to `EnsembleGPUKernel` for `Tsit5`:

- Indexing into `ts` and `us`, which are `CuArray`s. Possible workaround: leave it to the user to convert to CPU arrays if they need to index the solution.
- The `probs` creation within `DiffEqGPU` seems to be necessary, but maybe it could be pulled out of `DiffEqGPU`? Currently, it was done to adhere to the `DiffEqGPU` way of handling things. This was not showing up in the previous benchmarks because `ps` was being built separately and passed to `vectorized_solve`. Possible workaround: create `ps` or `u0s` and pass them into `DiffEqGPU` instead of only specifying the trajectories, and let the library handle the rest.

If we don't convert to CPU arrays, we'll get good performance (~2x faster), and if we let the user build `ps` (instead of asking only for the trajectories and building it ourselves), we'll probably reach the desired benchmark.