SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

Scope of improvements in EnsembleGPUKernel #171

Closed utkarsh530 closed 1 year ago

utkarsh530 commented 2 years ago

https://github.com/SciML/DiffEqGPU.jl/pull/170 The latest profile of solving with EnsembleGPUKernel raises some questions:

Some overheads in EnsembleGPUKernel with Tsit5 are discussed here as potential improvements.

  1. Converting the solution back to CPU arrays. The reason for this overhead is to give users access to indexing like sol[i].u[j], where i, j are some indices. Without the conversion, such indexing would trigger scalar indexing on ts and us, which are CuArrays.

Possible workaround: leave it to the user to convert to CPU arrays if they need to index the solution.
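A minimal sketch of what that workaround could look like from the user's side (the `vectorized_solve` call and the returned `ts_gpu`/`us_gpu` names are illustrative, not the final API):

```julia
using CUDA, DiffEqGPU

# Hypothetical: the ensemble solve hands back the raw device arrays
# instead of eagerly copying them to the host.
ts_gpu, us_gpu = vectorized_solve(probs, prob, GPUTsit5(); dt = 0.01f0)

# Element-by-element access (sol[i].u[j]) on a CuArray triggers slow
# scalar indexing, so the device-to-host copy is left to the user:
ts = Array(ts_gpu)   # one bulk device-to-host copy
us = Array(us_gpu)

us[1, 2]             # plain CPU indexing, no scalar-indexing penalty
```

The point of the design is that the bulk `Array(...)` copy happens once, and only for users who actually need to index into the solution.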

  2. Ensemble problem creation for parameter parallelism. The probs creation within DiffEqGPU seems to be necessary, but maybe it could be pulled out of DiffEqGPU? Currently it is done this way to adhere to how DiffEqGPU handles things. This did not show up in the previous benchmarks because ps was built separately and passed to vectorized_solve.

Possible workaround: have the user create ps or u0s and pass them into DiffEqGPU instead of only specifying the trajectories, with the library handling the rest.
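On the user side, that workaround might look like the following sketch (the `ps` keyword is an assumption about a possible interface, not an existing one):

```julia
using DiffEqGPU, StaticArrays

# Build the per-trajectory parameter sets up front, instead of having
# the library expand `trajectories` into probs internally:
ps = [SA[10.0f0, 28.0f0, rand(Float32)] for _ in 1:1024]

monteprob = EnsembleProblem(prob; safetycopy = false)

# Hypothetical keyword: the library would skip rebuilding probs and
# consume the user-supplied parameter sets directly.
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(); ps = ps)
```

This keeps the prob-expansion cost out of the hot path, matching how the earlier benchmarks passed a pre-built ps straight to vectorized_solve.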

If we don’t convert to CPU arrays, we get good performance (~2x faster); if we also let the user build ps (instead of asking for trajectories and building it ourselves), we’ll probably reach the desired benchmark.

ChrisRackauckas commented 2 years ago

Possible workaround: leave it to the user to convert to CPU arrays if they need to index the solution.

We can make that be an option (with a val type). But we can also

The probs creation within DiffEqGPU seems to be necessary, but maybe it could be pulled out of DiffEqGPU? Currently it is done this way to adhere to how DiffEqGPU handles things. This did not show up in the previous benchmarks because ps was built separately and passed to vectorized_solve.

I think for that, we can have a documented lower level API for people who really want to pull as much speed out as possible. On that note, we should make some real docs.

utkarsh530 commented 2 years ago

Sounds good to me. I will start writing some documentation for it. I can help set up a docs page for it, something aligned with SciMLDocs.