bjarthur closed this 2 months ago
Tests pass locally, but I have not battle-tested this yet.
Ready for review. Tests really do pass locally now, and it works well in my application.
Superficially LGTM, but I don't have the time for a thorough review right now.
This should probably also integrate with the reclaim_hooks so the caches get wiped when memory runs out (both here and in the other fat handles).
EDIT: let's move this to a separate issue.
I'm wondering whether CUDA.jl consistently reuses the same handle. I know we can have up to 32 handles, but for efficiency we should reuse the one that stored the buffer from the previous factorization; otherwise we'll end up accumulating a lot of unnecessary workspaces.
We do, as long as you're using the same task, which I assume you are. In that case, calling handle() will always return the same object, and only a single handle will be cached.
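The per-task caching described above is easy to check interactively. A hedged sketch, assuming CUDA.jl's task-local `CUSOLVER.dense_handle()` lookup (which is what the dense solver wrappers use); the exact accessor name may differ across CUDA.jl versions:

```julia
using CUDA

# Within a single task, the CUSOLVER handle is cached in task-local
# storage, so repeated lookups return the identical object -- and with
# it, any workspace buffer that was stashed on that handle by a
# previous factorization.
h1 = CUSOLVER.dense_handle()
h2 = CUSOLVER.dense_handle()
@assert h1 === h2  # same task, same cached handle

# A different task performs its own lookup, so it may be served a
# different handle from the pool (hence the 32-handle observation).
```

So as long as successive factorizations run on the same task, the workspace is reused rather than duplicated.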
Forgot about this; rebased to give it another CI run.
Riffing off of https://github.com/JuliaGPU/CUDA.jl/pull/2279 for getrf, getrs, sytrf, sytrs, and friends. Much cleaner API than https://github.com/JuliaGPU/CUDA.jl/pull/2464.