torfjelde opened 1 year ago
I think the CUDA extension now works properly; all current CUDA tests pass. The following code runs properly:
```julia
using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux # for `f32` and `gpu`
using NormalizingFlows

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)
CUDA.functional()

# a small planar-layer flow, moved to the GPU
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)

x = rand(rng, q0_g) # good
```
However, there are still issues to fix: drawing multiple samples at once, and sampling from `Bijectors.TransformedDistribution`. Minimal examples are as follows:

- drawing multiple samples at once:
```julia
xs = rand(rng, q0_g, 10) # ambiguous
```
error message:
```
ERROR: MethodError: rand(::CUDA.RNG, ::MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate, Continuous}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:23
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:21
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous}, n::Int64)
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:16

Possible fix, define
  rand(::CUDA.RNG, ::Sampleable{Multivariate, Continuous}, ::Int64)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:42
```
- sample from `Bijectors.TransformedDistribution`:
```julia
y = rand(rng, flow_g) # ambiguous
```
error message:
```
ERROR: MethodError: rand(::CUDA.RNG, ::MultivariateTransformed{MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ComposedFunction{PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, td::MultivariateTransformed)
    @ Bijectors ~/.julia/packages/Bijectors/cvMxj/src/transformed_distribution.jl:160
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous})
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:7

Possible fix, define
  rand(::CUDA.RNG, ::MultivariateTransformed)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:40
```
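Both errors are instances of the same multiple-dispatch problem: one candidate is more specific in the sampleable argument, the other in the RNG argument, and neither dominates. A self-contained sketch with stand-in types (the names `DeviceRNG`, `MvSampler`, and `draw` are hypothetical, not the real CUDA/Distributions/Bijectors types) shows the shape of the fix each error message suggests:

```julia
# Stand-in types: `DeviceRNG` plays the role of CUDA.RNG, `MvSampler`
# the role of a Distributions sampleable. Illustrative only.
abstract type AbstractSampler end
struct DeviceRNG end
struct MvSampler <: AbstractSampler end

# Like the Distributions method: specific in the sampler, generic in the RNG.
draw(rng, s::AbstractSampler, n::Int) = :distributions_method

# Like the extension method: specific in the RNG, generic in the sampler.
draw(rng::DeviceRNG, s, n::Int) = :extension_method

# At this point `draw(DeviceRNG(), MvSampler(), 10)` would raise an ambiguity
# MethodError: neither method is strictly more specific. The fix the error
# suggests is to define the intersection explicitly:
draw(rng::DeviceRNG, s::AbstractSampler, n::Int) = :disambiguated

draw(DeviceRNG(), MvSampler(), 10) # now dispatches to the intersection method
```

The suggested fixes `rand(::CUDA.RNG, ::Sampleable{Multivariate, Continuous}, ::Int64)` and `rand(::CUDA.RNG, ::MultivariateTransformed)` play exactly the role of the intersection method here.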
This is partly because we are overloading methods on types that are not owned by this package. Any thoughts on how to address this, @torfjelde @sunxd3?
I don't have an immediate solution other than the suggested fixes.
It is indeed a bit annoying; maybe we don't dispatch on `rng`?
> It is indeed a bit annoying; maybe we don't dispatch on `rng`?
Yeah, I agree. As a temporary solution, I'm thinking of adding an additional argument to `Distributions.rand`, something like `device`, to indicate whether sampling happens on CPU or GPU. But as a long-term fix, I'm now leaning towards your previous attempts, although this will require resolving some compatibility issues with Bijectors.
Honestly, IMO, the best solution right now is just to add our own `rand` for now to avoid ambiguity errors.

If we want to properly support all of this, we'll have to go down the path of specializing the methods further, i.e. not do a `Union` as we've done now, which will take time and effort.

For now, just make a `NormalizingFlows.rand_device` or something, that just calls `rand` by default, but which we can then overload to our liking without running into ambiguity errors.

How does that sound?
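A minimal sketch of that indirection, assuming the simplest possible shape (the actual `NormalizingFlows.rand_device` signature may differ):

```julia
using Random

# Package-owned entry point that simply forwards to `rand` by default.
# Because this package owns `rand_device`, a CUDA extension can later add
# `rand_device(rng::CUDA.RNG, ...)` methods without touching (pirating)
# `Random.rand` or the Distributions methods, so no ambiguities arise.
rand_device(rng::Random.AbstractRNG, s) = rand(rng, s)
rand_device(rng::Random.AbstractRNG, s, n::Int) = rand(rng, s, n)

# The CPU fallback works with anything `rand` already supports:
x  = rand_device(Random.default_rng(), 1:10)
xs = rand_device(Random.default_rng(), 1:10, 5)
```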
> For now, just make a `NormalizingFlows.rand_device` or something, that just calls `rand` by default, but which we can then overload to our liking without running into ambiguity errors.
Yeah, after thinking about it, I agree that this is probably the best way to go at this point. Working on it now!
I have adopted the `NF.rand_device()` approach. I think we now have a workaround. The following code runs properly:
```julia
using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
import NormalizingFlows as NF

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)
CUDA.functional()

ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)
```
@torfjelde @sunxd3 Let me know if this attempt looks good to you. If so, I'll update the docs.
It seems that overloading methods from an external package inside an extension doesn't work (which is probably for the better), so at the moment the CUDA tests are failing. But if we move the overloads into the main package, they run, so we should probably do that from now on.
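For reference, this is roughly how a package extension is wired up in the main package's Project.toml (illustrative excerpt; the UUID should be checked against CUDA.jl's actual registry entry):

```toml
[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
NormalizingFlowsCUDAExt = "CUDA"
```

The extension module is only loaded once CUDA is present in the environment; defining methods there for functions the parent package itself owns (like `rand_device`) is what avoids the piracy issue above.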