SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License
274 stars 28 forks source link

`EnsembleGPUKernel `+ Texture Memory Support #224

Open agerlach opened 1 year ago

agerlach commented 1 year ago

@ChrisRackauckas @utkarsh530

Per our conversation here is an example demonstrating how texture memory interpolation could be use.

I would like to be able to leverage CUDA.jl's texture memory support for interpolation of data in the EOM and/or in a callback. A use case could be dropping a ball in a wind field with ground impact termination for a non-flat terrain. Here, one would want to interpolate the wind field as a function of state in the eom as a forcing term and an elevation map as a function of altitude.

Below is an initial prototype. This includes a CPU implementation that leverages DataInterpolations.jl to demonstrate the functionality desired using this data forecast.txt I also included an initial non-working prototype using texture memory.

No interpolation

Working model for CPU and GPU w/o interpolation

using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations

# Ballistic model
function ballistic(u, p, t, weather)
    CdS, mass, g = p
    vel = @view u[4:6]

    wind, ρ = get_weather(weather, u[3])

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

# constant wind and airdensity.
function get_weather(weather, z)
    wind = SVector{3}(1f0, 0f0, 0f0)
    ρ = 1.27f0
    wind, ρ
end

trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

eom_nowind = (u, p, t) -> ballistic(u,p,t,nothing)
prob = ODEProblem{false}(eom_nowind, u0, tspan, p)
sol = solve(prob, Tsit5())

prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(CdS_dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

esol = solve(eprob, Tsit5(); trajectories, saveat)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat)

DataInterpolations.jl CPU Example

This demonstrates the basic capability I would like to replicate in w/ EnsembleGPUKernel using CUDA.CuTexture

# Ballistic w/ Interpolated winds via DataInterpolations.jl
data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

interp = DI.LinearInterpolation(weather_sa,data.altitude)

function get_weather(itp::DI.LinearInterpolation, z)
    weather = itp(u[3])
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

eom_di = (u,p,t) -> ballistic(u,p,t,interp)
prob = ODEProblem{false}(eom_di, u0, tspan, p)
sol = solve(prob, Tsit5())

prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol = solve(eprob, Tsit5(); trajectories, saveat)

GPU Texture Interpolation Validation

Demonstrate usage of CuTexture for interpolation. Note, here I index into the texture memory by broadcasting over a CuArray{Float32} of indices via dst_gpu .= getindex.(Ref(texture), idx_tlu)

weather = map(weather_sa) do w
    (w...,)
end

weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())

## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N  # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu)  # interpolation ℝ->ℝ⁴

begin
    p1 = scatter(data.altitude, data.altitude, ylabel = "Altitude", label = "Raw", ms = 2, marker = :x)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 1), label = "Texture",lw = 2)

    p2 = scatter(data.altitude, data.windx, ylabel = "Wind X", label = "Raw", ms = 2, marker = :x, leg = false)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 2), label = "Texture",lw = 2)

    p3 = scatter(data.altitude, data.windy, ylabel = "Wind Y", label = "Raw", ms = 2, marker = :x, leg = false)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 3), label = "Texture",lw = 2)

    p4 = scatter(data.altitude, data.density, ylabel = "Density", label = "Raw", ms = 2, marker = :x, leg = false)
    plot!(idx*data.altitude[end], getindex.(dst_cpu, 4), label = "Texture",lw = 2)

    plot(p1, p2, p3, p4, link=:x, xlabel = "Altitude")
end

EnsembleGPUKernel + CuTexture prototype

Non-working prototype. Note here the get_weather function is indexing the texture at a single index for a single trajectory which isn't supported by CUDA.jl. Although this is scalar indexing it should actually be occurring for each trajectory in the ensemble.

function get_weather(tex::CuTexture, z, zmax = data.altitude[end], N = length(data.altitude))
    idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
    tex[idx]
end

eom_tex = (u,p,t) -> ballistic(u,p,t,texture)
prob = ODEProblem{false}(eom_tex, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat) #InvalidIR 

ContinousCallback Prototype

The above example only does interpolation in the eom. However, interpolation could also occur in evaluating a callback. e.g. something like this

terrain_texture = CuTexture(...)  # ℝ² -> ℝ, map xy position to ground elevation

condition(u,t,integrator) = u[3] - terrain_texture[@view u[1:2]]. # check height above elevation
cb = ContinuousCallback(condition, terminate!)
ChrisRackauckas commented 1 year ago

This looks great!

agerlach commented 1 year ago

The texture interpolation results should like this

image
utkarsh530 commented 1 year ago

EnsembleGPUKernel + DataInterpolations.jl

As a first attempt to try out interpolation within GPU ODE solves, I got the MWE working with EnsembleGPUKernel + DataInterpolations.jl.


using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations

## CPU Working Example

# Ballistic model

function ballistic(u, p, t, weather)
    CdS, mass, g = p
    vel = @view u[4:6]

    wind, ρ = get_weather(weather, u[3])

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

# Ballistic w/ Interpolated winds via DataInterpolations.jl

data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

weather_sa = SVector{length(weather_sa)}(weather_sa)

interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)

function get_weather(itp::DI.LinearInterpolation, z)
    weather = itp(z)
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

eom_di = (u,p,t) -> ballistic(u,p,t,interp)
prob = ODEProblem{false}(eom_di, u0, tspan, p)
sol = solve(prob, Tsit5())

prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(CdS_dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol = solve(eprob, Tsit5(); trajectories, saveat)

### solving the ODE on GPU + Interpolation using DataInterpolations

function ballistic_gpu(u, p, t)
    CdS, mass, g = p[1]
    interp = p[2]
    vel = @view u[4:6]

    # tt = interp(0.1f0)[1]

    # @cushow tt

    # wind, ρ = get_weather(interp, u[3], zmax, N)
    wind, ρ = get_weather(interp, u[3])

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

prob = ODEProblem(ballistic_gpu, u0, tspan, (p, interp))

prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), interp))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat)

Workaround: Using the default LinearInterpolation object caused some dynamic function invocation. Hence, I changed the interp.u to be a matrix instead of a vector of SVector. Also, I clubbed the interp buffer with parameters as using the lambda function caused some dynamic function invocation.

Attempt to use Texture Memory

I tried to modify the problem definition and investigated where dynamic invocation was happening. I had to change the building of the problem, as mentioned above. I tried to isolate some pieces here:

using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations

trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)

###

data = deserialize("forecast.txt")
N = length(data.altitude)

weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
    SVector{4}(alt, wx, wy, ρ)
end

weather = map(weather_sa) do w
    (w...,)
end

weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())

## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N  # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu)  # interpolation ℝ->ℝ⁴

def_zmax = data.altitude[end]
N = length(data.altitude)
@inline function get_weather(tex, z, zmax, N)
    idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
    weather = tex[idx]
    wind = SVector{3}(weather[2], weather[3], 0f0)
    ρ = weather[4]
    wind, ρ
end

### Experimentation

function kernel(texture, idx)
    tid = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @cushow texture[idx[tid]][1]
    return
end

@cuda kernel(texture, idx_tlu)

function ballistic_t(u, p, t)
    CdS, mass, g = p[1]
    interp = p[2]
    zmax = p[3]
    N = p[4]
    vel = @view u[4:6]

    tt = interp[zmax][1]

    @cushow tt

    wind, ρ = get_weather(interp, u[3], zmax, N)

    airvelocity = vel - wind
    airspeed = norm(airvelocity)
    accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)

    return SVector{6}(vel..., accel...)
end

prob = ODEProblem(ballistic_t, u0, tspan, (p, texture, def_zmax, N))

function kernel_prob(prob)
    tt = prob.f(prob.u0, prob.p, 1.f0)
    return
end

@cuda kernel_prob(prob) # Works

eprob = EnsembleProblem(prob, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, adptive = false, dt = 0.001f0) #InvalidIR

After performing the solve, I get this error:

ERROR: InvalidIRError: compiling kernel #tsit5_kernel(CuDeviceVector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{6, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{true}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to (f::ODEFunction)(args...) in SciMLBase at /home/gridsan/utkarsh/.julia/packages/SciMLBase/QqtZA/src/scimlfunctions.jl:2096)
Stacktrace:
 [1] tsit5_kernel
   @ ~/.julia/dev/DiffEqGPU/src/perform_step/gpu_tsit5_perform_step.jl:81

This seems a bit strange to me; a simple kernel having prob as a parameter just worked. So I'm not sure what's breaking here?. The dispatch to solve essentially is cu(probs), which is roughly mentioned here: https://docs.sciml.ai/DiffEqGPU/stable/tutorials/lower_level_api/. I can isolate more pieces here, but I'll need a different set of eyes to have a look at it. Maybe @maleadt can you look at this and help figure out how to use Texture Memory with EnsembleGPUKernel?

maleadt commented 1 year ago

Inspecting this kernel with Cthulhu (using Cthulthu, @device_code_warntype main(), press h to hide non-problematic code and navigate to the problematic code) reveals that the dynamic IR comes from the following call:

 • %419 = invoke ODEFunction(::SVector{6, Float32},::Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64},::Float32)::Union{}

That immediately shows the problem: your kernels still contain CPU datastructures (CuTexture, CuTextureArray) which first need to be converted to their GPU counterparts (CuDeviceTextire). Normally this conversion happens automatically when passing such objects to a kernel. In the case of structs containing GPU objects you need to define Adapt rules. It seems that ODEProblem already does so, because the kernel_prob invocation does already convert ODEProblem-contained GPU objects:

#kernel_prob#23(ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuDeviceTexture{NTuple{4, Float32}, 1, CUDA.ArrayMemory, true, CUDA.LinearInterpolation}, Float32, Int64}, ODEFunction{false, true, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem})

However, the problematic atsit5 invocation below passes a GPU vector of problems, and the CPU-to-GPU object conversion does not happen automatically for array elements (because it would otherwise require a download from GPU->CPU, perform the conversion there, allocate a new array, upload again; making GPU kernel launches unacceptably expensive):

#atsit5_kernel(CuDeviceVector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{6, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, Nothing, Val{false})

The simplest solution here would be to pass a tuple of ODEProblems, because for tuples we can do the conversion efficiently. I'm not sure where that change would be needed, but @utkarsh530 or @ChrisRackauckas probably know where this comes from.

maleadt commented 1 year ago

Hmm, passing these ODE problems as a tuple isn't going to work because of their size:

using Adapt

# HACK: force a GPU array of ODE problems to be passed as a tuple
Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem}) =
    tuple(adapt.(Ref(to), Array(x))...)

... yields uses too much parameter space (0x22c4 bytes, 0x1100 max). So we can keep the array, but as I mentioned that conversion is expensive:

function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
    # first convert the contained ODE problems
    y = CuArray(adapt.(Ref(to), Array(x)))
    # continue doing what the default method does
    Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end

And with that, the kernel launches and runs 🙂

utkarsh530 commented 1 year ago

The array of problems are built here: https://github.com/SciML/DiffEqGPU.jl/blob/70b0820e84c94e94fd9dcdf5c9b81c72be1e62ea/src/DiffEqGPU.jl#L598-L606, and the cu(probs) dispatch is here: https://github.com/SciML/DiffEqGPU.jl/blob/70b0820e84c94e94fd9dcdf5c9b81c72be1e62ea/src/DiffEqGPU.jl#L690-L696

But yes, I understood your point as to why that wasn't working. We can definitely try to figure out how "expensive" it is, but it works fine currently

utkarsh530 commented 1 year ago

Here's the comparison with interp as DataInterpolations.jl vs CUDA texture memory:

Script ``` using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra import DataInterpolations const DI = DataInterpolations trajectories = 100 u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0] tspan = (0.0f0, 50.0f0) saveat = LinRange(tspan..., 100) p = @SVector [25f0, 225f0, 9.807f0] CdS_dist = Normal(0f0, 1f0) ### data = deserialize("forecast.txt") N = length(data.altitude) weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ SVector{4}(alt, wx, wy, ρ) end weather = map(weather_sa) do w (w...,) end weather_TA = CuTextureArray(weather) texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation()) ## Test Texture interpolation idx = LinRange(0f0, 1f0, 4000) idx_gpu = CuArray(idx) idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx)) dst_gpu .= getindex.(Ref(texture), idx_tlu) # interpolation ℝ->ℝ⁴ def_zmax = data.altitude[end] N = length(data.altitude) @inline function get_weather(tex, z, zmax, N) idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup weather = tex[idx] wind = SVector{3}(weather[2], weather[3], 0f0) ρ = weather[4] wind, ρ end ### Experimentation function ballistic_t(u, p, t) CdS, mass, g = p[1] interp = p[2] zmax = p[3] N = p[4] vel = @view u[4:6] wind, ρ = get_weather(interp, u[3], zmax, N) airvelocity = vel - wind airspeed = norm(airvelocity) accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g) return SVector{6}(vel..., accel...) end prob = ODEProblem(ballistic_t, u0, tspan, (p, texture, def_zmax, N)) using Adapt function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem}) # first convert the contained ODE problems y = CuArray(adapt.(Ref(to), Array(x))) # continue doing what the default method does Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y) end prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), texture, def_zmax, N)) eprob_texture = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false) esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat) ## CPU Working Example # Ballistic model trajectories = 100 u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0] tspan = (0.0f0, 50.0f0) saveat = LinRange(tspan..., 100) p = @SVector [25f0, 225f0, 9.807f0] CdS_dist = Normal(0f0, 1f0) # Ballistic w/ Interpolated winds via DataInterpolations.jl data = deserialize("forecast.txt") N = length(data.altitude) weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ SVector{4}(alt, wx, wy, ρ) end weather_sa = SVector{length(weather_sa)}(weather_sa) interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude) function get_weather(itp::DI.LinearInterpolation, z) weather = itp(z) wind = SVector{3}(weather[2], weather[3], 0f0) ρ = weather[4] wind, ρ end ### solving the ODE on GPU + Interpolation using DataInterpolations function ballistic_gpu(u, p, t) CdS, mass, g = p[1] interp = p[2] vel = @view u[4:6] wind, ρ = get_weather(interp, u[3]) airvelocity = vel - wind airspeed = norm(airvelocity) accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g) return SVector{6}(vel..., accel...) end prob_interp = ODEProblem(ballistic_gpu, u0, tspan, (p, interp)) prob_func = (prob, i, repeat) -> remake(prob_interp, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), interp)) eprob_interp = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false) esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat) ```

Benchmarking:

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 705 samples with 1 evaluation.
 Range (min … max):  5.657 ms … 34.047 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.443 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.090 ms ±  3.021 ms  ┊ GC (mean ± σ):  0.81% ± 3.84%

  ▃▇█▇▄▁
  ███████▆▆▅▆▇▅▅▄▄▄▁▁▁▁▁▁▄▁▁▁▄▁▁▄▁▁▄▁▄▁▁▁▁▁▄▁▄▁▄▄▆▁▁▁▁▁▄▁▁▅▅ ▇
  5.66 ms      Histogram: log(frequency) by time     23.9 ms <

 Memory estimate: 850.06 KiB, allocs estimate: 12149.

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 393 samples with 1 evaluation.
 Range (min … max):   9.966 ms … 48.706 ms  ┊ GC (min … max): 0.00% … 48.42%
 Time  (median):     11.320 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.722 ms ±  5.161 ms  ┊ GC (mean ± σ):  3.32% ±  8.46%

  ▅▇▇█▆
  █████▅▇▅▇▆▄▄▁▁▅▅▄▁▁▄▆▄▅▁▁▁▄▁▄▁▁▄▁▁▁▄▄▄▄▄▁▄▄▄▄▆▄▄▁▁▁▄▄▁▁▁▁▄▅ ▆
  9.97 ms      Histogram: log(frequency) by time      35.7 ms <

 Memory estimate: 4.84 MiB, allocs estimate: 39854.
agerlach commented 1 year ago

@utkarsh530 is that the right script? There is no eprob_texture in the script you included.

utkarsh530 commented 1 year ago

Sorry, I just updated it.

agerlach commented 1 year ago

I am trying to test this on an A100 and I get the following on all the solves.

julia> esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
ERROR: UndefKeywordError: keyword argument dt not assigned
Stacktrace:
 [1] batch_solve_up_kernel(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, DataInterpolations.LinearInterpolation{SMatrix{4, 64, Float32, 256}, LinRange{Float32, Int64}, true, Float32}}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_gpu), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
   @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:382
 [2] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
   @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:345
 [3] macro expansion
   @ ./timing.jl:382 [inlined]
 [4] __solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel; trajectories::Int64, batch_size::Int64, unstable_check::Function, adaptive::Bool, kwargs::Base.Pairs{Symbol, LinRange{Float32, Int64}, Tuple{Symbol}, NamedTuple{(:saveat,), Tuple{LinRange{Float32, Int64}}}})
   @ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:254
 [5] #solve#33
   @ ~/.julia/packages/DiffEqBase/Lq1gG/src/solve.jl:851 [inlined]
 [6] top-level scope
   @ ~/GPUODEBenchmarks/GPU_ODE_SciML/Texture/wind.jl:132
agerlach commented 1 year ago

nvm, up took care of the issue

agerlach commented 1 year ago

A100 results for comparison.

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 849 samples with 1 evaluation.
 Range (min … max):  5.573 ms … 33.261 ms  ┊ GC (min … max): 0.00% … 69.47%
 Time  (median):     5.712 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.887 ms ±  1.619 ms  ┊ GC (mean ± σ):  1.40% ±  4.25%

    ▃█▇▅▁          ▁▁▁                                        
  ▃▆█████▆▃▃▄▃▅▅▅▇▇███▆▄▃▄▃▃▃▂▂▁▁▁▁▂▁▁▁▁▁▁▁▂▁▂▃▂▂▃▂▃▂▂▂▃▂▂▁▂ ▃
  5.57 ms        Histogram: frequency by time        6.53 ms <

 Memory estimate: 848.19 KiB, allocs estimate: 12097.

julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 379 samples with 1 evaluation.
 Range (min … max):  11.605 ms … 47.908 ms  ┊ GC (min … max): 0.00% … 70.28%
 Time  (median):     12.987 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.203 ms ±  4.656 ms  ┊ GC (mean ± σ):  4.67% ±  9.64%

  ▇ █                                                          
  █▇█▄▄▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▅ ▆
  11.6 ms      Histogram: log(frequency) by time      47.1 ms <

 Memory estimate: 4.84 MiB, allocs estimate: 39802.
agerlach commented 1 year ago

@utkarsh530 I am trying to benchmark the Texture memory for different number of trajectories (above only uses 100) trajectories. However, I am noticing when doing GPUTsit5() for large numbers, e.g. 8388608. I have essentially 0% GPU usage according to watch -n 0.2 nvidia-smi with random blips to ~25%, however I essentially have 100% CPU usage all the time. Is it possible that it is spending almost all its time constructing the solution on the CPU? If so, can that be multi-threaded?

utkarsh530 commented 1 year ago

Generally, that's the case with the EnsembleGPUKernel that competes to solve on GPU is faster than converting back to CPU arrays + building SciMLSolution. Can you try to use https://docs.sciml.ai/DiffEqGPU/dev/tutorials/lower_level_api/ ? It does not perform any expensive operations required on the CPU.

agerlach commented 1 year ago
script ```julia using Pkg; cd(@__DIR__); Pkg.activate(".") using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra, Adapt import DataInterpolations const DI = DataInterpolations function ballistic(u, p, t) CdS, mass, g = p[1] interp = p[2] zmax = p[3] N = p[4] vel = @view u[4:6] wind, ρ = get_weather(interp, u[3], zmax, N) airvelocity = vel - wind airspeed = norm(airvelocity) accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g) return SVector{6}(vel..., accel...) end @inline function get_weather(tex, z, zmax, N) idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup weather = tex[idx] wind = SVector{3}(weather[2], weather[3], 0f0) ρ = weather[4] wind, ρ end @inline function get_weather(itp::DI.LinearInterpolation, z, zmax, N) weather = itp(z) wind = SVector{3}(weather[2], weather[3], 0f0) ρ = weather[4] wind, ρ end function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem}) # first convert the contained ODE problems y = CuArray(adapt.(Ref(to), Array(x))) # continue doing what the default method does Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y) end # build interpolants data = deserialize(joinpath(@__DIR__,"data","forecast.txt")) N = length(data.altitude) def_zmax = data.altitude[end] weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ SVector{4}(alt, wx, wy, ρ) end weather_sa = SVector{length(weather_sa)}(weather_sa) interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude) weather = map(weather_sa) do w (w...,) end weather_TA = CuTextureArray(weather) texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation()) ### Simulation parameters trajectories = 10_000 u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0] tspan = (0.0f0, 40.0f0) saveat = LinRange(tspan..., 100) p = @SVector [25f0, 225f0, 9.807f0] p_tx = (p, texture, def_zmax, N) p_di = (p, interp, def_zmax, N) CdS_dist = Normal(0f0, 1f0) prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), prob.p[2:end]...)) ### Texture Solve Test # High Level prob_tx = ODEProblem(ballistic, u0, tspan, p_tx) eprob_tx = EnsembleProblem(prob_tx; prob_func, safetycopy = false) @time esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat) # Low Level @time begin probs_tex = map(1:trajectories) do i prob_func(prob_tx, i, false) end |> cu ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); saveat) end ### DI Solve Test # High Level prob_di = ODEProblem(ballistic, u0, tspan, p_di) eprob_di = EnsembleProblem(prob_di; prob_func, safetycopy = false) @time esol_di = solve(eprob_di, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat) # Low Level @time begin probs_di = map(1:trajectories) do i prob_func(prob_di, i, false) end |> cu ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); saveat) end # trajs = 8*4 .^(0:9) using BenchmarkTools BenchmarkTools.DEFAULT_PARAMETERS.samples = 3 trajs = 8*4 .^(0:8) times = map(trajs[end]) do traj @show traj tx_lo = @benchmark @CUDA.sync begin probs_tex = map(1:$traj) do i prob_func(prob_tx, i, false) end |> cu ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); $saveat) end display(tx_lo) di_lo = @benchmark @CUDA.sync begin probs_di = map(1:$traj) do i prob_func(prob_di, i, false) end |> cu ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); $saveat) end display(di_lo) tx = @benchmark @CUDA.sync esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat) display(tx) di = @benchmark @CUDA.sync esol_gpu = solve(eprob_di , GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat) display(di) dicpu = @benchmark esol_cpu = solve(eprob_di , Tsit5(), EnsembleThreads(); trajectories = $traj, saveat = $saveat) display(dicpu) (tx = minimum(tx.times) / 1e6, di = minimum(di.times) / 1e6, dicpu = minimum(dicpu.times) / 1e6, tx_lo = minimum(tx_lo.times) / 1e6, di_lo = minimum(di_lo.times) / 1e6) end using Plots begin plt = plot(trajs, getindex.(times, :tx), label = "Texture", marker = :utriangle, legend = :topleft, yaxis = :log, xaxis=:log) plot!(trajs, getindex.(times, :di), label = "Software", marker = :ltriangle) plot!(trajs, getindex.(times, :dicpu), label = "CPU", marker = :square) plot!(trajs, getindex.(times, :tx_lo), label = "Texture Low-Level", marker = :dtriangle, legend = :topleft) plot!(trajs, getindex.(times, :di_lo), label = "Software Low-Level", marker = :rtriangle) xlabel!("Trajectories") ylabel!("(ms)") plt end ```
image

The low-level times includes the time to create and copy the probs. CPU times is using EnsembleThreads() on a 12 core machine. GPU is A100 40GB.

For say 524288 trajectories, if I time just the solve with the low level interface, I get

@time ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); saveat);
 14.335337 seconds (51.38 M allocations: 1.992 GiB)

During this low-level solve I get 100% CPU usage on a single core and 0% usage on the GPU until right before the solve completes where it blips to ~30%.

utkarsh530 commented 1 year ago

However, the problematic atsit5 invocation below passes a GPU vector of problems, and the CPU-to-GPU object conversion does not happen automatically for array elements (because it would otherwise require a download from GPU->CPU, perform the conversion there, allocate a new array, upload again; making GPU kernel launches unacceptably expensive):

I am not sure, but the 100% CPU usage could be due to expensive GPU kernel launches requiring multiple uploads to GPU and CPU due to the reason Tim pointed out earlier. Even the scaling of the plot with trajectories seems a bit off, compared to plain ODE solves benchmarks. IMHO, we should only compare the time spent on solving the ODE rather than setup (which is generating GPU arrays of ODEProblem being parallelized before the solve, similar to just generating an array of parameters before and uploading them on GPU. (A common setup in other programs)).

utkarsh530 commented 1 year ago

I think @maleadt might be able to comment better here.

agerlach commented 1 year ago

After reviewing @maleadt's earlier comments, I wonder if makes sense to have a way to pass some parameters that are "global" to all ensembles or as Refs, so that the conversion needs to only happen once.

agerlach commented 1 year ago

So, I've been experimenting with different options for passing the CuTexture w/o incurring the huge conversion overhead and I am at a bit of a loss for what I am observing. Because the texture is constant for all problems in the ensemble, I would think that I should be able to just write an ode in the form eom(u, p, t, texture) and do a closure over the texture. However, I still get a dynamic invocation with CuTexture instead of CuDeviceTexture.

NOTE: I am trying to avoid using the Adapt rule above due to the huge conversion overhead.

Here is a MWE demonstrating the issue

First, create a texture that is just 0 everywhere and verify that it can be used in a closure w/ proper conversions. This works as expected.

using CUDA, DiffEqGPU, OrdinaryDiffEq, StaticArrays, Adapt

data = map(1:5000) do w
    (zeros(Float32, 4)...,)
end

texture = CuTexture( CuTextureArray(data); address_mode = CUDA.ADDRESS_MODE_CLAMP, 
                        normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())

idx_gpu = CuArray(LinRange(0f0, 1f0, 4000))
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx_gpu))

function interp!(dst, idx, tex)
    dst .= getindex.(Ref(tex), idx)
end

cl_let = let tex = texture
    (d,i) -> interp!(d, i, tex)
end
cl_let(dst_gpu, idx_gpu);

Next, lets define a simple ODE that accepts a texture as an argument but does nothing with it and solves over a closure of tex

function eom(u, p, t, tex)
    return @SVector zeros(Float32, 4)
end

trajectories = 8192
u0 = @SVector zeros(Float32, 4)
tspan = (0.0f0, 40.0f0)
saveat = LinRange(tspan..., 100)

prob_func = (prob, i, repeat) -> remake(prob, u0 = @SVector randn(Float32, 4))

cl = let tex = texture
        (x,p,t)->eom(x, p, t, tex)
end

prob = ODEProblem(cl, u0, tspan)
eprob = EnsembleProblem(prob; prob_func, safetycopy = false)
esol = solve(eprob, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)

This solves with no issue. Note: no adapt rule was defined.

Next, lets solve similarly but with an ODE that uses the texture

function eom_tex(u, p, t, tex)
    @inbounds w = tex[0.5] #NTuple of zeros
    return SVector(w...)
end

cl_tex = let tex = texture
    (x,p,t)->eom_tex(x, p, t, tex)
end

prob_tex = ODEProblem(cl_tex, u0, tspan)
eprob_tex = EnsembleProblem(prob_tex; prob_func, safetycopy = false)
esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)

leading to

julia> esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
ERROR: InvalidIRError: compiling kernel #atsit5_kernel(CuDeviceVector{ODEProblem{false,SVector{4, Float32},Tuple{Float32, Float32},…}, 1}, CuDeviceMatrix{SVector{4, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, CuDeviceVector{Float32, 1}, Val{false}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to (f::ODEFunction)(args...) in SciMLBase 

On inspection I see we still have a CuTexture in the function call

 • %2 = invoke eom_tex(::SVector{4, Float32},::SciMLBase.NullParameters,::Float32,::CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}})::Union{}

Why? It seems like the proper conversion occurred in the other examples using a closure over the texture. If I add the adapt rule, then it works. However it spends almost all the time converting. I don't understand why it is needed though if the texture is not an explicit parameter and is used in a closure instead.

maleadt commented 1 year ago

Sorry for the slow response, I needed some time to catch up with my GH notifications 🙂

Why? It seems like the proper conversion occurred in the other examples using a closure over the texture.

That's because in those cases you were invoking a closure directly, and Adapt.jl has rules (Adapt.adapt_structure methods) to recursively convert the captures of a closure. Those rules are then used by CUDA.jl to convert the function (closure) you're invoking, as well as any of its arguments, to a GPU-compatible representation (CuTexture->CuDeviceTexture, etc).

Here, however, a kernel is invoked with an EnsembleProblem argument, which contains an ODEProblem, which contains an ODEFunction, which contains a closure that captures a CuTexture and a CuTextureArray. Although CUDA will use Adapt to try and convert such an argument to a device-compatible representation, there are no Adapt rules defined for these types of objects, so the conversion is a no-op.

So basically, there would need to be Adapt rules for each of these types so that kernel conversion recurses into the objects when the EnsembleProblem is passed to a kernel. Alternatively, the code constructing an EnsembleProblem could manually call cudaconvert on the fields that need to be converted, but depending on how that is done it would still need rules for all but the outermost object type (and also breaks platform portability).

One simple way to add Adapt rules is to use @adapt_structure, but that relies on default constructors being present, and that's not the case for ODEFunction:

julia> Adapt.@adapt_structure EnsembleProblem

julia> Adapt.@adapt_structure ODEProblem

julia> Adapt.@adapt_structure ODEFunction

julia> cudaconvert(eprob_tex)
ERROR: MethodError: no method matching ODEFunction(::var"#14#15"{CuDeviceTexture{NTuple{4, Float32}, 1, CUDA.ArrayMemory, true, CUDA.LinearInterpolation}}, ::LinearAlgebra.UniformScaling{Bool}, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::typeof(SciMLBase.DEFAULT_OBSERVED), ::Nothing, ::Nothing)