mcabbott opened 1 year ago
FWIW, CUDA.jl intentionally does not do this, because it would imply allocating a device array on every kernel launch, at least when generalizing this to nested `CuArray`s. Currently, we only allow passing a `CuArray`, because we can directly take its pointer and wrap it in a GPU object (a `CuDeviceArray`) without having to allocate a GPU array. If we supported nested arrays, say `CuArray{CuArray}`, we'd first need to download all elements back to the CPU (expensive), convert each element to a `CuDeviceArray` (cheap), allocate a new array (expensive), and upload the converted elements to it (expensive).

This could work fine with `Vector{CuArray}`, because CPU allocations are much cheaper, so I guess we could support adapting those just like we adapt tuples. But the difference in behavior is not ideal.
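The tuple-like treatment mentioned above could be sketched with Adapt.jl's `adapt_structure` extension point. This is a hypothetical sketch of adapting a CPU `Vector` of arrays element-wise, not CUDA.jl's actual implementation:

```julia
using Adapt

# Hypothetical: treat a CPU Vector of arrays as a container rather than
# storage, adapting each element and allocating only a new CPU Vector
# (cheap), mirroring how Adapt.jl already recurses into Tuples.
Adapt.adapt_structure(to, xs::Vector{<:AbstractArray}) =
    map(x -> adapt(to, x), xs)
```

This keeps the outer `Vector` on the CPU, which is why the cost argument above differs from the `CuArray{CuArray}` case.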
I wonder whether `adapt` should treat an Array of Arrays as a container, like a Tuple, rather than as storage to be converted: convert the innermost Array, not the outermost.

I'm not exactly sure what the rule would be; perhaps something like `isbitstype(eltype(x))` is enough?

This came up in https://github.com/JuliaGPU/CUDA.jl/pull/1769, where `CuIterator` at present produces a `Vector{CuArray}`.
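One way the suggested `isbitstype` rule could look as a standalone sketch (the helper name `adapt_or_recurse` is made up for illustration, and is not part of Adapt.jl):

```julia
using Adapt

# Hypothetical rule: an Array whose elements are plain bits is storage
# and gets converted directly; anything else is treated as a container,
# recursing so that only the innermost arrays are converted.
function adapt_or_recurse(to, x::AbstractArray)
    if isbitstype(eltype(x))
        adapt(to, x)                            # storage: convert outright
    else
        map(el -> adapt_or_recurse(to, el), x)  # container: recurse
    end
end
```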