JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Broadcasting and reshaped, transposed, CuArrays #228

Open mcabbott opened 4 years ago

mcabbott commented 4 years ago

Some broadcasting operations work fine on objects like reshape(transpose(cu(...))), but some fail:

using CUDA
CUDA.allowscalar(false)
mat = cu(rand(2,2))

reshape(transpose(mat),2,1,2) .+ mat # ok

exp.(reshape(transpose(mat), 2,1,2)) # ERROR: scalar getindex is disallowed
CUDA.exp.(reshape(transpose(mat), 2,1,2)) # FATAL ERROR: Symbol "__nv_expf" not found

Something like this appears to be the cause of https://github.com/mcabbott/TensorCast.jl/issues/25, where the failure happens only when broadcasting two such objects; broadcasting just one is fine:

C = cu(ones(10,2))
L = cu(ones(10,3))

reshape(C,1,2,10) .+ reshape(L', 3,1,10) # ok
reshape(C',1,2,10) .+ reshape(L, 3,1,10) # ok
reshape(C',1,2,10) .+ reshape(L', 3,1,10) # ERROR: scalar getindex is disallowed

This is with CUDA v0.1.0; I get the same errors on CuArrays v2.2.1, and on CuArrays v1.7.2 the only change is this:

julia> CuArrays.CUDAnative.exp.(reshape(transpose(mat), 2,1,2))
ERROR: LLVM error: Program used external function '___nv_expf' which could not be resolved!
maleadt commented 4 years ago

Dup of https://github.com/JuliaGPU/Adapt.jl/issues/21. With how Adapt.jl currently works, it's much too complicated/expensive to extend our support for array wrappers to doubly-wrapped arrays (here, ReshapedArray of Transpose of CuArray).
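
For reference, the double wrapping is easy to see from the types (an illustration, not from the thread; the exact type parameters will differ between versions):

using CUDA, LinearAlgebra
mat = cu(rand(2,2))
typeof(transpose(mat))                  # Transpose wrapping a CuArray: a single wrapper, which is supported
typeof(reshape(transpose(mat), 2,1,2))  # Base.ReshapedArray wrapping a Transpose wrapping a CuArray: two wrappers deep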

mcabbott commented 4 years ago

OK, sorry I didn't search there.

Is there a workaround? Like some mechanism to tell it to keep unwrapping, even if this can't be triggered automatically?

I think what's happening in my second example is that one singly-wrapped CuArray pushes the whole broadcast to be done by CUDA, and then it's happy... is there or should there be a way to make this happen?

zed = similar(mat,1) .= 0
zed .+ CUDA.exp.(reshape(transpose(mat), 2,1,2)) # ok!
maleadt commented 4 years ago

Is there a workaround? Like some mechanism to tell it to keep unwrapping, even if this can't be triggered automatically?

Unwrapping would be materializing the wrapper using collect, and that's probably too expensive?

Anyway, you've demonstrated a workaround yourself: make sure there's an actual CuArray (or a singly-wrapped one) participating in the broadcast. The resulting broadcast style will then be the GPU one. When you do exp.(reshape(transpose(mat), 2,1,2)), there's nothing with a GPU array style in there, so it falls back to the default indexing style. Changing exp to CUDAnative.exp doesn't fundamentally change the broadcast; it just tries to apply that GPU function to a broadcast that has already been determined to execute on the CPU (and hence crashes).
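
To make that concrete, here is a small check (not from the thread; the exact style names depend on the CUDA.jl/GPUArrays version) of which broadcast style Julia infers at each level of wrapping:

using CUDA, LinearAlgebra
mat = cu(rand(2,2))
Base.BroadcastStyle(typeof(mat))                             # GPU style (CuArrayStyle): broadcast runs on the GPU
Base.BroadcastStyle(typeof(transpose(mat)))                  # still a GPU style: a single wrapper is recognized
Base.BroadcastStyle(typeof(reshape(transpose(mat), 2,1,2)))  # DefaultArrayStyle: falls back to CPU indexing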

An alternative workaround is to override the broadcast style:

julia> C = cu(ones(10,2));

julia> L = cu(ones(10,3));

julia> Meta.@lower reshape(C',1,2,10) .+ reshape(L', 3,1,10)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = var"'"(C)
│   %2 = reshape(%1, 1, 2, 10)
│   %3 = var"'"(L)
│   %4 = reshape(%3, 3, 1, 10)
│   %5 = Base.broadcasted(+, %2, %4)
│   %6 = Base.materialize(%5)
└──      return %6

julia> a = reshape(C',1,2,10);

julia> b = reshape(L', 3,1,10);

julia> bc = Base.broadcasted(+, a, b);

julia> CUDA.allowscalar(false)

julia> Broadcast.materialize(bc)
ERROR: scalar getindex is disallowed
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] assertscalar(::String) at /home/tim/Julia/pkg/GPUArrays/src/host/indexing.jl:41
 [3] getindex at /home/tim/Julia/pkg/GPUArrays/src/host/indexing.jl:96 [inlined]
 [4] _getindex at ./abstractarray.jl:1066 [inlined]
 [5] getindex at ./abstractarray.jl:1043 [inlined]
 [6] getindex at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/adjtrans.jl:190 [inlined]
 [7] _unsafe_getindex_rs at ./reshapedarray.jl:249 [inlined]
 [8] _unsafe_getindex at ./reshapedarray.jl:246 [inlined]
 [9] getindex at ./reshapedarray.jl:234 [inlined]
 [10] _getindex at ./abstractarray.jl:1083 [inlined]
 [11] getindex at ./abstractarray.jl:1043 [inlined]
 [12] _broadcast_getindex at ./broadcast.jl:614 [inlined]
 [13] _getindex at ./broadcast.jl:644 [inlined]
 [14] _broadcast_getindex at ./broadcast.jl:620 [inlined]
 [15] getindex at ./broadcast.jl:575 [inlined]
 [16] macro expansion at ./broadcast.jl:932 [inlined]
 [17] macro expansion at ./simdloop.jl:77 [inlined]
 [18] copyto! at ./broadcast.jl:931 [inlined]
 [19] copyto! at ./broadcast.jl:886 [inlined]
 [20] copy at ./broadcast.jl:862 [inlined]
 [21] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{3},Nothing,typeof(+),Tuple{Base.ReshapedArray{Float32,3,LinearAlgebra.Adjoint{Float32,CuArray{Float32,2,Nothing}},Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}},Base.ReshapedArray{Float32,3,LinearAlgebra.Adjoint{Float32,CuArray{Float32,2,Nothing}},Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}}}) at ./broadcast.jl:837
 [22] top-level scope at REPL[51]:1

julia> bc = Base.broadcasted(CUDA.CuArrayStyle{2}(), +, a, b);

julia> Broadcast.materialize(bc);

Maybe with some tooling this could be made practical for you (Broadcast.@withstyle C reshape(C',1,2,10) .+ reshape(L', 3,1,10)?)
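
As a rough illustration of that idea, here is a hypothetical helper packaging the trick from the session above (the gpu_broadcast name and the way the style's dimensionality is picked are assumptions for illustration, not an existing CUDA.jl API):

using CUDA, LinearAlgebra

# Build the Broadcasted object with an explicit GPU style, then materialize it,
# so the wrapper types no longer decide where the broadcast executes.
function gpu_broadcast(f, args...)
    style = CUDA.CuArrayStyle{maximum(ndims, args)}()
    bc = Broadcast.broadcasted(style, f, args...)
    return Broadcast.materialize(bc)
end

C = cu(ones(10,2)); L = cu(ones(10,3))
gpu_broadcast(+, reshape(C', 1,2,10), reshape(L', 3,1,10))  # materializes on the GPU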

mcabbott commented 4 years ago

Thanks for having a look; I haven't got back to this yet.

Re tooling, I was wondering whether cu shouldn't be overloaded so that cu.(arrays...) changes the broadcast. Something similar was done by LazyArrays, with lazy.(x .+ y) delaying materialization. (It started roughly here: https://github.com/JuliaLang/julia/issues/19198#issuecomment-457967851 )

But I also wonder why broadcasting can't see this itself. It is easy to recursively call parent on an array until you get something that is === itself; this extracts the underlying storage type. I thought there was some similar recursion in broadcasting styles, depending on the style of the parent, although I'm not sure. (I meant to dig a bit, but haven't yet.)
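
A rough sketch of that recursion (illustrative only; the unwrap name is made up here):

using CUDA, LinearAlgebra

# Strip wrappers until parent(x) === x, exposing the underlying storage type.
unwrap(x) = parent(x) === x ? x : unwrap(parent(x))

mat = cu(rand(2,2))
unwrap(reshape(transpose(mat), 2,1,2)) isa CuArray  # true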

maleadt commented 4 years ago

Re tooling, I was wondering whether cu shouldn't be overloaded so that cu.(arrays...) changes the broadcast. Something similar was done by LazyArrays, with lazy.(x .+ y) delaying materialization. (It started roughly here: JuliaLang/julia#19198 (comment) )

That's an interesting thought! I'm wary of overloading cu though; I have mainly been stripping it of its features to make code somewhat more explicit (it used to perform many more conversions, you could do stuff like cu[1,2], etc.), but for broadcast it might make sense.

mcabbott commented 4 years ago

OK, right, I don't know if cu is perfect, but it might be neat.

For the original issue here, https://github.com/mcabbott/TensorCast.jl/issues/25, the ideal logic would be something that digs through wrappers, and returns an Array or a CuArray depending on what it sees.

But the other use is broadcasting over things like ranges or CartesianIndices, in which case it cannot guess; you will have to specify whether you want a CuArray.
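
For example (illustrative only): with a plain range there is no wrapped CuArray anywhere in the broadcast, so no amount of unwrapping can tell whether a GPU result is wanted, and the caller has to opt in explicitly:

using CUDA
exp.(1:10)                   # stays a plain CPU Vector{Float64}
exp.(cu(collect(1.0:10.0)))  # explicit CuArray input, so the result is a CuArray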