FluxML / Flux.jl


[BUG] conv_im2col! Scalar Indexing in CUDA (`conv_im2col`) #2135

Open · gortibaldik opened this issue 1 year ago

gortibaldik commented 1 year ago

I run into scalar-indexing errors when training a neural network transferred to the GPU.

Julia Version

```julia
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores
```

Flux Version

```julia
(PlutoNotebooks) pkg> st -m Flux
Status `~/Documents/Skola/julia/PlutoNotebooks/Manifest.toml`
  [587475ba] Flux v0.13.9
```

As I do not know how to create a more minimal reproducer, I will show how the error occurs on the MNIST dataset:

Minimal Bug Example

I create a network as follows (the Dense layer's input size is 11664 = 27 × 27 × 16, from a 2×2 convolution with 16 output channels applied to 28×28 MNIST images):

```julia
function define_net()
    net = Chain(
        Conv((2, 2), 1 => 16, relu),
        flatten,
        Dense(11664, size(y_train, 1)),
        softmax,
    )
end
```


Then I use the standard MNIST dataset:
```julia
using MLDatasets
X_train_old, y_train_old = MLDatasets.MNIST(T, :train)[:]
```

I reshape it into size (w, h, c, N) and normalize:

```julia
using Flux: onehotbatch, onecold

function reshape_data(data::AbstractArray{<:Real, 3})
    # reshapes the data such that it has one channel
    s = size(data)
    reshape(data, s[1], s[2], 1, s[3])
end

reshape_data(data::AbstractArray{<:Real, 4}) = data

function load_data(dataset; T=Float32, onehot=false, classes=0:9)
    X_train, y_train = dataset(T, :train)[:]
    X_test, y_test = dataset(T, :test)[:]

    X_train = reshape_data(X_train)
    X_test = reshape_data(X_test)

    if onehot
        y_train = onehotbatch(y_train, classes)
        y_test = onehotbatch(y_test, classes)
    end

    X_train, y_train, X_test, y_test
end
```
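
For completeness, the data above is produced by a call along these lines (a sketch; the exact invocation in my notebook may differ, but the targets are one-hot encoded, matching the `OneHotMatrix` in the stack trace below):

```julia
# rough sketch of the loading call (assumed, not copied verbatim from the notebook)
X_train, y_train, X_test, y_test = load_data(MLDatasets.MNIST; T=Float32, onehot=true)
```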

I use this function for training:

```julia
using Flux.Data: DataLoader
using BSON

function train_model!(net, Loss, X, y;
    opt = Descent(0.1),
    batchsize=128,
    n_epochs=10,
    file_name=""
)
    batches = DataLoader((X, y); batchsize, shuffle=true)

    for current_epoch in 1:n_epochs
        Flux.train!(Loss, params(net), batches, opt)
    end

    # save the model
    !isempty(file_name) && BSON.bson(file_name, net=net)
end
```

And here is the invocation:

```julia
using Flux: crossentropy, params
net = define_net()
Loss(X, y) = crossentropy(net(X), y)

train_model!(net, Loss, X_train, y_train; n_epochs=5, file_name="MNIST_simple.bson")
accuracy(X_test, y_test, net, 0:9)
```
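
(`accuracy` is a small helper that is not shown here; it is roughly the following, assuming one-hot encoded targets:)

```julia
using Statistics: mean

# rough sketch of the helper used above (assumed definition; onecold is imported from Flux earlier)
accuracy(X, y, net, classes) = mean(onecold(net(X), classes) .== onecold(y, classes))
```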

I have absolutely no problems running this code when I do not use the GPU.

So let's now try to use the GPU:

```julia
net = define_net()
Loss(X, y) = crossentropy(net(X), y)

# send the model to gpu
gpu_net = net |> gpu
gpu_X_train, gpu_X_test = X_train |> gpu, X_test |> gpu
gpu_y_train, gpu_y_test = y_train |> gpu, y_test |> gpu

train_model!(gpu_net, Loss, gpu_X_train, gpu_y_train; n_epochs=5, file_name="MNIST_simple.bson")
```

Output:

```
TaskFailedException

nested task error: TaskFailedException

Stacktrace:

[1] wait

@ ./task.jl:345 [inlined]

[2] threading_run(fun::NNlib.var"#943#threadsfor_fun#533"{NNlib.var"#943#threadsfor_fun#532#534"{CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, Float32, Float32, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, NNlib.DenseConvDims{3, 3, 3, 6, 3}, Int64, Int64, Int64, UnitRange{Int64}}}, static::Bool)

@ Base.Threads ./threadingconstructs.jl:38

[3] macro expansion

@ ./threadingconstructs.jl:89 [inlined]

[4] #conv_im2col!#531

@ ~/.julia/packages/NNlib/c0XLe/src/impl/conv_im2col.jl:47 [inlined]

[5] conv_im2col!(y::SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, x::SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, w::SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, cdims::NNlib.DenseConvDims{3, 3, 3, 6, 3})

@ NNlib ~/.julia/packages/NNlib/c0XLe/src/impl/conv_im2col.jl:23

[6] (::NNlib.var"#262#266"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, NNlib.DenseConvDims{3, 3, 3, 6, 3}, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}})()

@ NNlib ./threadingconstructs.jl:258

nested task error: Scalar indexing is disallowed.

Invocation of getindex resulted in scalar indexing of a GPU array.

This is typically caused by calling an iterating implementation of a method.

Such implementations *do not* execute on the GPU, but very slowly on the CPU,

and therefore are only permitted from the REPL for prototyping purposes.

If you did intend to index this array, annotate the caller with @allowscalar.

Stacktrace:

[1] error(s::String)

@ Base ./error.jl:35

[2] assertscalar(op::String)

@ GPUArraysCore ~/.julia/packages/GPUArraysCore/lojQM/src/GPUArraysCore.jl:87

[3] getindex(::CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::Int64, ::Int64, ::Vararg{Int64})

@ GPUArrays ~/.julia/packages/GPUArrays/fqD8z/src/host/indexing.jl:9

[4] getindex

@ ./subarray.jl:282 [inlined]

[5] im2col!(col::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, x::SubArray{Float32, 4, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Int64}, true}, cdims::NNlib.DenseConvDims{3, 3, 3, 6, 3})

@ NNlib ~/.julia/packages/NNlib/c0XLe/src/impl/conv_im2col.jl:228

[6] macro expansion

@ ~/.julia/packages/NNlib/c0XLe/src/impl/conv_im2col.jl:51 [inlined]

[7] (::NNlib.var"#943#threadsfor_fun#533"{NNlib.var"#943#threadsfor_fun#532#534"{CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, Float32, Float32, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, NNlib.DenseConvDims{3, 3, 3, 6, 3}, Int64, Int64, Int64, UnitRange{Int64}}})(tid::Int64; onethread::Bool)

@ NNlib ./threadingconstructs.jl:84

[8] #943#threadsfor_fun

@ ./threadingconstructs.jl:51 [inlined]

[9] (::Base.Threads.var"#1#2"{NNlib.var"#943#threadsfor_fun#533"{NNlib.var"#943#threadsfor_fun#532#534"{CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, Float32, Float32, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, NNlib.DenseConvDims{3, 3, 3, 6, 3}, Int64, Int64, Int64, UnitRange{Int64}}}, Int64})()

@ Base.Threads ./threadingconstructs.jl:30

    sync_end(::Channel{Any})@task.jl:436
    macro expansion@task.jl:455[inlined]
    var"#conv!#258"(::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(NNlib.conv!), ::CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, ::Array{Float32, 5}, ::NNlib.DenseConvDims{3, 3, 3, 6, 3})@conv.jl:195
    conv!@conv.jl:182[inlined]
    var"#conv!#221"(::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(NNlib.conv!), ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Array{Float32, 4}, ::NNlib.DenseConvDims{2, 2, 2, 4, 2})@conv.jl:145
    conv!@conv.jl:140[inlined]
    var"#conv#196"(::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(NNlib.conv), ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Array{Float32, 4}, ::NNlib.DenseConvDims{2, 2, 2, 4, 2})@conv.jl:88
    conv@conv.jl:83[inlined]
    #rrule#312@conv.jl:313[inlined]
    rrule@conv.jl:303[inlined]
    rrule@rules.jl:134[inlined]
    chain_rrule@chainrules.jl:218[inlined]
    macro expansion@interface2.jl:0[inlined]
    _pullback@interface2.jl:9[inlined]
    _pullback@conv.jl:200[inlined]
    _pullback(::Zygote.Context{true}, ::Flux.Conv{2, 4, typeof(NNlib.relu), Array{Float32, 4}, Vector{Float32}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})@interface2.jl:0
    macro expansion@basic.jl:53[inlined]
    _pullback@basic.jl:53[inlined]
    _pullback(::Zygote.Context{true}, ::typeof(Flux._applychain), ::Tuple{Flux.Conv{2, 4, typeof(NNlib.relu), Array{Float32, 4}, Vector{Float32}}, Flux.MaxPool{2, 4}, Flux.Conv{2, 4, typeof(NNlib.relu), Array{Float32, 4}, Vector{Float32}}, Flux.MaxPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, typeof(NNlib.softmax)}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})@interface2.jl:0
    _pullback@basic.jl:51[inlined]
    _pullback(::Zygote.Context{true}, ::Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(NNlib.relu), Array{Float32, 4}, Vector{Float32}}, Flux.MaxPool{2, 4}, Flux.Conv{2, 4, typeof(NNlib.relu), Array{Float32, 4}, Vector{Float32}}, Flux.MaxPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, typeof(NNlib.softmax)}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})@interface2.jl:0
    _pullback@[Local: 3](http://localhost:1234/edit?id=e1f7d328-7c73-11ed-376f-95c3581ab814#)[inlined]
    _pullback(::Zygote.Context{true}, ::typeof(Main.var"workspace#14".Loss), ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::OneHotArrays.OneHotMatrix{UInt32, CUDA.CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}})@interface2.jl:0
    _apply@boot.jl:816[inlined]
    adjoint@lib.jl:203[inlined]
    _pullback@adjoint.jl:65[inlined]
    _pullback@train.jl:143[inlined]
    _pullback(::Zygote.Context{true}, ::Flux.Optimise.var"#37#40"{typeof(Main.var"workspace#14".Loss), Tuple{CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, OneHotArrays.OneHotMatrix{UInt32, CUDA.CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}}})@interface2.jl:0
    pullback(::Function, ::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})@interface.jl:384
    withgradient(::Function, ::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})@interface.jl:132
    macro expansion@train.jl:142[inlined]
    macro expansion@ProgressLogging.jl:328[inlined]
    var"#train!#36"(::Flux.Optimise.var"#38#41", ::typeof(Flux.Optimise.train!), ::Function, ::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, ::MLUtils.DataLoader{Tuple{CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, OneHotArrays.OneHotMatrix{UInt32, CUDA.CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}}, Random._GLOBAL_RNG, Val{nothing}}, ::Flux.Optimise.Descent)@train.jl:140
    train!@train.jl:136[inlined]
    var"#train_model!#1"(::Flux.Optimise.Descent, ::Int64, ::Int64, ::String, ::typeof(Main.var"workspace#5".train_model!), ::Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(NNlib.relu), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.MaxPool{2, 4}, Flux.Conv{2, 4, typeof(NNlib.relu), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.MaxPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, typeof(NNlib.softmax)}}, ::Function, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::OneHotArrays.OneHotMatrix{UInt32, CUDA.CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}})@[Other: 14](http://localhost:1234/edit?id=e1f7d328-7c73-11ed-376f-95c3581ab814#)
    top-level scope@[Local: 10](http://localhost:1234/edit?id=e1f7d328-7c73-11ed-376f-95c3581ab814#)
```

gortibaldik commented 1 year ago

Originally I pasted the wrong training code, which did not use the gpu_-prefixed data and network for training. The issue now contains the right code.

mcabbott commented 1 year ago

The error happens because `conv` is getting a mix of `CuArray` and `Array` inputs:

```
[5] conv_im2col!(
y::SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, 
x::SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, 
w::SubArray{Float32, 5, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, 
cdims::NNlib.DenseConvDims{3, 3, 3, 6, 3}
)
```

And the reason for that is that `Loss(X, y) = crossentropy(net(X), y)` closes over `net`, not `gpu_net`.
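
One way to make the implicit style consistent is to build the loss around the GPU model, so that the closure and `params(gpu_net)` refer to the same arrays. A sketch (untested):

```julia
# hypothetical fix: the loss closure must capture the same model whose
# parameters are being trained, i.e. the GPU copy, not the CPU original
gpu_loss(X, y) = crossentropy(gpu_net(X), y)

train_model!(gpu_net, gpu_loss, gpu_X_train, gpu_y_train;
             n_epochs=5, file_name="MNIST_simple.bson")
```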

The fact that `train!` expects two different references to the model's parameters (via the loss and via `params`) is a weird feature of this "implicit" interface. We're trying to kill it... You have Flux v0.13.9, which already supports the new way; https://fluxml.ai/Flux.jl/previews/PR2114/training/training/ is roughly the upgrade guide.
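
To make that concrete, each step of the implicit-style `train!` does roughly this (a sketch, not the exact source):

```julia
ps = params(gpu_net)                 # gradients are taken w.r.t. these arrays...
for (x, y) in batches
    # ...but the forward pass runs whatever model the loss closure captured,
    # which in your code is the CPU `net`, hence the CuArray/Array mix in conv
    gs = Flux.gradient(() -> Loss(x, y), ps)
    Flux.Optimise.update!(opt, ps, gs)
end
```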

gortibaldik commented 1 year ago

That guide is a bit too condensed for me, and I'm not sure I caught the main gist.

Now I understand that under the "implicit style" I should also define `gpu_loss(X, y) = crossentropy(gpu_net(X), y)`.

What is recommended under the "explicit style"? Do I understand correctly that the `train_model!` function in my code should be rewritten in this way?

```julia
function train_model!(net, X, y;
    loss=crossentropy,
    opt = Descent(0.1),
    batchsize=128,
    n_epochs=10,
    file_name=""
)
    batches = DataLoader((X, y); batchsize, shuffle=true)
    opt_state = Flux.setup(opt, net)
    for current_epoch in 1:n_epochs
        Flux.train!(net, batches, opt_state) do m, x, y
            loss(m(x), y)
        end
    end

    # save the model
    !isempty(file_name) && BSON.bson(file_name, net=net)
end
```
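
I would then call it roughly like this (untested sketch, reusing the arrays and `define_net` from above):

```julia
# hypothetical invocation of the rewritten (explicit-style) trainer
gpu_net = define_net() |> gpu
train_model!(gpu_net, gpu_X_train, gpu_y_train;
             loss=crossentropy, n_epochs=5, file_name="MNIST_simple.bson")
```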

Is there any other best practice to which I do not adhere? Thank you :+1:

mcabbott commented 1 year ago

Yes, that looks right to me. But I didn't run it, so I hope it works!