JuliaParallel / Dagger.jl

A framework for out-of-core and parallel execution

Matrix Transposition indexing in DArray in CUDA #543

Open TheFibonacciEffect opened 1 month ago

TheFibonacciEffect commented 1 month ago

I was following along with https://github.com/jpsamaroo/DaggerWorkshop2024 and noticed that matrix transposition does not seem to work on NVIDIA GPUs for me.

Sorry if this is a bit brief; please ask if something is missing.

julia> scope
UnionScope:
  ExactScope: processor == CuArrayDeviceProc(worker 1, device 0, uuid b8c8a4da-6ec1-2a9e-fda1-1e5e12ba47f1)

Dagger.with_options(;scope) do
           # Allocated directly on the GPU
           DA = rand(AutoBlocks(), Float32, 64, 64)

           # Broadcast is GPU-compatible
           DB = DA .* 3f0

           # Matmul is no problem!
           DC = DB * DB'

           # Finally, any map-reduce algorithm is easy enough
           # sum(DC; dims=1)
       end

The error is:

ERROR: DTaskFailedException:
  Root Exception Type: ErrorException
  Root Exception:
Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:

Julia Version 1.11.0-rc1
Commit 3a35aec36d1 (2024-06-25 10:23 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 20 default, 0 interactive, 10 GC (on 20 virtual cores)
Environment:
  JULIA_DEBUG =
⌃ [052768ef] CUDA v5.4.2
  [d58978e5] Dagger v0.18.12 `https://github.com/JuliaParallel/Dagger.jl.git#jps/workshop-2024`
  [68e73e28] DaggerGPU v0.2.0 `https://github.com/JuliaGPU/DaggerGPU.jl.git#master`

PS: Thanks for the talk and enjoy the conference @jpsamaroo

jpsamaroo commented 1 month ago

Hi @TheFibonacciEffect! Thanks for the feedback - can you provide the full stacktrace that you get?

TheFibonacciEffect commented 1 month ago

Sure! Here is the program again:

using Dagger
# All GPU users - run this!
using DaggerGPU

# Annoying, but we need to restart the scheduler for the below changes to take effect...
# Will be fixed in future versions of Dagger!
Dagger.cancel!(;halt_sch=true)

# And we'll setup some defaults, just in case you don't have a GPU, but want to run the examples
GPUArray = Array
scope = Dagger.scope(;worker=1, threads=:)
# NVIDIA GPU users - run this!
using CUDA

# Make sure that we have at least one GPU
@assert length(CUDA.devices()) > 0 "You don't have any NVIDIA GPUs!"

# Pick the first available GPU
GPUArray = CuArray
scope = Dagger.scope(;cuda_gpu=1)

And here is the full stacktrace:

ERROR: DTaskFailedException:
  Root Exception Type: ErrorException
  Root Exception:
Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] errorscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:155
  [3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:128
  [4] assertscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:116
  [5] getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:48 [inlined]
  [6] scalar_getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:34 [inlined]
  [7] _getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:17 [inlined]
  [8] getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:15 [inlined]
  [9] getindex
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/adjtrans.jl:334 [inlined]
 [10] getindex
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/triangular.jl:265 [inlined]
 [11] _getindex
    @ ./abstractarray.jl:1361 [inlined]
 [12] getindex
    @ ./abstractarray.jl:1315 [inlined]
 [13] iterate
    @ ./abstractarray.jl:1212 [inlined]
 [14] iterate
    @ ./abstractarray.jl:1210 [inlined]
 [15] copyto_unaliased!(deststyle::IndexLinear, dest::CuArray{…}, srcstyle::IndexCartesian, src::LinearAlgebra.LowerTriangular{…})
    @ Base ./abstractarray.jl:1086
 [16] copyto!
    @ ./abstractarray.jl:1061 [inlined]
 [17] +(A::LinearAlgebra.LowerTriangular{Float32, LinearAlgebra.Adjoint{…}}, B::LinearAlgebra.UpperTriangular{Float32, CuArray{…}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/triangular.jl:747
 [18] copydiagtile!(A::CuArray{Float32, 2, CUDA.DeviceMemory}, uplo::Char)
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:403
 [19] #invokelatest#2
    @ ./essentials.jl:1043 [inlined]
 [20] invokelatest
    @ ./essentials.jl:1040 [inlined]
 [21] (::CUDAExt.var"#26#27"{@Kwargs{}, CUDAExt.CuArrayDeviceProc, typeof(Dagger.copydiagtile!), Tuple{…}, @NamedTuple{…}})()
    @ CUDAExt ~/.julia/packages/DaggerGPU/Kt3Ax/ext/CUDAExt.jl:275
Stacktrace:
  [1] wait(t::Task)
    @ Base ./task.jl:370
  [2] fetch
    @ ./task.jl:390 [inlined]
  [3] execute!(::CUDAExt.CuArrayDeviceProc, ::Any, ::Any, ::Vararg{Any}; kwargs...)
    @ CUDAExt ~/.julia/packages/DaggerGPU/Kt3Ax/ext/CUDAExt.jl:281
  [4] execute!(::CUDAExt.CuArrayDeviceProc, ::Any, ::Any, ::Vararg{Any})
    @ CUDAExt ~/.julia/packages/DaggerGPU/Kt3Ax/ext/CUDAExt.jl:269
  [5] #169
    @ ~/.julia/packages/Dagger/aVKft/src/sch/Sch.jl:1659 [inlined]
  [6] #21
    @ ~/.julia/packages/Dagger/aVKft/src/options.jl:18 [inlined]
  [7] with(::Dagger.var"#21#22"{Dagger.Sch.var"#169#177"{…}}, ::Pair{Base.ScopedValues.ScopedValue{…}, @NamedTuple{…}})
    @ Base.ScopedValues ./scopedvalues.jl:267
  [8] with_options(f::Dagger.Sch.var"#169#177"{CUDAExt.CuArrayDeviceProc, Vector{Pair{Symbol, Any}}, Vector{Any}}, options::@NamedTuple{scope::UnionScope})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/options.jl:17
  [9] do_task(to_proc::CUDAExt.CuArrayDeviceProc, task_desc::Vector{Any})
    @ Dagger.Sch ~/.julia/packages/Dagger/aVKft/src/sch/Sch.jl:1657
 [10] (::Dagger.Sch.var"#145#153"{UInt64, UInt32, Dagger.Sch.ProcessorInternalState, Distributed.RemoteChannel{Channel{Any}}, CUDAExt.CuArrayDeviceProc})()
    @ Dagger.Sch ~/.julia/packages/Dagger/aVKft/src/sch/Sch.jl:1333
  This Task:  DTask(id=8, Dagger.Chunk{typeof(Dagger.copydiagtile!), MemPool.DRef, OSProc, UnionScope}(typeof(Dagger.copydiagtile!), UnitDomain(), MemPool.DRef(1, 33, 0x0000000000000000), OSProc(1), UnionScope:
  ExactScope: processor == CuArrayDeviceProc(worker 1, device 0, uuid 77b44642-e0a6-ba49-8489-f70e83dde7f7), false)(Dagger.WeakChunk(1, 17, WeakRef(Dagger.Chunk{CuArray{Float32, 2, CUDA.DeviceMemory}, MemPool.DRef, CUDAExt.CuArrayDeviceProc, AnyScope}(CuArray{Float32, 2, CUDA.DeviceMemory}, ArrayDomain{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}((1:64, 1:64)), MemPool.DRef(1, 17, 0x0000000000004000), CuArrayDeviceProc(worker 1, device 0, uuid 77b44642-e0a6-ba49-8489-f70e83dde7f7), AnyScope(), false))), U))
Stacktrace:
  [1] fetch(t::Dagger.ThunkFuture; proc::OSProc, raw::Bool)
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/dtask.jl:17
  [2] fetch
    @ ~/.julia/packages/Dagger/aVKft/src/dtask.jl:12 [inlined]
  [3] #fetch#76
    @ ~/.julia/packages/Dagger/aVKft/src/dtask.jl:72 [inlined]
  [4] fetch
    @ ~/.julia/packages/Dagger/aVKft/src/dtask.jl:68 [inlined]
  [5] wait_all(f::Function; check_errors::Bool)
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/queue.jl:100
  [6] wait_all
    @ ~/.julia/packages/Dagger/aVKft/src/queue.jl:95 [inlined]
  [7] #spawn_datadeps#254
    @ ~/.julia/packages/Dagger/aVKft/src/datadeps.jl:942 [inlined]
  [8] spawn_datadeps
    @ ~/.julia/packages/Dagger/aVKft/src/datadeps.jl:934 [inlined]
  [9] copytri!
    @ ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:363 [inlined]
 [10] syrk_dagger!(C::DMatrix{…}, trans::Char, A::DMatrix{…}, _add::LinearAlgebra.MulAddMul{…})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:351
 [11] (::Dagger.var"#661#665"{Char, LinearAlgebra.MulAddMul{…}})(C::DMatrix{Float32, Blocks{…}, typeof(cat)}, A::DMatrix{Float32, Blocks{…}, typeof(cat)})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:18
 [12] maybe_copy_buffered(::Function, ::Pair{DMatrix{Float32, Blocks{…}, typeof(cat)}, Blocks{2}}, ::Vararg{Pair{DMatrix{…}, Blocks{…}}})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/copy.jl:8
 [13] generic_matmatmul!(C::DMatrix{…}, transA::Char, transB::Char, A::DMatrix{…}, B::DMatrix{…}, _add::LinearAlgebra.MulAddMul{…})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:17
 [14] _mul!
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:287 [inlined]
 [15] mul!
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:285 [inlined]
 [16] mul!(C::DMatrix{Float32, Blocks{…}, typeof(cat)}, A::DMatrix{Float32, Blocks{…}, typeof(cat)}, B::LinearAlgebra.Adjoint{Float32, DMatrix{…}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:253
 [17] *(A::DMatrix{Float32, Blocks{2}, typeof(cat)}, B::LinearAlgebra.Adjoint{Float32, DMatrix{Float32, Blocks{2}, typeof(cat)}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:114
 [18] (::var"#3#4")()
    @ Main ./REPL[16]:9
 [19] #21
    @ ~/.julia/packages/Dagger/aVKft/src/options.jl:18 [inlined]
 [20] with(::Dagger.var"#21#22"{var"#3#4"}, ::Pair{Base.ScopedValues.ScopedValue{NamedTuple}, @NamedTuple{scope::UnionScope}})
    @ Base.ScopedValues ./scopedvalues.jl:267
 [21] with_options(f::var"#3#4", options::@NamedTuple{scope::UnionScope})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/options.jl:17
 [22] with_options(f::Function; options::@Kwargs{scope::UnionScope})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/options.jl:21
 [23] top-level scope
    @ REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.
TheFibonacciEffect commented 1 month ago

julia> show(err)
1-element ExceptionStack:
DTaskFailedException:
  Root Exception Type: ErrorException
  Root Exception:
Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] errorscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:155
  [3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:128
  [4] assertscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:116
  [5] getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:48 [inlined]
  [6] scalar_getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:34 [inlined]
  [7] _getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:17 [inlined]
  [8] getindex
    @ ~/.julia/packages/GPUArrays/8Y80U/src/host/indexing.jl:15 [inlined]
  [9] getindex
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/adjtrans.jl:334 [inlined]
 [10] getindex
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/triangular.jl:265 [inlined]
 [11] _getindex
    @ ./abstractarray.jl:1361 [inlined]
 [12] getindex
    @ ./abstractarray.jl:1315 [inlined]
 [13] iterate
    @ ./abstractarray.jl:1212 [inlined]
 [14] iterate
    @ ./abstractarray.jl:1210 [inlined]
 [15] copyto_unaliased!(deststyle::IndexLinear, dest::CuArray{Float32, 2, CUDA.DeviceMemory}, srcstyle::IndexCartesian, src::LinearAlgebra.LowerTriangular{Float32, LinearAlgebra.Adjoint{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}})
    @ Base ./abstractarray.jl:1086
 [16] copyto!
    @ ./abstractarray.jl:1061 [inlined]
 [17] +(A::LinearAlgebra.LowerTriangular{Float32, LinearAlgebra.Adjoint{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}}, B::LinearAlgebra.UpperTriangular{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/triangular.jl:747
 [18] copydiagtile!(A::CuArray{Float32, 2, CUDA.DeviceMemory}, uplo::Char)
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:403
 [19] #invokelatest#2
    @ ./essentials.jl:1043 [inlined]
 [20] invokelatest
    @ ./essentials.jl:1040 [inlined]
 [21] (::CUDAExt.var"#26#27"{@Kwargs{}, CUDAExt.CuArrayDeviceProc, typeof(Dagger.copydiagtile!), Tuple{CuArray{Float32, 2, CUDA.DeviceMemory}, Char}, @NamedTuple{sch_uid::UInt64, sch_handle::Dagger.Sch.SchedulerHandle, processor::CUDAExt.CuArrayDeviceProc, task_spec::Vector{Any}}})()
    @ CUDAExt ~/.julia/packages/DaggerGPU/Kt3Ax/ext/CUDAExt.jl:275
Stacktrace:
  [1] wait(t::Task)
    @ Base ./task.jl:370
  [2] fetch
    @ ./task.jl:390 [inlined]
  [3] execute!(::CUDAExt.CuArrayDeviceProc, ::Any, ::Any, ::Vararg{Any}; kwargs...)
    @ CUDAExt ~/.julia/packages/DaggerGPU/Kt3Ax/ext/CUDAExt.jl:281
  [4] execute!(::CUDAExt.CuArrayDeviceProc, ::Any, ::Any, ::Vararg{Any})
    @ CUDAExt ~/.julia/packages/DaggerGPU/Kt3Ax/ext/CUDAExt.jl:269
  [5] #169
    @ ~/.julia/packages/Dagger/aVKft/src/sch/Sch.jl:1659 [inlined]
  [6] #21
    @ ~/.julia/packages/Dagger/aVKft/src/options.jl:18 [inlined]
  [7] with(::Dagger.var"#21#22"{Dagger.Sch.var"#169#177"{CUDAExt.CuArrayDeviceProc, Vector{Pair{Symbol, Any}}, Vector{Any}}}, ::Pair{Base.ScopedValues.ScopedValue{NamedTuple}, @NamedTuple{scope::UnionScope}})
    @ Base.ScopedValues ./scopedvalues.jl:267
  [8] with_options(f::Dagger.Sch.var"#169#177"{CUDAExt.CuArrayDeviceProc, Vector{Pair{Symbol, Any}}, Vector{Any}}, options::@NamedTuple{scope::UnionScope})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/options.jl:17
  [9] do_task(to_proc::CUDAExt.CuArrayDeviceProc, task_desc::Vector{Any})
    @ Dagger.Sch ~/.julia/packages/Dagger/aVKft/src/sch/Sch.jl:1657
 [10] (::Dagger.Sch.var"#145#153"{UInt64, UInt32, Dagger.Sch.ProcessorInternalState, Distributed.RemoteChannel{Channel{Any}}, CUDAExt.CuArrayDeviceProc})()
    @ Dagger.Sch ~/.julia/packages/Dagger/aVKft/src/sch/Sch.jl:1333
  This Task:  DTask(id=8, Dagger.Chunk{typeof(Dagger.copydiagtile!), MemPool.DRef, OSProc, UnionScope}(typeof(Dagger.copydiagtile!), UnitDomain(), MemPool.DRef(1, 33, 0x0000000000000000), OSProc(1), UnionScope:
  ExactScope: processor == CuArrayDeviceProc(worker 1, device 0, uuid 77b44642-e0a6-ba49-8489-f70e83dde7f7), false)(Dagger.WeakChunk(1, 17, WeakRef(Dagger.Chunk{CuArray{Float32, 2, CUDA.DeviceMemory}, MemPool.DRef, CUDAExt.CuArrayDeviceProc, AnyScope}(CuArray{Float32, 2, CUDA.DeviceMemory}, ArrayDomain{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}((1:64, 1:64)), MemPool.DRef(1, 17, 0x0000000000004000), CuArrayDeviceProc(worker 1, device 0, uuid 77b44642-e0a6-ba49-8489-f70e83dde7f7), AnyScope(), false))), U))
Stacktrace:
  [1] fetch(t::Dagger.ThunkFuture; proc::OSProc, raw::Bool)
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/dtask.jl:17
  [2] fetch
    @ ~/.julia/packages/Dagger/aVKft/src/dtask.jl:12 [inlined]
  [3] #fetch#76
    @ ~/.julia/packages/Dagger/aVKft/src/dtask.jl:72 [inlined]
  [4] fetch
    @ ~/.julia/packages/Dagger/aVKft/src/dtask.jl:68 [inlined]
  [5] wait_all(f::Function; check_errors::Bool)
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/queue.jl:100
  [6] wait_all
    @ ~/.julia/packages/Dagger/aVKft/src/queue.jl:95 [inlined]
  [7] #spawn_datadeps#254
    @ ~/.julia/packages/Dagger/aVKft/src/datadeps.jl:942 [inlined]
  [8] spawn_datadeps
    @ ~/.julia/packages/Dagger/aVKft/src/datadeps.jl:934 [inlined]
  [9] copytri!
    @ ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:363 [inlined]
 [10] syrk_dagger!(C::DMatrix{Float32, Blocks{2}, typeof(cat)}, trans::Char, A::DMatrix{Float32, Blocks{2}, typeof(cat)}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:351
 [11] (::Dagger.var"#661#665"{Char, LinearAlgebra.MulAddMul{true, true, Bool, Bool}})(C::DMatrix{Float32, Blocks{2}, typeof(cat)}, A::DMatrix{Float32, Blocks{2}, typeof(cat)})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:18
 [12] maybe_copy_buffered(::Function, ::Pair{DMatrix{Float32, Blocks{2}, typeof(cat)}, Blocks{2}}, ::Vararg{Pair{DMatrix{Float32, Blocks{2}, typeof(cat)}, Blocks{2}}})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/copy.jl:8
 [13] generic_matmatmul!(C::DMatrix{Float32, Blocks{2}, typeof(cat)}, transA::Char, transB::Char, A::DMatrix{Float32, Blocks{2}, typeof(cat)}, B::DMatrix{Float32, Blocks{2}, typeof(cat)}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/array/mul.jl:17
 [14] _mul!
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:287 [inlined]
 [15] mul!
    @ ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:285 [inlined]
 [16] mul!(C::DMatrix{Float32, Blocks{2}, typeof(cat)}, A::DMatrix{Float32, Blocks{2}, typeof(cat)}, B::LinearAlgebra.Adjoint{Float32, DMatrix{Float32, Blocks{2}, typeof(cat)}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:253
 [17] *(A::DMatrix{Float32, Blocks{2}, typeof(cat)}, B::LinearAlgebra.Adjoint{Float32, DMatrix{Float32, Blocks{2}, typeof(cat)}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:114
 [18] (::var"#3#4")()
    @ Main ./REPL[16]:9
 [19] #21
    @ ~/.julia/packages/Dagger/aVKft/src/options.jl:18 [inlined]
 [20] with(::Dagger.var"#21#22"{var"#3#4"}, ::Pair{Base.ScopedValues.ScopedValue{NamedTuple}, @NamedTuple{scope::UnionScope}})
    @ Base.ScopedValues ./scopedvalues.jl:267
 [21] with_options(f::var"#3#4", options::@NamedTuple{scope::UnionScope})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/options.jl:17
 [22] with_options(f::Function; options::@Kwargs{scope::UnionScope})
    @ Dagger ~/.julia/packages/Dagger/aVKft/src/options.jl:21
 [23] top-level scope
    @ REPL[16]:1
TheFibonacciEffect commented 1 month ago

Manifest.toml.txt

This is the current Manifest.toml file (GitHub doesn't allow attaching .toml files directly, so I added the .txt suffix).

TheFibonacciEffect commented 1 month ago

And here is some additional information about the GPU I am using:

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro RTX 4000"
  CUDA Driver Version / Runtime Version          11.4 / 11.2
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7960 MBytes (8346533888 bytes)
MapSMtoCores for SM 7.5 is undefined.  Default to use 64 Cores/SM
MapSMtoCores for SM 7.5 is undefined.  Default to use 64 Cores/SM
  (036) Multiprocessors, (064) CUDA Cores/MP:    2304 CUDA Cores
  GPU Max Clock rate:                            1545 MHz (1.54 GHz)
  Memory Clock rate:                             6501 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.2, NumDevs = 1
Result = PASS
TheFibonacciEffect commented 1 month ago

Thanks a lot again! If there is more information needed, just ask :)

jpsamaroo commented 1 month ago

Thanks for the info! I'm still traveling home to the US, but I'll plan to take a look at this again this week.

jpsamaroo commented 1 month ago

Ok, this happens because we internally do `UpperTriangular(A)' + UpperTriangular(A)`, where we should really use `.+` instead to ensure GPU support. I'm putting together a branch with this and a few other fixes. I'll validate that it works locally with AMDGPU.jl (as that's what I've got on my laptop), then post it so you can verify that it works on your system too.
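
For reference, here's roughly the failing pattern reduced to a standalone snippet (an illustration, not the exact code in Dagger; assumes CUDA.jl and a CUDA-capable GPU):

using CUDA, LinearAlgebra

CUDA.allowscalar(false)  # make scalar indexing an error (it already errors inside Dagger's worker tasks)

A = CUDA.rand(Float32, 8, 8)

# The generic `+` on triangular wrappers lowers to `copyto!`, which iterates the
# source with scalar `getindex` -- the same path the stacktrace shows failing in
# `copydiagtile!`:
UpperTriangular(A)' + UpperTriangular(A)  # ERROR: Scalar indexing is disallowed.

# The planned fix is to broadcast instead, e.g. `UpperTriangular(A)' .+ UpperTriangular(A)`,
# which should keep the work on the GPU.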

TheFibonacciEffect commented 1 month ago

Perfect, thanks a lot : )