It seems that GPUArrays does not know how to handle a triangular matmul.
We should either add `generic_trimatmul` to GPUArrays or teach it to fall back to a normal matmul for those cases.
Falling back to matmul will likely be faster in most cases than implementing a dedicated `generic_trimatmul` kernel, since backends often provide a matrix-matrix multiplication implementation that is more performant than a generic triangular kernel would be.
```julia
julia> A = UpperTriangular(MtlMatrix(rand(Float32, 1024, 1024)))
julia> x = mtl(rand(1024))
julia> A * x
ERROR: ArgumentError: cannot take the CPU address of a MtlMatrix{Float32, Private}
Stacktrace:
 [1] unsafe_convert(::Type{Ptr{Float32}}, x::MtlMatrix{Float32, Private})
   @ Metal ~/Developer/Metal.jl/src/array.jl:197
 [2] trmv!(uplo::Char, trans::Char, diag::Char, A::MtlMatrix{Float32, Private}, x::MtlVector{Float32, Private})
   @ LinearAlgebra.BLAS ~/.julia/juliaup/julia-1.10.2+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/LinearAlgebra/src/blas.jl:1315
 [3] generic_trimatmul!(c::MtlVector{…}, uploc::Char, isunitc::Char, tfun::Function, A::MtlMatrix{…}, b::MtlVector{…})
   @ LinearAlgebra ~/.julia/juliaup/julia-1.10.2+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/LinearAlgebra/src/triangular.jl:823
 [4] _trimul!(C::MtlVector{Float32, Private}, A::UpperTriangular{Float32, MtlMatrix{…}}, B::MtlVector{Float32, Private})
   @ LinearAlgebra ~/.julia/juliaup/julia-1.10.2+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/LinearAlgebra/src/triangular.jl:705
 [5] mul!(C::MtlVector{Float32, Private}, A::UpperTriangular{Float32, MtlMatrix{…}}, B::MtlVector{Float32, Private})
   @ LinearAlgebra ~/.julia/juliaup/julia-1.10.2+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/LinearAlgebra/src/triangular.jl:690
 [6] *(A::UpperTriangular{Float32, MtlMatrix{Float32, Private}}, B::MtlVector{Float32, Private})
   @ LinearAlgebra ~/.julia/juliaup/julia-1.10.2+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/LinearAlgebra/src/triangular.jl:1471
 [7] top-level scope
   @ REPL[18]:1
Some type information was truncated. Use `show(err)` to see complete types.
```
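For illustration, here is a rough sketch of what the "fall back to a normal matmul" option could look like. The names `dense_triangle` and `trimatmul_fallback!` are hypothetical and not part of GPUArrays or LinearAlgebra. Plain CPU arrays stand in for GPU arrays so the snippet runs anywhere, and it assumes the backend supports `triu`/`tril` and a dense `mul!` for its array type. Unit-triangular wrappers are not handled.

```julia
using LinearAlgebra

# Hypothetical helpers: copy the parent and zero everything outside the
# stored triangle, so the result is an ordinary dense matrix.
dense_triangle(A::UpperTriangular) = triu(parent(A))
dense_triangle(A::LowerTriangular) = tril(parent(A))

# Hypothetical fallback: defer to the regular dense mul! instead of the
# BLAS trmv!/trmm! path, which needs a CPU pointer.
function trimatmul_fallback!(C::AbstractVecOrMat,
                             A::Union{UpperTriangular,LowerTriangular},
                             B::AbstractVecOrMat)
    return mul!(C, dense_triangle(A), B)
end

# CPU arrays used here as a stand-in for MtlMatrix / MtlVector.
A = UpperTriangular(rand(Float32, 8, 8))
x = rand(Float32, 8)
y = similar(x)
trimatmul_fallback!(y, A, x)
```

The trade-off is an extra copy of the matrix and roughly twice the arithmetic of a true triangular product, in exchange for reusing whatever tuned dense matmul the backend already provides.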