JuliaLinearAlgebra / Octavian.jl

Multi-threaded BLAS-like library that provides pure Julia matrix multiplication
https://julialinearalgebra.github.io/Octavian.jl/stable/

Inspect chosen block sizes #184

Closed JohannesNaegele closed 9 months ago

JohannesNaegele commented 11 months ago

Hey, I was just wondering whether there is an easy way to see which block sizes Octavian chooses for matmul!(C, A, B). I dug around a bit and saw solve_block_sizes, but found it a bit tedious to understand its design.

chriselrod commented 11 months ago

Hey, I was just wondering whether there is an easy way to see which block sizes Octavian chooses for matmul!(C, A, B). I dug around a bit and saw solve_block_sizes, but found it a bit tedious to understand its design.

I don't think the solve_block_sizes code is particularly great; at least, trying to parameterize it by the four parameters I chose doesn't seem to work that well.

That said,

julia> using Octavian

julia> Octavian.block_sizes(Val(Float64))
(static(432), static(904), static(7194))

julia> Octavian.block_sizes(Val(Float32))
(static(864), static(904), static(14388))

julia> versioninfo()
Julia Version 1.9.3-DEV.0
Commit 6fc1be04ee (2023-07-06 14:55 UTC)
Platform Info:
  OS: Linux (aarch64-unknown-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
julia> using Octavian

julia> Octavian.block_sizes(Val(Float64))
(static(720), static(144), static(891))

julia> Octavian.block_sizes(Val(Float32))
(static(1440), static(144), static(1782))

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:

    Note: This is an unofficial build, please report bugs to the project
    responsible for this build and not to the Julia project unless you can
    reproduce the issue using official builds available at https://julialang.org/downloads

Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 28 on 28 virtual cores

I'm a bit skeptical looking at those numbers. The M1 doesn't seem to care which sizes I give it, or perhaps I have a bug in the tilesize script.

Meanwhile, for Skylake-X, Kc looks too small relative to Mc, so I should probably check that the parameters are okay.

Anyway, this function returns Mc, Kc, Nc. If we're multiplying column-major C = A*B, where A is MxK, B is KxN, and C is MxN, then Mc, Kc, and Nc are the tile sizes along each of M, K, and N.
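
For illustration, here's a rough sketch of which loop each block size tiles. This is just schematic (no packing, no threading, plain 5-arg mul! for the tile update), not what Octavian actually does internally:

using LinearAlgebra, Octavian

function blocked_reference!(C, A, B)
    # Mc, Kc, Nc are the tile sizes along M, K, and N
    Mc, Kc, Nc = Octavian.block_sizes(Val(eltype(A)))
    M, K = size(A); N = size(B, 2)
    C .= zero(eltype(C))
    for nc in 1:Nc:N, kc in 1:Kc:K, mc in 1:Mc:M
        ms = mc:min(mc + Mc - 1, M)
        ks = kc:min(kc + Kc - 1, K)
        ns = nc:min(nc + Nc - 1, N)
        # accumulate this tile's product into C (5-arg mul! with α = β = 1)
        @views mul!(C[ms, ns], A[ms, ks], B[ks, ns], true, true)
    end
    return C
end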

JohannesNaegele commented 11 months ago

Thank you so much! So far my understanding is that Octavian does not use packing and uses LoopVectorization for the microkernel. My machine gives me:

julia> using Octavian

julia> Octavian.block_sizes(Val(Float64))
(static(32), static(260), static(840))

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

Therefore, I thought this should, in the end, correspond to something like:

using Octavian
using BenchmarkTools
using LoopVectorization

A = rand(2000,2000)
B = rand(2000,2000)
C = rand(2000,2000)

# general matrix multiplication
function gemm_kernel!(C, A, B)
    @turbo for n ∈ axes(A, 1), m ∈ axes(B, 2)
        Cₙₘ = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cₙₘ += A[n,k] * B[k,m]
        end
        C[n,m] = Cₙₘ
    end
end

# general matrix multiplication, accumulating onto the result matrix
function add_gemm_kernel!(C, A, B)
    @turbo for n ∈ axes(A, 1), m ∈ axes(B, 2)
        Cₙₘ = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cₙₘ += A[n,k] * B[k,m]
        end
        C[n,m] += Cₙₘ
    end
end

function blocked_kernel!(C, A, B)
    sₘ = 32
    sₙ = 840
    sₖ = 260
    m, k = size(A)
    n = size(B, 2)
    @assert ((m, n) == size(C)) "Dimension mismatch!"
    for bₙ in 1:sₙ:n, bₘ in 1:sₘ:m, bₖ in 1:sₖ:k
        endₘ = min(m, bₘ + sₘ - 1)
        endₙ = min(n, bₙ + sₙ - 1)
        endₖ = min(k, bₖ + sₖ - 1)
        @inbounds @views begin
            block_A = A[bₘ:endₘ, bₖ:endₖ]
            block_B = B[bₖ:endₖ, bₙ:endₙ]
            block_C = C[bₘ:endₘ, bₙ:endₙ]
        end
        if bₖ == 1
            gemm_kernel!(block_C, block_A, block_B)
        else
            add_gemm_kernel!(block_C, block_A, block_B)
        end
    end
end

@btime blocked_kernel!($C, $A, $B) # 351.434 ms
@btime Octavian.matmul!($C, $A, $B) # 337.672 ms

As you can see, Octavian is still significantly faster. What is the catch?

chriselrod commented 11 months ago

So far my understanding is that Octavian does not use packing

As you can see, Octavian is still significantly faster. What is the catch?

It does use packing.

Also, try updating to the latest Octavian release; that should improve Octavian's performance further. Notice the far larger block sizes I reported for the M1 compared to what you got. Of course, that could also be a bug in detecting Apple silicon on macOS vs. Asahi.

JohannesNaegele commented 11 months ago

Indeed, this yields larger block sizes; however, the performance is actually worse:

julia> using Octavian

julia> using BenchmarkTools

julia> A = rand(2000,2000); B = rand(2000,2000); C = rand(2000,2000);

julia> Octavian.block_sizes(Val(Float64))
(static(856), static(456), static(204))

julia> @btime C .= A*B;
  162.493 ms (4 allocations: 30.52 MiB)

julia> @btime Octavian.matmul!($C, $A, $B);
  345.716 ms (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

(...) pkg> status Octavian
Status `~/.../Project.toml`
  [6fd5a793] Octavian v0.3.26

In general I am just trying to find out whether there is a simple way of writing down the Octavian matmul (similar to what I did above) if the block sizes are already specified.

chriselrod commented 11 months ago

Indeed, this yields larger block sizes; however, the performance is actually worse:

Oops. I do not think my attempts at tuning were working correctly.

In general I am just trying to find out whether there is a simple way of writing down the Octavian matmul (similar to what I did above) if the block sizes are already specified.

Octavian switches to a "tile major" order, which makes things a bit more difficult, but you could stick with column-major packing, i.e., allocate an Mc × Kc matrix and then copy block_A into it before calling the kernel.

You'll want to avoid heap allocations, so you can either allocate a global and use that, or try to stack allocate it.
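
As a minimal sketch of the global-buffer route (illustrative names, Float64 only, buffer sized to the Mc × Kc you reported, and reusing the gemm_kernel! you already wrote), it could look roughly like this; the stack-allocation route would swap the global Matrix for a statically sized StrideArray:

# Illustrative only: a preallocated Mc × Kc packing buffer, reused across calls.
const BLOCK_A_BUF = Matrix{Float64}(undef, 32, 260)

function packed_block!(block_C, A_block, block_B)
    mᵢ, kᵢ = size(A_block)
    buf = @view BLOCK_A_BUF[1:mᵢ, 1:kᵢ]
    buf .= A_block                       # pack the A tile into the preallocated buffer
    gemm_kernel!(block_C, buf, block_B)  # then run the existing microkernel on the packed tile
end                                      # (use add_gemm_kernel! for the accumulating case)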

JohannesNaegele commented 11 months ago

Oops. I do not think my attempts at tuning were working correctly.

lol

Octavian switches to a "tile major" order, which makes things a bit more difficult, but you could stick with column-major packing, i.e., allocate an Mc × Kc matrix and then copy block_A into it before calling the kernel.

I have no idea what tile major order means. Where can I find docs/material regarding this technique?

You'll want to avoid heap allocations, so you can either allocate a global and use that, or try to stack allocate it.

Yeah, e.g. this StrideArray-based "implementation" is slightly faster than Octavian:

using StrideArrays  # StrideArray below comes from StrideArrays.jl

function stack_kernel!(C, A, B)
    sₘ = 16
    sₙ = 32 * 18
    sₖ = 32 * 16 - 8 - 4
    m, k = size(A)
    n = size(B, 2)
    @assert ((m, n) == size(C)) "Dimension mismatch!"
    block_A = StrideArray{eltype(A)}(undef, (sₘ, sₖ))
    for bₙ in 1:sₙ:n
        endₙ = min(n, bₙ + sₙ - 1)
        for bₖ in 1:sₖ:k
            endₖ = min(k, bₖ + sₖ - 1)
            lengthₖ = endₖ - bₖ + 1
            for bₘ in 1:sₘ:m
                endₘ = min(m, bₘ + sₘ - 1)
                lengthₘ = endₘ - bₘ + 1
                @turbo for i in 1:lengthₘ
                    for j in 1:lengthₖ
                        block_A[i, j] = A[bₘ + i - 1, bₖ + j - 1]
                    end
                end
                @inbounds @views begin
                    block_B = B[bₖ:endₖ, bₙ:endₙ]
                    block_C = C[bₘ:endₘ, bₙ:endₙ]
                end
                if bₖ == 1
                    gemm_kernel!(block_C, block_A, block_B)
                else
                    add_gemm_kernel!(block_C, block_A, block_B)
                end
            end
        end
    end
end

@btime stack_kernel!($C, $A, $B) # 333.025 ms (2 allocations: 62.55 KiB)

chriselrod commented 11 months ago

Oops. I do not think my attempts at tuning were working correctly.

lol

I've found the problem and am now rerunning the optimizer. I'll probably run it overnight, but I'll have initial results in a few hours.

I have no idea what tile major order means. Where can I find docs/material regarding this technique?

julia> Mr,Nr = Octavian.matmul_params(Val(Float64))
(static(3), static(9))

julia> Octavian.pick_vector_width(Float64)*Mr, Nr
(static(24), static(9))

julia> @assert sₘ % Mr == 0

julia> block_A = StrideArray{eltype(A)}(undef, (Mr, sₖ, sₘ÷Mr))

You'll then have to update both the packing into block_A and the matrix-multiply kernel accordingly.

The benefit of this is that block_A should now be traversed in memory order, allowing it to take advantage of the streaming prefetchers. This should help at least a little.

You'll also want to specialize the kernels on the size, i.e. make sure Mr is known at compile time, to get rid of useless cleanup loops, etc.
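
To make that a bit more concrete, here is a rough sketch only (same illustrative names as above, assuming sₘ is an exact multiple of Mr and ignoring remainder tiles) of what the tile-major packing and the kernel's traversal could look like, reusing the add_gemm_kernel! from earlier in the thread:

using LoopVectorization

# block_A is the (Mr, sₖ, sₘ ÷ Mr) array suggested above;
# tile t holds Mr consecutive rows of the A block starting at bₘ.
function pack_A_tilemajor!(block_A, A, bₘ, bₖ)
    Mr, Kc, ntiles = size(block_A)
    for t in 1:ntiles
        row₀ = bₘ + (t - 1) * Mr - 1
        @turbo for j in 1:Kc, i in 1:Mr
            block_A[i, j, t] = A[row₀ + i, bₖ + j - 1]
        end
    end
    return block_A
end

# The kernel then walks the third (tile) axis, so each Mr × Kc panel of block_A
# is read in exactly the order it is laid out in memory.
function tilemajor_kernel!(C, block_A, block_B, bₘ, bₙ)
    Mr = size(block_A, 1)
    for t in axes(block_A, 3)
        rows = (bₘ + (t - 1) * Mr):(bₘ + t * Mr - 1)
        cols = bₙ:(bₙ + size(block_B, 2) - 1)
        @views add_gemm_kernel!(C[rows, cols], block_A[:, :, t], block_B)
    end
    return C
end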