LoopVectorization.jl would be useful for this.
Example: matrix-vector multiplication r = A*v, where A[i, j] = a[i] * b[j] * I(i <= j) and I is the indicator function; that is, r[i] = a[i] * sum(b[j] * v[j] for j in i:m). With LoopVectorization.jl v0.6.21:
versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)
using LoopVectorization
using BenchmarkTools
function matvec_basic!(out, a, b, v)
    m = length(a)
    @inbounds @simd for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            if i <= j
                outi += v[j] * a[i] * b[j]
            end
        end
        out[i] = outi
    end
    return out
end
matvec_basic! (generic function with 1 method)
function matvec_avx!(out, a, b, v)
    m = length(a)
    # direct operations on loop indices are not supported (as of v0.6.21),
    # so materialize the indices into an array
    indarr = collect(1:m)
    @avx for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            outi += ifelse(indarr[i] <= indarr[j],
                           v[j] * a[i] * b[j], zero(eltype(a)))
        end
        out[i] = outi
    end
    return out
end
matvec_avx! (generic function with 1 method)
sz = 500
a = randn(sz); b = randn(sz); v = randn(sz)
out_basic = randn(sz); out_avx = randn(sz);
isapprox(matvec_avx!(out_avx, a, b, v), matvec_basic!(out_basic, a, b, v))
true
@benchmark matvec_avx!($out_avx, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 4.06 KiB
allocs estimate: 1
--------------
minimum time: 84.622 μs (0.00% GC)
median time: 95.164 μs (0.00% GC)
mean time: 96.968 μs (0.00% GC)
maximum time: 167.482 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark matvec_basic!($out_basic, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 213.102 μs (0.00% GC)
median time: 227.754 μs (0.00% GC)
mean time: 231.575 μs (0.00% GC)
maximum time: 322.697 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
The implementation with LoopVectorization is much faster. However, it should be used with caution, as the package is still under rapid development. Some Julia syntax is unsupported (e.g., if-else statements), and unexpected numerical bugs may arise. For example, changing outi += ifelse(indarr[i] <= indarr[j], v[j] * a[i] * b[j], zero(eltype(a)))
to outi += ifelse(i <= j, v[j] * a[i] * b[j], zero(eltype(a)))
silently gives an incorrect result, without any warning.
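Given such silent failures, a cheap consistency check against the plain-Julia reference is worth keeping nearby. A minimal sketch using the two functions defined above; the odd length is deliberate, so that remainder lanes are exercised:

let m = 257  # deliberately not a multiple of the SIMD width
    a, b, v = randn(m), randn(m), randn(m)
    @assert matvec_avx!(similar(a), a, b, v) ≈ matvec_basic!(similar(a), a, b, v)
end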
It now supports some BitMatrix operations (added less than 24 hours ago).
This is a related issue. For optimal performance, we need to fill in the two TODOs for the case where the matrix size is not a multiple of 512, and also do the unrolling mentioned in that post.
using VectorizationBase: gesp, stridedpointer
function gemv_avx_512by512!(c, A, b)
    @avx for j in 1:512, i in 1:512
        c[i] += A[i, j] * b[j]
    end
end

function gemv_tile!(c, A, b)
    M, N = size(A)
    Miter = M >>> 9 # fast div(M, 512)
    Mrem = M & 511  # fast rem(M, 512)
    Niter = N >>> 9
    Nrem = N & 511
    GC.@preserve c A b for n in 0:Niter-1
        for m in 0:Miter-1
            gemv_avx_512by512!(
                gesp(stridedpointer(c), (512m,)),
                gesp(stridedpointer(A), (8m, 8n)),
                gesp(stridedpointer(b), (512n,))
            )
        end
        # TODO: handle Mrem
    end
    # TODO: handle Nrem
end
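One way the two TODOs could be filled in is sketched below, using views and a kernel with dynamic trip counts instead of gesp/stridedpointer arithmetic. This is an untested illustration under those assumptions, not the approach from that post; gemv_avx! and gemv_tile_rem! are hypothetical names.

# Hypothetical generic kernel: same body as gemv_avx_512by512!, but with
# dynamic trip counts so it can also digest leftover rows/columns.
function gemv_avx!(c, A, b)
    @avx for j in 1:size(A, 2), i in 1:size(A, 1)
        c[i] += A[i, j] * b[j]
    end
end

function gemv_tile_rem!(c, A, b)
    M, N = size(A)
    Mrem = M & 511    # leftover rows
    Nrem = N & 511    # leftover columns
    Mmain = M - Mrem  # largest multiple of 512 that fits in M
    for n in 0:(N >>> 9)-1
        cols = 512n+1:512n+512
        for m in 0:(M >>> 9)-1
            rows = 512m+1:512m+512
            gemv_avx!(view(c, rows), view(A, rows, cols), view(b, cols))
        end
        # remainder rows for this block of columns
        Mrem > 0 && gemv_avx!(view(c, Mmain+1:M), view(A, Mmain+1:M, cols), view(b, cols))
    end
    if Nrem > 0  # remainder columns, over the full height
        cols = N-Nrem+1:N
        gemv_avx!(view(c, 1:M), view(A, 1:M, cols), view(b, cols))
    end
    return c
end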
I was going to tackle it, but you are much better at this; would you want to give it a shot? Except this is still within the regime of converting the SnpArray into a SnpBitMatrix, so perhaps this is not all that related.
For example, changing outi += ifelse(indarr[i] <= indarr[j], v[j] * a[i] * b[j], zero(eltype(a))) to outi += ifelse(i <= j, v[j] * a[i] * b[j], zero(eltype(a))) silently gives an incorrect result, without any warning.
Please file issues when you notice problems! Some bugs are straightforward to fix -- I just need to be aware of them. This will work on LoopVectorization 0.6.27, which should be released within about an hour of this post.
I assume there's a reason you didn't just make your inner loop for j in i:m and get rid of the if statement?
using LoopVectorization, BenchmarkTools
function matvec_avx!(out, a, b, v)
    m = length(a)
    @avx for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            outi += ifelse(i <= j,
                           v[j] * a[i] * b[j], zero(eltype(a)))
        end
        out[i] = outi
    end
    return out
end

function matvec_basic!(out, a, b, v)
    m = length(a)
    @inbounds @simd for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            if i <= j
                outi += v[j] * a[i] * b[j]
            end
        end
        out[i] = outi
    end
    return out
end

function matvec_basic_iv!(out, a, b, v)
    m = length(a)
    @inbounds for i in 1:m
        outi = zero(eltype(a))
        @simd for j in i:m
            outi += v[j] * a[i] * b[j]
        end
        out[i] = outi
    end
    return out
end

function matvec_avx_iv!(out, a, b, v)
    m = length(a)
    @inbounds for i in 1:m
        outi = zero(eltype(a))
        @avx for j in i:m
            outi += v[j] * a[i] * b[j]
        end
        out[i] = outi
    end
    return out
end
sz = 500
a = randn(sz); b = randn(sz); v = randn(sz);
out_basic = randn(sz);
out_avx = randn(sz);
out_basic_iv = randn(sz);
out_avx_iv = randn(sz);
matvec_avx!(out_avx, a, b, v) ≈ matvec_basic!(out_basic, a, b, v)
matvec_basic_iv!(out_basic_iv, a, b, v) ≈ out_basic
matvec_avx_iv!(out_avx_iv, a, b, v) ≈ out_basic
@benchmark matvec_avx!($out_avx, $a, $b, $v)
@benchmark matvec_basic!($out_basic, $a, $b, $v)
@benchmark matvec_avx_iv!($out_avx_iv, $a, $b, $v)
@benchmark matvec_basic_iv!($out_basic_iv, $a, $b, $v)
This yields:
julia> matvec_avx!(out_avx, a, b, v) ≈ matvec_basic!(out_basic, a, b, v)
true
julia> matvec_basic_iv!(out_basic_iv, a, b, v) ≈ out_basic
true
julia> matvec_avx_iv!(out_avx_iv, a, b, v) ≈ out_basic
true
julia> @benchmark matvec_avx!($out_avx, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.449 μs (0.00% GC)
median time: 16.464 μs (0.00% GC)
mean time: 16.487 μs (0.00% GC)
maximum time: 31.922 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark matvec_basic!($out_basic, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 163.726 μs (0.00% GC)
median time: 163.804 μs (0.00% GC)
mean time: 165.766 μs (0.00% GC)
maximum time: 214.120 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark matvec_avx_iv!($out_avx_iv, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.923 μs (0.00% GC)
median time: 9.955 μs (0.00% GC)
mean time: 9.970 μs (0.00% GC)
maximum time: 23.192 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark matvec_basic_iv!($out_basic_iv, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.941 μs (0.00% GC)
median time: 16.034 μs (0.00% GC)
mean time: 16.055 μs (0.00% GC)
maximum time: 27.664 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
As @biona001 notes, you'll get better performance with extra tiling.
@chriselrod Thanks for coming here. I'm impressed with the quick development of this package. My actual computation is block-triangular (I see my loop is not the best way to express it). There are a couple more issues I found, and I will file them soon.
These work for the additive model, with no centering or scaling. They are about twice as slow as SnpBitMatrix, but memory-efficient (no allocation of the two BitMatrixes), with or without the @avx macro.
function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Aij = A[i, j]
            # maps the 2-bit code to the additive dosage:
            # 0x00 -> 0, 0x01 (missing) -> 0, 0x02 -> 1, 0x03 -> 2
            yi += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end

function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)
    for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
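For reference, the expression ((Aij >= 2) + (Aij >= 3)) maps each 2-bit genotype code to its additive dosage, treating the missing code 0x01 as zero; a quick check of the arithmetic:

for g in 0x00:0x03
    println(g, " => ", (g >= 2) + (g >= 3))  # 0x00 => 0, 0x01 => 0, 0x02 => 1, 0x03 => 2
end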
Design choice: Should I add

model::Union{Val{1}, Val{2}, Val{3}}
center::Bool
scale::Bool
μ::Vector{T}
σinv::Vector{T}
storagev1::Vector{T}
storagev2::Vector{T}

to struct SnpArray, or should I create a separate struct, say, SnpLinAlg?
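A separate wrapper might look something like the sketch below (purely illustrative; the field names come from the list above, and putting the model in a type parameter is one way to avoid a Union-typed field):

struct SnpLinAlg{T, M}  # hypothetical; M ∈ (1, 2, 3) selects the genetic model at compile time
    s::SnpArray
    center::Bool
    scale::Bool
    μ::Vector{T}
    σinv::Vector{T}
    storagev1::Vector{T}
    storagev2::Vector{T}
end

This would keep SnpArray itself untouched and confine the linear-algebra state (means, inverse standard deviations, work buffers) to the wrapper.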
My bad. I made a mistake with the loop ordering. The computation with SnpArray is only 12-15% slower.
using SnpArrays
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
const EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=false, scale=false);
size(EUR)
(379, 54051)
using LinearAlgebra
using BenchmarkTools
using LoopVectorization
function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    packedstride = size(A.data, 1)
    m, n = size(A)
    fill!(y, zero(eltype(y)))
    @avx for j ∈ eachindex(x)
        for i ∈ eachindex(y)
            Aij = A[i, j]
            y[i] += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
    end
    y
end
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2));
@benchmark LinearAlgebra.mul!($v1, $EUR, $v2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 55.440 ms (0.00% GC)
median time: 56.042 ms (0.00% GC)
mean time: 56.163 ms (0.00% GC)
maximum time: 62.929 ms (0.00% GC)
--------------
samples: 90
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v1, $EURbm, $v2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 49.126 ms (0.00% GC)
median time: 49.592 ms (0.00% GC)
mean time: 49.645 ms (0.00% GC)
maximum time: 50.705 ms (0.00% GC)
--------------
samples: 101
evals/sample: 1
function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)
    for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
@benchmark LinearAlgebra.mul!($v2, transpose($EUR), $v1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 62.287 ms (0.00% GC)
median time: 63.172 ms (0.00% GC)
mean time: 63.230 ms (0.00% GC)
maximum time: 65.554 ms (0.00% GC)
--------------
samples: 80
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v2, transpose($EURbm), $v1))
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 54.606 ms (0.00% GC)
median time: 55.116 ms (0.00% GC)
mean time: 55.210 ms (0.00% GC)
maximum time: 56.601 ms (0.00% GC)
--------------
samples: 91
evals/sample: 1
With @avx, it becomes slightly faster, and for the transposed SnpArray it is now faster than SnpBitMatrix. (Maybe we can accelerate the BitMatrix operation further with @avx?)
using SnpArrays
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
const EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=false, scale=false);
size(EUR)
(379, 54051)
using LinearAlgebra
using BenchmarkTools
using LoopVectorization
function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    packedstride = size(A.data, 1)
    m, n = size(A)
    fill!(y, zero(eltype(y)))
    @avx for j ∈ eachindex(x)
        for i ∈ eachindex(y)
            Aij = A[i, j]
            y[i] += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
    end
    y
end
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2));
@benchmark LinearAlgebra.mul!($v1, $EUR, $v2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 55.302 ms (0.00% GC)
median time: 56.611 ms (0.00% GC)
mean time: 56.698 ms (0.00% GC)
maximum time: 58.219 ms (0.00% GC)
--------------
samples: 89
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v1, $EURbm, $v2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 48.817 ms (0.00% GC)
median time: 49.528 ms (0.00% GC)
mean time: 49.651 ms (0.00% GC)
maximum time: 52.581 ms (0.00% GC)
--------------
samples: 101
evals/sample: 1
function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)
    @avx for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
@benchmark LinearAlgebra.mul!($v2, transpose($EUR), $v1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 51.079 ms (0.00% GC)
median time: 51.783 ms (0.00% GC)
mean time: 51.827 ms (0.00% GC)
maximum time: 53.326 ms (0.00% GC)
--------------
samples: 97
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v2, transpose($EURbm), $v1))
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 54.780 ms (0.00% GC)
median time: 55.580 ms (0.00% GC)
mean time: 56.006 ms (0.00% GC)
maximum time: 61.846 ms (0.00% GC)
--------------
samples: 90
evals/sample: 1
I think this is doable, and operations based on 8-bit integers are more easily translated to other targets (e.g., GPUs) than the current BitMatrix-based version.
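A minimal sketch of what such an 8-bit path might look like, assuming the 2-bit codes have first been widened into a plain Matrix{UInt8} (that expansion step, and the function name, are hypothetical):

using LoopVectorization

# Hypothetical: A8 holds the genotype codes widened to one byte per entry.
function mul_uint8!(y::Vector{T}, A8::Matrix{UInt8}, x::Vector{T}) where T
    fill!(y, zero(T))
    @avx for j in 1:size(A8, 2)
        for i in 1:size(A8, 1)
            a = A8[i, j]
            y[i] += ((a >= 2) + (a >= 3)) * x[j]  # same additive-dosage mapping as above
        end
    end
    return y
end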