LoopVectorization.jl would be useful for this.
Example: matrix-vector multiplication r = A*v, where A[i, j] = a[i] * b[j] * I(i <= j) and I is the indicator function; that is, r[i] = a[i] * sum(b[j] * v[j] for j in i:m). With LoopVectorization.jl v0.6.21:
versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)
using LoopVectorization
using BenchmarkTools
function matvec_basic!(out, a, b, v)
    m = length(a)
    @inbounds @simd for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            if i <= j
                outi += v[j] * a[i] * b[j]
            end
        end
        out[i] = outi
    end
    return out
end
matvec_basic! (generic function with 1 method)
function matvec_avx!(out, a, b, v)
    m = length(a)
    # direct operations on loop indices are not supported (as of v0.6.21),
    # so materialize the indices into an array
    indarr = collect(1:m)
    @avx for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            outi += ifelse(indarr[i] <= indarr[j],
                           v[j] * a[i] * b[j], zero(eltype(a)))
        end
        out[i] = outi
    end
    return out
end
matvec_avx! (generic function with 1 method)
sz = 500
a = randn(sz); b = randn(sz); v = randn(sz)
out_basic = randn(sz); out_avx = randn(sz);
isapprox(matvec_avx!(out_avx, a, b, v), matvec_basic!(out_basic, a, b, v))
true
@benchmark matvec_avx!($out_avx, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 4.06 KiB
allocs estimate: 1
--------------
minimum time: 84.622 μs (0.00% GC)
median time: 95.164 μs (0.00% GC)
mean time: 96.968 μs (0.00% GC)
maximum time: 167.482 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark matvec_basic!($out_basic, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 213.102 μs (0.00% GC)
median time: 227.754 μs (0.00% GC)
mean time: 231.575 μs (0.00% GC)
maximum time: 322.697 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
The implementation with LoopVectorization is much faster. However, it should be used with caution, as the package is still under rapid development. Some Julia syntax is unsupported (e.g., if-else statements), and unexpected numerical bugs may arise. For example, changing outi += ifelse(indarr[i] <= indarr[j], v[j] * a[i] * b[j], zero(eltype(a)))
to outi += ifelse(i <= j, v[j] * a[i] * b[j], zero(eltype(a)))
silently gives an incorrect result, without any warning.
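Given such silent failures, a cheap consistency check against the plain-Julia reference is worth keeping nearby. A minimal sketch using the two functions defined above; the odd length is deliberate, so that remainder lanes are exercised:

let m = 257  # deliberately not a multiple of the SIMD width
    a, b, v = randn(m), randn(m), randn(m)
    @assert matvec_avx!(similar(a), a, b, v) ≈ matvec_basic!(similar(a), a, b, v)
end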
It now supports some BitMatrix operations (added less than 24 hours ago).
This is a related issue. For optimal performance, we need to fill in the two TODOs for the case where the matrix size is not a multiple of 512, and also do the unrolling mentioned in that post.
using VectorizationBase: gesp, stridedpointer
function gemv_avx_512by512!(c, A, b)
    @avx for j in 1:512, i in 1:512
        c[i] += A[i, j] * b[j]
    end
end

function gemv_tile!(c, A, b)
    M, N = size(A)
    Miter = M >>> 9 # fast div(M, 512)
    Mrem = M & 511  # fast rem(M, 512)
    Niter = N >>> 9
    Nrem = N & 511
    GC.@preserve c A b for n in 0:Niter-1
        for m in 0:Miter-1
            gemv_avx_512by512!(
                gesp(stridedpointer(c), (512m,)),
                gesp(stridedpointer(A), (8m, 8n)),
                gesp(stridedpointer(b), (512n,))
            )
        end
        # TODO: handle Mrem
    end
    # TODO: handle Nrem
end
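One way the two TODOs could be filled in is sketched below, using views and a kernel with dynamic trip counts instead of gesp/stridedpointer arithmetic. This is an untested illustration under those assumptions, not the approach from that post; gemv_avx! and gemv_tile_rem! are hypothetical names.

# Hypothetical generic kernel: same body as gemv_avx_512by512!, but with
# dynamic trip counts so it can also digest leftover rows/columns.
function gemv_avx!(c, A, b)
    @avx for j in 1:size(A, 2), i in 1:size(A, 1)
        c[i] += A[i, j] * b[j]
    end
end

function gemv_tile_rem!(c, A, b)
    M, N = size(A)
    Mrem = M & 511    # leftover rows
    Nrem = N & 511    # leftover columns
    Mmain = M - Mrem  # largest multiple of 512 that fits in M
    for n in 0:(N >>> 9)-1
        cols = 512n+1:512n+512
        for m in 0:(M >>> 9)-1
            rows = 512m+1:512m+512
            gemv_avx!(view(c, rows), view(A, rows, cols), view(b, cols))
        end
        # remainder rows for this block of columns
        Mrem > 0 && gemv_avx!(view(c, Mmain+1:M), view(A, Mmain+1:M, cols), view(b, cols))
    end
    if Nrem > 0  # remainder columns, over the full height
        cols = N-Nrem+1:N
        gemv_avx!(view(c, 1:M), view(A, 1:M, cols), view(b, cols))
    end
    return c
end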
I was going to tackle it, but you are much better at this; would you want to give it a shot? Except this is still within the regime of converting the SnpArray into a SnpBitMatrix, so perhaps this is not all that related.
For example, changing outi += ifelse(indarr[i] <= indarr[j], v[j] * a[i] * b[j], zero(eltype(a))) to outi += ifelse(i <= j, v[j] * a[i] * b[j], zero(eltype(a))) silently gives an incorrect result, without any warning.
Please file issues when you notice problems! Some bugs are straightforward to fix -- I just need to be aware of them. This will work on LoopVectorization 0.6.27, which should be released within about an hour of this post.
I assume there's a reason you didn't just make your inner loop for j in i:m and get rid of the if statement?
using LoopVectorization, BenchmarkTools
function matvec_avx!(out, a, b, v)
    m = length(a)
    @avx for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            outi += ifelse(i <= j,
                           v[j] * a[i] * b[j], zero(eltype(a)))
        end
        out[i] = outi
    end
    return out
end

function matvec_basic!(out, a, b, v)
    m = length(a)
    @inbounds @simd for i in 1:m
        outi = zero(eltype(a))
        for j in 1:m
            if i <= j
                outi += v[j] * a[i] * b[j]
            end
        end
        out[i] = outi
    end
    return out
end

function matvec_basic_iv!(out, a, b, v)
    m = length(a)
    @inbounds for i in 1:m
        outi = zero(eltype(a))
        @simd for j in i:m
            outi += v[j] * a[i] * b[j]
        end
        out[i] = outi
    end
    return out
end

function matvec_avx_iv!(out, a, b, v)
    m = length(a)
    @inbounds for i in 1:m
        outi = zero(eltype(a))
        @avx for j in i:m
            outi += v[j] * a[i] * b[j]
        end
        out[i] = outi
    end
    return out
end
sz = 500
a = randn(sz); b = randn(sz); v = randn(sz);
out_basic = randn(sz);
out_avx = randn(sz);
out_basic_iv = randn(sz);
out_avx_iv = randn(sz);
matvec_avx!(out_avx, a, b, v) ≈ matvec_basic!(out_basic, a, b, v)
matvec_basic_iv!(out_basic_iv, a, b, v) ≈ out_basic
matvec_avx_iv!(out_avx_iv, a, b, v) ≈ out_basic
@benchmark matvec_avx!($out_avx, $a, $b, $v)
@benchmark matvec_basic!($out_basic, $a, $b, $v)
@benchmark matvec_avx_iv!($out_avx_iv, $a, $b, $v)
@benchmark matvec_basic_iv!($out_basic_iv, $a, $b, $v)
This yields:
julia> matvec_avx!(out_avx, a, b, v) ≈ matvec_basic!(out_basic, a, b, v)
true
julia> matvec_basic_iv!(out_basic_iv, a, b, v) ≈ out_basic
true
julia> matvec_avx_iv!(out_avx_iv, a, b, v) ≈ out_basic
true
julia> @benchmark matvec_avx!($out_avx, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.449 μs (0.00% GC)
median time: 16.464 μs (0.00% GC)
mean time: 16.487 μs (0.00% GC)
maximum time: 31.922 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark matvec_basic!($out_basic, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 163.726 μs (0.00% GC)
median time: 163.804 μs (0.00% GC)
mean time: 165.766 μs (0.00% GC)
maximum time: 214.120 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark matvec_avx_iv!($out_avx_iv, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.923 μs (0.00% GC)
median time: 9.955 μs (0.00% GC)
mean time: 9.970 μs (0.00% GC)
maximum time: 23.192 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark matvec_basic_iv!($out_basic_iv, $a, $b, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.941 μs (0.00% GC)
median time: 16.034 μs (0.00% GC)
mean time: 16.055 μs (0.00% GC)
maximum time: 27.664 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
As @biona001 notes, you'll get better performance with extra tiling.
@chriselrod Thanks for coming here. I'm impressed with the quick development of this package. My actual computation is block-triangular (I see my loop is not the best way to express it). There are a couple more issues I found, and I will file them soon.
These work for the additive model, with no centering or scaling. They are about twice as slow as SnpBitMatrix, but memory-efficient (no allocation of the two BitMatrixes), with or without the @avx macro.
function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Aij = A[i, j]
            # maps the 2-bit code to the additive dosage:
            # 0x00 -> 0, 0x01 (missing) -> 0, 0x02 -> 1, 0x03 -> 2
            yi += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end

function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)
    for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
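For reference, the expression ((Aij >= 2) + (Aij >= 3)) maps each 2-bit genotype code to its additive dosage, treating the missing code 0x01 as zero; a quick check of the arithmetic:

for g in 0x00:0x03
    println(g, " => ", (g >= 2) + (g >= 3))  # 0x00 => 0, 0x01 => 0, 0x02 => 1, 0x03 => 2
end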
Design choice: Should I add

model::Union{Val{1}, Val{2}, Val{3}}
center::Bool
scale::Bool
μ::Vector{T}
σinv::Vector{T}
storagev1::Vector{T}
storagev2::Vector{T}

to struct SnpArray, or should I create a separate struct, say, SnpLinAlg?
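A separate wrapper might look something like the sketch below (purely illustrative; the field names come from the list above, and putting the model in a type parameter is one way to avoid a Union-typed field):

struct SnpLinAlg{T, M}  # hypothetical; M ∈ (1, 2, 3) selects the genetic model at compile time
    s::SnpArray
    center::Bool
    scale::Bool
    μ::Vector{T}
    σinv::Vector{T}
    storagev1::Vector{T}
    storagev2::Vector{T}
end

This would keep SnpArray itself untouched and confine the linear-algebra state (means, inverse standard deviations, work buffers) to the wrapper.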
My bad. I made a mistake with the loop ordering. The computation with SnpArray is only 12-15% slower.
using SnpArrays
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
const EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=false, scale=false);
size(EUR)
(379, 54051)
using LinearAlgebra
using BenchmarkTools
using LoopVectorization
function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    packedstride = size(A.data, 1)
    m, n = size(A)
    fill!(y, zero(eltype(y)))
    @avx for j ∈ eachindex(x)
        for i ∈ eachindex(y)
            Aij = A[i, j]
            y[i] += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
    end
    y
end
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2));
@benchmark LinearAlgebra.mul!($v1, $EUR, $v2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 55.440 ms (0.00% GC)
median time: 56.042 ms (0.00% GC)
mean time: 56.163 ms (0.00% GC)
maximum time: 62.929 ms (0.00% GC)
--------------
samples: 90
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v1, $EURbm, $v2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 49.126 ms (0.00% GC)
median time: 49.592 ms (0.00% GC)
mean time: 49.645 ms (0.00% GC)
maximum time: 50.705 ms (0.00% GC)
--------------
samples: 101
evals/sample: 1
function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)
    for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
@benchmark LinearAlgebra.mul!($v2, transpose($EUR), $v1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 62.287 ms (0.00% GC)
median time: 63.172 ms (0.00% GC)
mean time: 63.230 ms (0.00% GC)
maximum time: 65.554 ms (0.00% GC)
--------------
samples: 80
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v2, transpose($EURbm), $v1))
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 54.606 ms (0.00% GC)
median time: 55.116 ms (0.00% GC)
mean time: 55.210 ms (0.00% GC)
maximum time: 56.601 ms (0.00% GC)
--------------
samples: 91
evals/sample: 1
With @avx, it becomes slightly faster, and for the transposed SnpArray it is now faster than SnpBitMatrix. (Maybe we can accelerate the BitMatrix operation further with @avx?)
using SnpArrays
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
const EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=false, scale=false);
size(EUR)
(379, 54051)
using LinearAlgebra
using BenchmarkTools
using LoopVectorization
function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    packedstride = size(A.data, 1)
    m, n = size(A)
    fill!(y, zero(eltype(y)))
    @avx for j ∈ eachindex(x)
        for i ∈ eachindex(y)
            Aij = A[i, j]
            y[i] += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
    end
    y
end
v1 = randn(size(EUR, 1))
v2 = randn(size(EUR, 2));
@benchmark LinearAlgebra.mul!($v1, $EUR, $v2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 55.302 ms (0.00% GC)
median time: 56.611 ms (0.00% GC)
mean time: 56.698 ms (0.00% GC)
maximum time: 58.219 ms (0.00% GC)
--------------
samples: 89
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v1, $EURbm, $v2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 48.817 ms (0.00% GC)
median time: 49.528 ms (0.00% GC)
mean time: 49.651 ms (0.00% GC)
maximum time: 52.581 ms (0.00% GC)
--------------
samples: 101
evals/sample: 1
function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)
    @avx for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
@benchmark LinearAlgebra.mul!($v2, transpose($EUR), $v1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 51.079 ms (0.00% GC)
median time: 51.783 ms (0.00% GC)
mean time: 51.827 ms (0.00% GC)
maximum time: 53.326 ms (0.00% GC)
--------------
samples: 97
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v2, transpose($EURbm), $v1))
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 54.780 ms (0.00% GC)
median time: 55.580 ms (0.00% GC)
mean time: 56.006 ms (0.00% GC)
maximum time: 61.846 ms (0.00% GC)
--------------
samples: 90
evals/sample: 1
I think this is doable, and operations based on 8-bit integers are more easily translated to other targets (e.g., GPUs) than the current BitMatrix-based version.
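A minimal sketch of what such an 8-bit path might look like, assuming the 2-bit codes have first been widened into a plain Matrix{UInt8} (that expansion step, and the function name, are hypothetical):

using LoopVectorization

# Hypothetical: A8 holds the genotype codes widened to one byte per entry.
function mul_uint8!(y::Vector{T}, A8::Matrix{UInt8}, x::Vector{T}) where T
    fill!(y, zero(T))
    @avx for j in 1:size(A8, 2)
        for i in 1:size(A8, 1)
            a = A8[i, j]
            y[i] += ((a >= 2) + (a >= 3)) * x[j]  # same additive-dosage mapping as above
        end
    end
    return y
end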