Should I create SnpLinAlg as in the draft, or should I merge it into the SnpArray struct?
The Julia version lower bound for LoopVectorization.jl is 1.1.
using SnpArrays
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
const EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=false, scale=false);
Let's try the EUR data repeated 100 times: 37900 × 54051.
EUR_10 = [EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR]
EUR_100 = [EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10]
37900×54051 SnpArray:
0x03 0x03 0x03 0x02 0x02 0x03 … 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x02 0x03 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x03 0x03 0x03 0x02 0x02 0x02 0x03 0x03 0x02
0x03 0x03 0x03 0x00 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x00 0x03 0x03 0x02 0x02 0x02 0x03 0x03 0x03
0x02 0x03 0x03 0x03 0x03 0x03 … 0x03 0x03 0x03 0x03 0x03 0x02
0x02 0x03 0x03 0x02 0x02 0x03 0x03 0x03 0x02 0x02 0x03 0x03
0x02 0x03 0x03 0x03 0x02 0x02 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x00 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x02 0x03 0x03 0x02 0x03 0x02 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x02 0x03 0x03 … 0x03 0x03 0x02 0x02 0x03 0x03
0x03 0x03 0x03 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x02
0x03 0x02 0x03 0x02 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03
⋮ ⋮ ⋱ ⋮ ⋮
0x03 0x03 0x03 0x00 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x02 0x02 0x03 0x02 0x02 0x02 0x03 0x02 0x03
0x03 0x03 0x03 0x02 0x02 0x03 … 0x03 0x03 0x03 0x03 0x03 0x03
0x02 0x03 0x03 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x00 0x00 0x03 0x02 0x02 0x02 0x03 0x03 0x03
0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x02 0x03 0x03 … 0x03 0x03 0x03 0x03 0x03 0x03
0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x02 0x02 0x03 0x03
0x03 0x03 0x03 0x00 0x03 0x03 0x03 0x03 0x03 0x03 0x03 0x03
0x02 0x03 0x03 0x02 0x00 0x02 0x03 0x03 0x03 0x03 0x03 0x03
0x03 0x03 0x03 0x02 0x02 0x03 0x03 0x03 0x03 0x03 0x03 0x03
EUR_100_bm = SnpBitMatrix{Float64}(EUR_100, model=ADDITIVE_MODEL, center=false, scale=false);
Before we begin, we define BitMatrix-vector multiplication with @avx:
using LinearAlgebra, LoopVectorization
function LinearAlgebra.mul!(c::Vector{T}, A::BitMatrix, b::Vector{T}) where T
    fill!(c, zero(eltype(c)))
    @avx for j in 1:size(A, 2)
        for i in 1:size(A, 1)
            c[i] += A[i, j] * b[j]
        end
    end
    c
end

function LinearAlgebra.mul!(c::Vector{T}, A::Transpose{Bool,BitArray{2}}, b::Vector{T}) where T
    tA = transpose(A)  # the parent BitMatrix
    fill!(c, zero(eltype(c)))
    @avx for i in 1:size(tA, 2)
        for j in 1:size(tA, 1)
            c[i] += tA[j, i] * b[j]
        end
    end
    c
end
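As a quick sanity check (my addition, not in the original post), both methods can be compared against dense multiplication on a small random BitMatrix:

using Random
Asmall = bitrand(100, 50)
b = randn(50); c = zeros(100)
mul!(c, Asmall, b)
@assert c ≈ Float64.(Asmall) * b        # forward product matches dense
d = randn(100); e = zeros(50)
mul!(e, transpose(Asmall), d)
@assert e ≈ Float64.(Asmall)' * d       # transposed product matches dense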
using BenchmarkTools
Direct linear algebra on a SnpArray:
function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    fill!(y, zero(eltype(y)))
    @avx for j ∈ eachindex(x)
        for i ∈ eachindex(y)
            Aij = A[i, j]
            # additive dosage from the 2-bit genotype code
            y[i] += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
    end
    y
end
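The expression ((Aij >= 2) + (Aij >= 3)) maps the four 2-bit PLINK genotype codes to additive-model dosages, with missing genotypes (0x01) effectively treated as 0. A minimal illustration of the mapping:

for code in 0x00:0x03   # 0x00 hom. A1, 0x01 missing, 0x02 het., 0x03 hom. A2
    println(code, " => ", (code >= 2) + (code >= 3))   # prints 0, 0, 1, 2
end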
v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2));
@benchmark LinearAlgebra.mul!($v1, $EUR_100, $v2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.584 s (0.00% GC)
median time: 5.584 s (0.00% GC)
mean time: 5.584 s (0.00% GC)
maximum time: 5.584 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
Below is the benchmark for SnpBitMatrix:
@benchmark (LinearAlgebra.mul!($v1, $EUR_100_bm, $v2))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.298 s (0.00% GC)
median time: 1.304 s (0.00% GC)
mean time: 1.308 s (0.00% GC)
maximum time: 1.326 s (0.00% GC)
--------------
samples: 4
evals/sample: 1
Let's try CUDA. The device is an Nvidia Titan V.
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
using CUDA
using Adapt
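As an optional check (my addition, using standard CUDA.jl queries), one can confirm the GPU is usable before moving data:

@assert CUDA.functional()
CUDA.name(CUDA.device())   # should report the Titan V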
Moving data to GPU:
EUR_100_d = adapt(CuArray, EUR_100.data)
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);
┌ Warning: `haskey(::TargetIterator, name::String)` is deprecated, use `Target(; name = name) !== nothing` instead.
│ caller = llvm_compat(::VersionNumber) at compatibility.jl:176
└ @ CUDA /home/kose/.julia/packages/CUDA/5t6R9/deps/compatibility.jl:176
This is the kernel to compute Ax.
function snparray_gemv_kernel!(y, A, x)
    packedstride = size(A, 1)  # bytes per SNP column in the packed data
    index_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride_x = blockDim().x * gridDim().x
    # grid-stride loop: each thread handles every stride_x-th sample
    for i = index_x:stride_x:length(y)
        yi = zero(eltype(y))
        for j = 1:length(x)
            k = 2 * ((i - 1) & 3)                                    # bit offset of sample i within its byte
            block = A[(j - 1) * packedstride + ((i - 1) >> 2) + 1]   # byte holding sample i of SNP j
            Aij = (block >> k) & 3                                   # the 2-bit genotype code
            @inbounds yi += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
        y[i] = yi
    end
    return
end
snparray_gemv_kernel! (generic function with 1 method)
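For reference, here is a sketch (my own, not part of the PR) of the same unpacking arithmetic as a standalone CPU function; each UInt8 of the packed data holds four 2-bit genotypes, so sample i of SNP j sits in byte ((i-1) >> 2) + 1 of column j, at bit offset 2*((i-1) & 3):

# CPU equivalent of the kernel's unpacking arithmetic (illustration only)
function unpack_genotype(data::AbstractMatrix{UInt8}, i::Integer, j::Integer)
    packedstride = size(data, 1)                                 # bytes per SNP column
    block = data[(j - 1) * packedstride + ((i - 1) >> 2) + 1]    # byte holding sample i of SNP j
    k = 2 * ((i - 1) & 3)                                        # bit offset within that byte
    (block >> k) & 0x03                                          # the 2-bit genotype code
end

For any i and j, unpack_genotype(EUR_100.data, i, j) should equal EUR_100[i, j].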
function bench_snparray_gemv!(y, A, x)
    numblocks = ceil(Int, length(y) / 256)
    CUDA.@sync begin
        @cuda threads=256 blocks=numblocks snparray_gemv_kernel!(y, A, x)
    end
end
bench_snparray_gemv! (generic function with 1 method)
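The launch configuration above hard-codes 256 threads per block. As an aside (a sketch following the occupancy-API pattern from the CUDA.jl documentation), the runtime can suggest a block size instead:

function snparray_gemv_occupancy!(y, A, x)
    kernel = @cuda launch=false snparray_gemv_kernel!(y, A, x)
    config = launch_configuration(kernel.fun)   # occupancy-based suggestion
    threads = min(length(y), config.threads)
    blocks = cld(length(y), threads)
    CUDA.@sync kernel(y, A, x; threads=threads, blocks=blocks)
end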
using BenchmarkTools
@benchmark bench_snparray_gemv!($v1_d, $EUR_100_d, $v2_d)
┌ Warning: `Target(triple::String)` is deprecated, use `Target(; triple = triple)` instead.
│ caller = ip:0x0
└ @ Core :-1
BenchmarkTools.Trial:
memory estimate: 1.33 KiB
allocs estimate: 49
--------------
minimum time: 22.611 ms (0.00% GC)
median time: 22.719 ms (0.00% GC)
mean time: 22.735 ms (0.00% GC)
maximum time: 26.494 ms (0.00% GC)
--------------
samples: 220
evals/sample: 1
The speedup is clear: roughly 250× over the direct CPU version and about 60× over SnpBitMatrix. Let's check correctness:
isapprox(collect(v1_d), v1)
true
v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2))
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);
function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)  # the parent SnpArray
    @avx for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
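As a quick cross-check (my addition), the transposed SnpArray product can be compared against the SnpBitMatrix path benchmarked below:

w_sa = zeros(size(EUR_100, 2))
mul!(w_sa, transpose(EUR_100), v1)
w_bm = zeros(size(EUR_100, 2))
mul!(w_bm, transpose(EUR_100_bm), v1)
@assert w_sa ≈ w_bm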
@benchmark LinearAlgebra.mul!($v2, transpose($EUR_100), $v1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 5.706 s (0.00% GC)
median time: 5.706 s (0.00% GC)
mean time: 5.706 s (0.00% GC)
maximum time: 5.706 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
@benchmark (LinearAlgebra.mul!($v2, transpose($EUR_100_bm), $v1))
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 603.362 ms (0.00% GC)
median time: 611.266 ms (0.00% GC)
mean time: 615.139 ms (0.00% GC)
maximum time: 639.640 ms (0.00% GC)
--------------
samples: 9
evals/sample: 1
function snparray_gemv_kernel_2!(y, A, x)  # y = A^T x
    packedstride = size(A, 1)  # bytes per SNP column in the packed data
    index_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride_x = blockDim().x * gridDim().x
    # grid-stride loop over the output (one entry per SNP)
    for i = index_x:stride_x:length(y)
        yi = zero(eltype(y))
        for j = 1:length(x)
            k = 2 * ((j - 1) & 3)                                    # bit offset of sample j within its byte
            block = A[(i - 1) * packedstride + ((j - 1) >> 2) + 1]   # byte holding sample j of SNP i
            Aij = (block >> k) & 3                                   # the 2-bit genotype code
            @inbounds yi += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
        y[i] = yi
    end
    return
end
snparray_gemv_kernel_2! (generic function with 1 method)
function bench_snparray_gemv_2!(y, A, x)
    numblocks = ceil(Int, length(y) / 256)
    CUDA.@sync begin
        @cuda threads=256 blocks=numblocks snparray_gemv_kernel_2!(y, A, x)
    end
end
bench_snparray_gemv_2! (generic function with 1 method)
@benchmark bench_snparray_gemv_2!($v2_d, $EUR_100_d, $v1_d)
BenchmarkTools.Trial:
memory estimate: 1.33 KiB
allocs estimate: 49
--------------
minimum time: 27.135 ms (0.00% GC)
median time: 27.317 ms (0.00% GC)
mean time: 27.699 ms (0.00% GC)
maximum time: 49.526 ms (0.00% GC)
--------------
samples: 181
evals/sample: 1
isapprox(collect(v2_d), v2)
true
The Julia version lower bound for CUDA.jl is 1.4.
After these benchmarks, I don't see direct linear algebra on the CPU becoming the default linear-algebra mode, so I will just add SnpLinAlg separately.
For completeness, here are the numbers for the existing linear-algebra mode (BitMatrix without @avx).
Timing for Ax:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.850 s (0.00% GC)
median time: 4.881 s (0.00% GC)
mean time: 4.881 s (0.00% GC)
maximum time: 4.912 s (0.00% GC)
--------------
samples: 2
evals/sample: 1
Timing for A^T x:
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 5.560 s (0.00% GC)
median time: 5.560 s (0.00% GC)
mean time: 5.560 s (0.00% GC)
maximum time: 5.560 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
CPU information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping: 4
CPU MHz: 800.155
CPU max MHz: 2201.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
@Hua-Zhou It looks like we need to make one decision before we merge it. Should I change the GPU code to the deprecated CUDAnative.jl/CuArrays.jl combination to support Julia 1.1+, or should I leave it as is, using CUDA.jl and supporting only Julia 1.4+?
Considering Julia v1.5 is imminent, I think it's fine to support v1.4+ only.
OK. I am merging it then.