OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest

Improve linear algebra #57

Closed kose-y closed 4 years ago

kose-y commented 4 years ago

Should I create SnpLinAlg as in the draft, or should I merge it into the SnpArray struct?
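
For reference, a minimal sketch of the separate-type option, assuming a thin wrapper (field names and layout here are guesses, not a settled API):

using SnpArrays, LinearAlgebra

# Hypothetical sketch: keep the packed SnpArray as-is and let mul! dispatch
# on the wrapper type, so SnpArray itself stays a pure storage type.
struct SnpLinAlg{T} <: AbstractMatrix{T}
    s::SnpArray    # underlying packed genotypes, no copy
    center::Bool
    scale::Bool
end

Base.size(sla::SnpLinAlg) = size(sla.s)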

coveralls commented 4 years ago


Coverage decreased (-13.6%) to 81.215% when pulling 2b9396b481ffc2e8b6a77fe537dc8a1494d5633a on kose-y:master into fe9015b412c19e68fd08cbc1498e921c8aa884a3 on OpenMendel:master.

kose-y commented 4 years ago

Julia version lower bound for LoopVectorization.jl is 1.1.

kose-y commented 4 years ago

SnpArray linear algebra with LoopVectorization and CUDA

using SnpArrays
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
const EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=false, scale=false);

Let's try the EUR data repeated 100 times: 37900 × 54051.

EUR_10 = [EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR]
EUR_100 = [EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10]
37900×54051 SnpArray:
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x02  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x02
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x02
 0x02  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x02
 0x03  0x02  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    ⋮                             ⋮  ⋱     ⋮                             ⋮
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x02  0x03     0x02  0x02  0x02  0x03  0x02  0x03
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x00  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x00  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
EUR_100_bm = SnpBitMatrix{Float64}(EUR_100, model=ADDITIVE_MODEL, center=false, scale=false);

Before we begin, we define BitMatrix-vector multiplication with @avx (SnpBitMatrix multiplication is carried out on its internal BitMatrix fields, so these methods are what its mul! ends up calling):

using LinearAlgebra, LoopVectorization
function LinearAlgebra.mul!(c::Vector{T}, A::BitMatrix, b::Vector{T}) where T
    fill!(c, zero(eltype(c)))
    @avx for j in 1:size(A, 2)
        for i in 1:size(A, 1)
            c[i] += A[i, j] * b[j]
        end
    end
    c  # mul! returns the destination
end

function LinearAlgebra.mul!(c::Vector{T}, A::Transpose{Bool,BitMatrix}, b::Vector{T}) where T
    tA = transpose(A)  # unwrap to the parent BitMatrix
    fill!(c, zero(eltype(c)))
    @avx for i in 1:size(tA, 2)
        for j in 1:size(tA, 1)
            c[i] += tA[j, i] * b[j]
        end
    end
    c
end

$y = Ax$

using LinearAlgebra
using BenchmarkTools
using LoopVectorization

Direct linear algebra on a SnpArray:

function LinearAlgebra.mul!(y::Vector{T}, A::SnpArray, x::Vector{T}) where T
    fill!(y, zero(eltype(y)))
    @avx for j ∈ eachindex(x)
        for i ∈ eachindex(y)
            Aij = A[i, j]
            # additive model: count copies of A2, treating missing (0x01) as 0
            y[i] += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
    end
    y
end
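
The expression ((Aij >= 2) + (Aij >= 3)) converts the 2-bit PLINK codes into additive dosages, with missing genotypes mapped to 0; a quick check of the mapping:

# 0x00 (homozygous A1) ↦ 0, 0x01 (missing) ↦ 0,
# 0x02 (heterozygous) ↦ 1, 0x03 (homozygous A2) ↦ 2
@assert [(g >= 2) + (g >= 3) for g in 0x00:0x03] == [0, 0, 1, 2]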
v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2));
@benchmark LinearAlgebra.mul!($v1, $EUR_100, $v2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.584 s (0.00% GC)
  median time:      5.584 s (0.00% GC)
  mean time:        5.584 s (0.00% GC)
  maximum time:     5.584 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

Below is the benchmark for SnpBitMatrix:

@benchmark (LinearAlgebra.mul!($v1, $EUR_100_bm, $v2))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.298 s (0.00% GC)
  median time:      1.304 s (0.00% GC)
  mean time:        1.308 s (0.00% GC)
  maximum time:     1.326 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

Let's try CUDA. The device is an Nvidia Titan V.

ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
using CUDA
using Adapt

Moving data to GPU:

EUR_100_d = adapt(CuArray, EUR_100.data)
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);
┌ Warning: `haskey(::TargetIterator, name::String)` is deprecated, use `Target(; name = name) !== nothing` instead.
│   caller = llvm_compat(::VersionNumber) at compatibility.jl:176
└ @ CUDA /home/kose/.julia/packages/CUDA/5t6R9/deps/compatibility.jl:176

This is the kernel to compute $y = Ax$. Each thread accumulates one entry of y in a grid-stride loop, decoding the packed 2-bit genotypes on the fly.

function snparray_gemv_kernel!(y, A, x)
    packedstride = size(A, 1)  # bytes per SNP column in the packed data
    index_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride_x = blockDim().x * gridDim().x

    # grid-stride loop: each thread handles rows i = index_x, index_x + stride_x, ...
    for i = index_x:stride_x:length(y)
        yi = zero(eltype(y))
        for j = 1:length(x)
            k = 2 * ((i-1) & 3)                                # bit offset of sample i within its byte
            block = A[(j-1) * packedstride + ((i-1) >> 2) + 1] # byte holding sample i, SNP j
            Aij = (block >> k) & 3                             # 2-bit genotype code
            @inbounds yi += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
        y[i] = yi
    end
    return nothing  # CUDA kernels must return nothing
end
snparray_gemv_kernel! (generic function with 1 method)
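
The index arithmetic in the kernel is the usual .bed unpacking: four samples per byte within each SNP column. A CPU equivalent of the decode, for reference (the helper name unpack_genotype is ours, for illustration):

# `data` is the packed UInt8 storage (SnpArray's .data field).
function unpack_genotype(data::AbstractMatrix{UInt8}, i::Integer, j::Integer)
    packedstride = size(data, 1)                               # bytes per SNP column
    block = data[(j - 1) * packedstride + ((i - 1) >> 2) + 1]  # byte holding sample i, SNP j
    k = 2 * ((i - 1) & 3)                                      # bit offset within the byte
    (block >> k) & 0x03                                        # 2-bit genotype code
end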
function bench_snparray_gemv!(y, A, x)
    numblocks = ceil(Int, length(y)/256)
    CUDA.@sync begin
        @cuda threads=256 blocks=numblocks snparray_gemv_kernel!(y, A, x)
    end
end
bench_snparray_gemv! (generic function with 1 method)
using BenchmarkTools
@benchmark bench_snparray_gemv!($v1_d, $EUR_100_d, $v2_d)
┌ Warning: `Target(triple::String)` is deprecated, use `Target(; triple = triple)` instead.
│   caller = ip:0x0
└ @ Core :-1

BenchmarkTools.Trial: 
  memory estimate:  1.33 KiB
  allocs estimate:  49
  --------------
  minimum time:     22.611 ms (0.00% GC)
  median time:      22.719 ms (0.00% GC)
  mean time:        22.735 ms (0.00% GC)
  maximum time:     26.494 ms (0.00% GC)
  --------------
  samples:          220
  evals/sample:     1

The speedup is obvious. Let's check correctness:

isapprox(collect(v1_d), v1)
true

$A^T x$

v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2))
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);
function LinearAlgebra.mul!(y::Vector{T}, A::LinearAlgebra.Transpose{UInt8,SnpArray}, x::Vector{T}) where T
    At = transpose(A)  # unwrap to the parent SnpArray
    @avx for i ∈ eachindex(y)
        yi = zero(eltype(y))
        for j ∈ eachindex(x)
            Atji = At[j, i]
            yi += ((Atji >= 2) + (Atji >= 3)) * x[j]
        end
        y[i] = yi
    end
    y
end
@benchmark LinearAlgebra.mul!($v2, transpose($EUR_100), $v1)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     5.706 s (0.00% GC)
  median time:      5.706 s (0.00% GC)
  mean time:        5.706 s (0.00% GC)
  maximum time:     5.706 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1
@benchmark (LinearAlgebra.mul!($v2, transpose($EUR_100_bm), $v1))
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     603.362 ms (0.00% GC)
  median time:      611.266 ms (0.00% GC)
  mean time:        615.139 ms (0.00% GC)
  maximum time:     639.640 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1
function snparray_gemv_kernel_2!(y, A, x) # y = A^T x
    packedstride = size(A, 1)
    index_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride_x = blockDim().x * gridDim().x

    # same grid-stride pattern as above, with the sample/SNP roles swapped
    for i = index_x:stride_x:length(y)
        yi = zero(eltype(y))
        for j = 1:length(x)
            k = 2 * ((j-1) & 3)                                # bit offset of sample j
            block = A[(i-1) * packedstride + ((j-1) >> 2) + 1] # byte holding sample j, SNP i
            Aij = (block >> k) & 3
            @inbounds yi += ((Aij >= 2) + (Aij >= 3)) * x[j]
        end
        y[i] = yi
    end
    return nothing
end
snparray_gemv_kernel_2! (generic function with 1 method)
function bench_snparray_gemv_2!(y, A, x)
    numblocks = ceil(Int, length(y)/256)
    CUDA.@sync begin
        @cuda threads=256 blocks=numblocks snparray_gemv_kernel_2!(y, A, x)
    end
end
bench_snparray_gemv_2! (generic function with 1 method)
@benchmark bench_snparray_gemv_2!($v2_d, $EUR_100_d, $v1_d)
BenchmarkTools.Trial: 
  memory estimate:  1.33 KiB
  allocs estimate:  49
  --------------
  minimum time:     27.135 ms (0.00% GC)
  median time:      27.317 ms (0.00% GC)
  mean time:        27.699 ms (0.00% GC)
  maximum time:     49.526 ms (0.00% GC)
  --------------
  samples:          181
  evals/sample:     1
isapprox(collect(v2_d), v2)
true
kose-y commented 4 years ago

Julia version lower bound for CUDA.jl is 1.4.
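
In Project.toml terms, CUDA.jl is then the binding constraint (a sketch; package version bounds are omitted since they are not decided here):

[compat]
# CUDA.jl requires Julia 1.4+, LoopVectorization 1.1+, so 1.4 is the floor
julia = "1.4"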

kose-y commented 4 years ago

After these benchmarks, I don't see direct linear algebra on the CPU becoming the default linear algebra mode, so I will just add SnpLinAlg separately.
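
Usage would then mirror SnpBitMatrix, e.g. (a hypothetical call, assuming the constructor keywords carry over):

sla = SnpLinAlg{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false)
y = zeros(size(EUR_100, 1))
LinearAlgebra.mul!(y, sla, v2)  # would run an @avx loop like the one above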

kose-y commented 4 years ago

For the sake of completeness, here are the numbers for the existing mode of linear algebra (SnpBitMatrix over plain BitMatrix multiplication, without the @avx methods above).

Timing for Ax:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.850 s (0.00% GC)
  median time:      4.881 s (0.00% GC)
  mean time:        4.881 s (0.00% GC)
  maximum time:     4.912 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

Timing for A^T x:

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     5.560 s (0.00% GC)
  median time:      5.560 s (0.00% GC)
  mean time:        5.560 s (0.00% GC)
  maximum time:     5.560 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

CPU information:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:            4
CPU MHz:             800.155
CPU max MHz:         2201.0000
CPU min MHz:         800.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
kose-y commented 4 years ago

@Hua-Zhou It looks like we need to make one decision before we merge this. Should I change the GPU code to the deprecated CUDAnative.jl/CuArrays.jl combination to support Julia 1.1+, or should I leave it as is, using CUDA.jl and supporting only Julia 1.4+?

Hua-Zhou commented 4 years ago

Considering Julia v1.5 is imminent, I think it's fine to support v1.4+ only.


kose-y commented 4 years ago

OK. I am merging it then.