OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest

use avx properly for SnpLinAlg #61

Closed kose-y closed 4 years ago

kose-y commented 4 years ago

New benchmark: SnpLinAlg is now 2x faster than before, but still about 2x slower than SnpBitMatrix. That is not yet fast enough to replace or deprecate SnpBitMatrix, but SnpLinAlg has become a good memory-efficient alternative.
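For context on the memory efficiency: each genotype in a SnpArray occupies two bits, using the standard PLINK .bed coding. A minimal sketch of the additive-model mapping in plain Julia (treating the missing code as 0.0 for simplicity; this is an illustration, not SnpArrays' actual internal handling of missing values):

```julia
# Standard PLINK .bed 2-bit genotype codes under the additive model:
#   0x00 -> homozygous A1  -> dosage 0.0
#   0x01 -> missing        -> 0.0 here (simplifying assumption)
#   0x02 -> heterozygous   -> dosage 1.0
#   0x03 -> homozygous A2  -> dosage 2.0
additive_dosage(g::UInt8) = g == 0x02 ? 1.0 : g == 0x03 ? 2.0 : 0.0

@assert additive_dosage(0x00) == 0.0
@assert additive_dosage(0x02) == 1.0
@assert additive_dosage(0x03) == 2.0
```

Four genotypes fit in one byte, so the packed matrix needs roughly 1/32 the memory of a dense Float64 matrix.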

SnpArray linear algebra with LoopVectorization and CUDA

versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_CUDA_USE_BINARYBUILDER = false
using SnpArrays
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))
const EURbm = SnpBitMatrix{Float64}(EUR, model=ADDITIVE_MODEL, center=false, scale=false);

Let's try with EUR data repeated 100 times: 37900 by 54051.

EUR_10 = [EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR]
EUR_100 = [EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10]
37900×54051 SnpArray:
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x02  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x02
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x02
 0x02  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x02
 0x03  0x02  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    ⋮                             ⋮  ⋱     ⋮                             ⋮
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x02  0x03     0x02  0x02  0x02  0x03  0x02  0x03
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x00  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x00  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
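To see why the packed representation matters at this scale: four genotypes fit in one byte, so the 37900×54051 matrix takes roughly 0.5 GiB packed versus about 15 GiB as a dense Float64 matrix. A back-of-the-envelope check (assuming each column is padded to a whole number of bytes, as in the .bed format):

```julia
n, p = 37_900, 54_051
packed_bytes = cld(n, 4) * p   # 4 genotypes per byte, each column byte-padded
dense_bytes  = 8 * n * p       # Float64 takes 8 bytes per entry
println(packed_bytes / 2^30)   # ~0.48 GiB
println(dense_bytes / 2^30)    # ~15.3 GiB
```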
EUR_100_bm = SnpBitMatrix{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false)
EUR_100_sla = SnpLinAlg{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false);
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
using CUDA
EUR_100_cu = CuSnpArray{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false);
┌ Warning: `haskey(::TargetIterator, name::String)` is deprecated, use `Target(; name = name) !== nothing` instead.
│   caller = llvm_compat(::VersionNumber) at compatibility.jl:176
└ @ CUDA /home/kose/.julia/packages/CUDA/5t6R9/deps/compatibility.jl:176

$y = Ax$

using LinearAlgebra
using BenchmarkTools
v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2));

Direct linear algebra on a SnpArray:

@benchmark LinearAlgebra.mul!($v1, $EUR_100_sla, $v2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.522 s (0.00% GC)
  median time:      2.598 s (0.00% GC)
  mean time:        2.598 s (0.00% GC)
  maximum time:     2.675 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
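For intuition about what `mul!` computes here: it is the same in-place matrix-vector product as on a dense Float64 matrix, just with on-the-fly decompression of the 2-bit codes. A small self-contained sketch with a toy stand-in (random 2-bit codes expanded to dosages; this does not use SnpArrays internals):

```julia
using LinearAlgebra, Random

Random.seed!(1)
codes = rand(UInt8[0x00, 0x01, 0x02, 0x03], 6, 4)    # toy 2-bit genotype codes
dosage(g) = g == 0x02 ? 1.0 : g == 0x03 ? 2.0 : 0.0  # missing (0x01) -> 0.0
A = dosage.(codes)                                   # dense expansion

x = randn(4)
y = zeros(6)
mul!(y, A, x)        # in-place: no new vector allocated, as in the benchmark
@assert y ≈ A * x    # agrees with the out-of-place product
```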

Below is the benchmark for SnpBitMatrix:

@benchmark (LinearAlgebra.mul!($v1, $EUR_100_bm, $v2))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.301 s (0.00% GC)
  median time:      1.342 s (0.00% GC)
  mean time:        1.338 s (0.00% GC)
  maximum time:     1.369 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

Let's try CUDA. The device is an Nvidia Titan V.

using Adapt

Moving data to GPU:

v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);
using BenchmarkTools
@benchmark LinearAlgebra.mul!($v1_d, $EUR_100_cu, $v2_d)
┌ Warning: `Target(triple::String)` is deprecated, use `Target(; triple = triple)` instead.
│   caller = ip:0x0
└ @ Core :-1

BenchmarkTools.Trial: 
  memory estimate:  1.44 KiB
  allocs estimate:  54
  --------------
  minimum time:     22.024 ms (0.00% GC)
  median time:      22.112 ms (0.00% GC)
  mean time:        22.495 ms (0.00% GC)
  maximum time:     41.697 ms (0.00% GC)
  --------------
  samples:          223
  evals/sample:     1

The speedup is obvious: roughly 100x over SnpLinAlg on the CPU (2.6 s vs 22 ms). Let's check correctness:

isapprox(collect(v1_d), v1)
true

$y = A^T x$

v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2))
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);
@benchmark LinearAlgebra.mul!($v2, transpose($EUR_100_sla), $v1)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.607 s (0.00% GC)
  median time:      1.612 s (0.00% GC)
  mean time:        1.612 s (0.00% GC)
  maximum time:     1.617 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1
@benchmark (LinearAlgebra.mul!($v2, transpose($EUR_100_bm), $v1))
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     603.551 ms (0.00% GC)
  median time:      608.606 ms (0.00% GC)
  mean time:        614.890 ms (0.00% GC)
  maximum time:     639.172 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1
@benchmark LinearAlgebra.mul!($v2_d, transpose($EUR_100_cu), $v1_d)
BenchmarkTools.Trial: 
  memory estimate:  1.42 KiB
  allocs estimate:  53
  --------------
  minimum time:     26.548 ms (0.00% GC)
  median time:      26.702 ms (0.00% GC)
  mean time:        26.833 ms (0.00% GC)
  maximum time:     29.790 ms (0.00% GC)
  --------------
  samples:          187
  evals/sample:     1
isapprox(collect(v2_d), v2)
true
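The transposed product follows the same in-place pattern: `mul!(v2, transpose(A), v1)` writes $A^T v_1$ into `v2` without materializing the transpose. A toy stand-in illustrating the pattern:

```julia
using LinearAlgebra

A = [0.0 1.0; 2.0 0.0; 1.0 2.0]   # toy stand-in for the dosage matrix
v1 = [1.0, 2.0, 3.0]              # length = number of rows
v2 = zeros(2)                     # length = number of columns
mul!(v2, transpose(A), v1)        # computes A' * v1 in place
@assert v2 ≈ A' * v1
@assert v2 == [7.0, 7.0]
```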
coveralls commented 4 years ago

Coverage Status

Coverage decreased (-0.06%) to 81.132% when pulling 9561b8fab1e91f48fd1f2d291a56989e546a89d1 on kose-y:master into 41d0d323850c9c5d310ce05374b6bfd523355051 on OpenMendel:master.

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-0.06%) to 81.132% when pulling fe898785f357c5cd570d4bba82c452adf3c6a47f on kose-y:master into 41d0d323850c9c5d310ce05374b6bfd523355051 on OpenMendel:master.