Magalame commented 5 years ago

52

I had to update the version of stack, since chronos-bench requires a more recent cabal-install. I also added some weigh benchmarks, even though they weren't mentioned in the original issue. They suggest that multiply and qr create a lot of vectors during the process, hence the fairly large allocation and GC use.

Also, transpose allocates a full new matrix. I was contemplating Julia's approach (which I suspect is also LAPACK's, since it seems to be an O(1) operation there): transpose (Matrix v) actually returns Transpose (Matrix v), whose only difference from Matrix v is the indexing function: index (Transpose (Matrix v)) i j = index (Matrix v) j i
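To make the idea concrete, here is a minimal sketch of a transposed view. It assumes a simplified row-major Matrix backed by Data.Array, which is not the library's actual representation. The names Transpose, Indexable, transposeView and fromLists are hypothetical and chosen only for illustration.

```haskell
import Data.Array (Array, listArray, (!))

-- Hypothetical stand-in for the library's dense matrix (row-major storage).
data Matrix = Matrix
  { mRows :: Int
  , mCols :: Int
  , mData :: Array Int Double
  }

-- Hypothetical transposed view: wraps the matrix without copying its data.
newtype Transpose = Transpose Matrix

-- One indexing interface shared by both representations.
class Indexable m where
  dims  :: m -> (Int, Int)
  index :: m -> Int -> Int -> Double

instance Indexable Matrix where
  dims (Matrix r c _) = (r, c)
  index (Matrix _ c v) i j = v ! (i * c + j)

instance Indexable Transpose where
  -- Dimensions and indices are swapped; the underlying storage is untouched.
  dims (Transpose m) = let (r, c) = dims m in (c, r)
  index (Transpose m) i j = index m j i

-- O(1): only a constructor is applied, nothing is allocated per element.
transposeView :: Matrix -> Transpose
transposeView = Transpose

-- Helper to build a matrix from a list of equally long rows.
fromLists :: [[Double]] -> Matrix
fromLists xss = Matrix r c (listArray (0, r * c - 1) (concat xss))
  where
    r = length xss
    c = if null xss then 0 else length (head xss)
```

The trade-off is that every function consuming a matrix has to go through the shared indexing interface (or handle both constructors), and column-wise traversal of the view is no longer contiguous in memory.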

I was also wondering if there was a particular reason why -O2 isn't enabled?

Magalame commented 5 years ago

ping @ocramz

Magalame commented 5 years ago

Results with and without inlining (I kept multiply not inlined in both cases):

not inlined:

norm
  t=4.87(27)·10⁻⁵s σ=3.1% n=11,418
  ▀▀▀▀▀▀▀▀▀▀▀▀▀
row
  t=2.7(27)·10⁻⁹s σ=6.2·10% n=10,699,758

column
  t=1.92(42)·10⁻⁷s σ=1.2·10% n=553,808

multiplicationV
  t=1.84(14)·10⁻⁵s σ=4.4% n=22,668
  ▀▀▀▀▀
transpose
  t=1.069(94)·10⁻⁴s σ=5.2% n=6,996
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
identity
  t=2.70(44)·10⁻⁶s σ=9.3% n=87,962

diag
  t=2.66(42)·10⁻⁶s σ=9.4% n=95,454

generate
  t=7.3(10)·10⁻⁵s σ=8.0% n=8,746
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
"---Benchmarking heavy operations---"     
multiplication
  t=5.0(36)·10⁻³s σ=2.3·10% n=102
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
qr factorization
  t=3.93(39)·10⁻³s σ=3.4% n=161
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

inlined:

norm
  t=5.11(20)·10⁻⁵s σ=2.2% n=11,185
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀
row
  t=2.4(24)·10⁻⁹s σ=6.2·10% n=11,791,747

column
  t=1.80(47)·10⁻⁷s σ=1.4·10% n=540,610

multiplicationV
  t=5.09(23)·10⁻⁵s σ=2.6% n=11,231
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀
transpose
  t=1.071(75)·10⁻⁴s σ=3.9% n=6,478
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
identity
  t=2.74(85)·10⁻⁶s σ=1.6·10% n=79,938
  ▀
diag
  t=2.73(64)·10⁻⁶s σ=1.2·10% n=79,122

generate
  t=8.0(13)·10⁻⁵s σ=9.0% n=7,683
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
"---Benchmarking heavy operations---"     
multiplication
  t=5.60(21)·10⁻³s σ=1.2% n=102
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
qr factorization
  t=8.10(94)·10⁻³s σ=3.4% n=61
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

not inlined

Case              Allocated  GCs          
norm                     16    0          
row                      32    0          
column                  848    0          
multiplication    8,720,080    8          
multiplicationV         848    0          
qr factorization  8,721,032    8          
transpose            80,080    0          
identity             80,928    0          
diag                 80,080    0          
generate            880,080    0

inlined

Case               Allocated  GCs         
norm                      16    0         
row                       32    0         
column                   848    0         
multiplication       240,064    0         
multiplicationV          848    0         
qr factorization  35,603,832   33         
transpose             80,080    0         
identity              80,928    0         
diag                  80,080    0         
generate             880,080    0

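So there's a sweet spot between the two: with inlining, multiplication allocates far less (240,064 vs 8,720,080 bytes), but qr factorization allocates about four times more (35,603,832 vs 8,721,032 bytes) and takes roughly twice as long (8.10·10⁻³ s vs 3.93·10⁻³ s). There might be a memory vs runtime choice to make too.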
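For reference, inlining can be chosen per function with GHC pragmas, which is what makes a middle ground possible. The sketch below uses plain lists and hypothetical functions (dot, matVec). It is not the library's code, only an illustration of where the pragmas go.

```haskell
import Data.List (foldl')

-- Small, hot helper: a candidate for INLINE, so GHC can optimise it
-- together with each call site.
dot :: [Double] -> [Double] -> Double
dot xs ys = foldl' (+) 0 (zipWith (*) xs ys)
{-# INLINE dot #-}

-- Larger routine kept out of line, so its body is not duplicated
-- at every call site.
matVec :: [[Double]] -> [Double] -> [Double]
matVec rows v = map (`dot` v) rows
{-# NOINLINE matVec #-}
```

Which functions belong in which group is exactly what the benchmarks above would have to decide; the pragmas only state the choice.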

Magalame commented 5 years ago

Now the benchmarks should be able to run with stack build dense-linear-algebra:chronos-bench and stack build dense-linear-algebra:weigh-bench

Magalame commented 5 years ago

@ocramz I think I'll just leave the inlining untouched, and it can be decided in another PR, so that this one can be merged

DataHaskell / dh-core

Add chronos-bench and weigh benchmarks #56
