Closed Magalame closed 5 years ago
ping @ocramz
Results of inlined vs. non-inlined (I kept `multiply` not inlined in both cases):
not inlined:
norm
t=4.87(27)·10⁻⁵s σ=3.1% n=11,418
▀▀▀▀▀▀▀▀▀▀▀▀▀
row
t=2.7(27)·10⁻⁹s σ=6.2·10% n=10,699,758
column
t=1.92(42)·10⁻⁷s σ=1.2·10% n=553,808
multiplicationV
t=1.84(14)·10⁻⁵s σ=4.4% n=22,668
▀▀▀▀▀
transpose
t=1.069(94)·10⁻⁴s σ=5.2% n=6,996
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
identity
t=2.70(44)·10⁻⁶s σ=9.3% n=87,962
diag
t=2.66(42)·10⁻⁶s σ=9.4% n=95,454
generate
t=7.3(10)·10⁻⁵s σ=8.0% n=8,746
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
"---Benchmarking heavy operations---"
multiplication
t=5.0(36)·10⁻³s σ=2.3·10% n=102
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
qr factorization
t=3.93(39)·10⁻³s σ=3.4% n=161
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
inlined:
norm
t=5.11(20)·10⁻⁵s σ=2.2% n=11,185
▀▀▀▀▀▀▀▀▀▀▀▀▀▀
row
t=2.4(24)·10⁻⁹s σ=6.2·10% n=11,791,747
column
t=1.80(47)·10⁻⁷s σ=1.4·10% n=540,610
multiplicationV
t=5.09(23)·10⁻⁵s σ=2.6% n=11,231
▀▀▀▀▀▀▀▀▀▀▀▀▀▀
transpose
t=1.071(75)·10⁻⁴s σ=3.9% n=6,478
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
identity
t=2.74(85)·10⁻⁶s σ=1.6·10% n=79,938
▀
diag
t=2.73(64)·10⁻⁶s σ=1.2·10% n=79,122
generate
t=8.0(13)·10⁻⁵s σ=9.0% n=7,683
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
"---Benchmarking heavy operations---"
multiplication
t=5.60(21)·10⁻³s σ=1.2% n=102
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
qr factorization
t=8.10(94)·10⁻³s σ=3.4% n=61
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
not inlined:
Case               Allocated  GCs
norm                      16    0
row                       32    0
column                   848    0
multiplication     8,720,080    8
multiplicationV          848    0
qr factorization   8,721,032    8
transpose             80,080    0
identity              80,928    0
diag                  80,080    0
generate             880,080    0
inlined:
Case               Allocated  GCs
norm                      16    0
row                       32    0
column                   848    0
multiplication       240,064    0
multiplicationV          848    0
qr factorization  35,603,832   33
transpose             80,080    0
identity              80,928    0
diag                  80,080    0
generate             880,080    0
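So there's a sweet spot between the two: with inlining, the allocation for `multiplication` drops from 8,720,080 to 240,064, but `multiplicationV` and `qr factorization` get slower (1.84·10⁻⁵s to 5.09·10⁻⁵s and 3.93·10⁻³s to 8.10·10⁻³s) and `qr factorization` allocates roughly four times as much. There might be a memory vs. runtime trade-off to make too.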
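For readers unfamiliar with how inlining is toggled per function in GHC, here is a minimal sketch. The two `dot...` functions are hypothetical and are not taken from the library; they only show where the `INLINE` and `NOINLINE` pragmas go. Whether a given pragma helps is something only benchmarks like the ones above can tell.

```haskell
-- Hypothetical helpers for illustration only; not the library's code.

-- Ask GHC to inline this definition at its call sites.
dotInlined :: [Double] -> [Double] -> Double
dotInlined xs ys = sum (zipWith (*) xs ys)
{-# INLINE dotInlined #-}

-- Ask GHC never to inline this definition.
dotNotInlined :: [Double] -> [Double] -> Double
dotNotInlined xs ys = sum (zipWith (*) xs ys)
{-# NOINLINE dotNotInlined #-}

main :: IO ()
main = do
  let xs = [1, 2, 3]
      ys = [4, 5, 6]
  -- Both calls compute the same value; only the generated code differs.
  print (dotInlined xs ys)
  print (dotNotInlined xs ys)
```

The pragmas change how the code is compiled, not what it computes, so switching between them is a matter of measuring time and allocation for each function separately.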
Now the benchmarks should be able to run with `stack build dense-linear-algebra:chronos-bench` and `stack build dense-linear-algebra:weigh-bench`.
@ocramz I think I'll just leave the inlining untouched, and it can be decided in another PR, so that this one can be merged
I had to update the version of stack used, since `chronos-bench` requires a more recent `cabal-install`. I added some weigh benchmarks even though they weren't mentioned in the original issue. They seem to show that `multiply` and `qr` create a lot of vectors during the process, hence the fairly large allocation and GC use.

Also, `transpose` allocates a full new matrix, and I was contemplating Julia's approach (which I suspect to also be LAPACK's, since it seems to be an O(1) operation there): `transpose Matrix v` actually returns `Transpose (Matrix v)`, whose only difference with `Matrix v` is the indexing function: `index (Transpose (Matrix v)) i j = index (Matrix v) j i`
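To make the idea concrete, here is a minimal sketch of such a transpose wrapper. Everything in it is an assumption for illustration: `Matrix`, `Transpose`, `Indexable`, `fromRows` and `transposeView` are hypothetical names rather than the library's API, and storage uses `Data.Array` from the `array` package instead of the library's own vector type.

```haskell
import Data.Array (Array, listArray, (!))

-- Hypothetical stand-in for a dense matrix: rows, columns, row-major data.
data Matrix = Matrix !Int !Int !(Array Int Double)

-- A transposed view: wraps the original matrix without copying its data.
newtype Transpose = Transpose Matrix

-- Anything that has dimensions and can be indexed by (row, column).
class Indexable m where
  dims  :: m -> (Int, Int)
  index :: m -> Int -> Int -> Double

instance Indexable Matrix where
  dims (Matrix r c _) = (r, c)
  index (Matrix _ c d) i j = d ! (i * c + j)

-- The view differs from the matrix only by swapping the indices.
instance Indexable Transpose where
  dims (Transpose m) = let (r, c) = dims m in (c, r)
  index (Transpose m) i j = index m j i

-- Build a matrix from a list of equally long rows (not validated here).
fromRows :: [[Double]] -> Matrix
fromRows rs = Matrix r c (listArray (0, r * c - 1) (concat rs))
  where
    r = length rs
    c = case rs of
          []      -> 0
          (x : _) -> length x

-- O(1): only wraps the constructor, no new matrix is allocated.
transposeView :: Matrix -> Transpose
transposeView = Transpose

main :: IO ()
main = do
  let m = fromRows [[1, 2, 3], [4, 5, 6]]
      t = transposeView m
  print (dims t)       -- dimensions of the 2x3 matrix, swapped
  print (index t 2 0)  -- same element as index m 0 2
```

The trade-off with this design is that every function consuming a matrix has to go through the indexing class (or handle both types), and column-wise access through the view is strided in memory, so the copy may still win when the transposed matrix is traversed many times.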

I was also wondering if there was a particular reason why `-O2` isn't enabled?