DataHaskell / dh-core

Functional data science

Add `Fast` modules to DLA #59

Closed · Magalame closed this issue 5 years ago

Magalame commented 5 years ago

The current default modules in DLA are optimized for accuracy, using the Numeric.Sum module. This is good for certain numeric applications, but others would benefit more from a focus on performance.
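To make the trade-off concrete, here is a minimal sketch (the function names are illustrative, not necessarily what DLA exports): the accuracy-first version pays for compensated Kahan-Babuška-Neumaier summation from `Numeric.Sum`, while a plain fold is cheaper but accumulates more rounding error.

```haskell
import qualified Data.Vector.Unboxed as U
import Numeric.Sum (kbn, sumVector)

-- Accuracy-first norm: compensated summation bounds the rounding error.
normAccurate :: U.Vector Double -> Double
normAccurate = sqrt . sumVector kbn . U.map (\x -> x * x)

-- Performance-first norm: a plain fused fold, faster but less accurate
-- on ill-conditioned inputs.
normFast :: U.Vector Double -> Double
normFast = sqrt . U.sum . U.map (\x -> x * x)
```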

The new modules are Statistics.Matrix.Fast and Statistics.Matrix.Fast.Algorithms. Since doing away with the Sum module allows for more general modification, I used a more heavily imperative style, for finer control over allocation. However, a future PR should bring in a more functional style, with generalization added. In more detail:

For example, `norm . column $ m` would not fuse and would actually allocate the column; the solution adopted here was to manually fuse operations (sketched below). The problem is that situations like `norm . multiply m1 . multiply m2 $ m3` would still allocate two intermediate matrices. The goal would be to allow fusing of the whole operation, which would be pretty amazing. I wonder what the effect would be in the case of, say, matrix exponentiation.
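A minimal sketch of the manual fusion, assuming a row-major matrix like DLA's (the type, field names, and helpers below are illustrative, not the library's actual API):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

-- Illustrative row-major matrix; stands in for the DLA Matrix type.
data Matrix = Matrix
  { rows :: !Int
  , cols :: !Int
  , flat :: !(U.Vector Double)   -- row-major storage
  }

-- Unfused: `norm . column` materialises the column, allocating a vector
-- of length `rows` before reducing it.
normColumn :: Matrix -> Int -> Double
normColumn m j = sqrt . U.sum . U.map (\x -> x * x) $ column m j
  where
    column (Matrix r c v) k = U.generate r (\i -> v U.! (i * c + k))

-- Manually fused: index straight into the flat buffer and accumulate the
-- sum of squares in a strict loop, allocating nothing.
normColumnFused :: Matrix -> Int -> Double
normColumnFused (Matrix r c v) j = sqrt (go 0 0)
  where
    go !i !acc
      | i == r    = acc
      | otherwise = let x = v `U.unsafeIndex` (i * c + j)
                    in go (i + 1) (acc + x * x)
```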

The whole thing probably just boils down to using `Stream` and `New` properly enough to guarantee the stream fusion we want.

Magalame commented 5 years ago

From the sweet sweet benchmarks:

```
---Benchmarking light operations---
norm
  t=4.61(64)·10⁻⁷s σ=9.0% n=391,746

Fast.norm
  t=3.08(86)·10⁻⁷s σ=1.7·10% n=499,879

multiplyV
  t=4.63(67)·10⁻⁵s σ=9.3% n=15,052
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Fast.multiplyV
  t=1.76(47)·10⁻⁵s σ=1.7·10% n=29,158
  ▀▀▀▀▀▀
transpose
  t=9.7(14)·10⁻⁵s σ=9.4% n=9,371
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
ident
  t=2.39(41)·10⁻⁶s σ=1.0·10% n=120,027

diag
  t=2.41(48)·10⁻⁶s σ=1.2·10% n=118,830

---Benchmarking heavy operations---       
multiply
  t=4.53(57)·10⁻³s σ=3.6% n=64
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Fast.multiply
  t=2.08(40)·10⁻³s σ=5.6% n=128
  ▀▀▀▀▀▀▀▀▀▀▀
qr
  t=5.8(12)·10⁻³s σ=6.0% n=64
  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Fast.qr
  t=1.64(25)·10⁻³s σ=4.4% n=128
  ▀▀▀▀▀▀▀▀
---Benchmarking memory consumption---     

Case             Allocated  GCs           
norm                    16    0           
Fast.norm               16    0           
multiplyV              848    0           
Fast.multiplyV         848    0           
transpose           80,080    0           
ident               80,928    0           
diag                80,080    0           
multiply            80,080    0           
Fast.multiply       80,144    0           
qr              31,845,848   30           
Fast.qr            160,248    0
```
Magalame commented 5 years ago

Now I'm wondering: why is transpose so slow? Is it because of `U.generate`? More benchies to come!
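For reference, a plausible `U.generate`-based transpose looks like the sketch below (reusing the illustrative matrix type from the earlier comment; this is an assumption, not DLA's actual code). Every write is sequential, but every read strides by `cols`, so large matrices blow the cache, which could account for the slowness.

```haskell
import qualified Data.Vector.Unboxed as U

-- Illustrative row-major matrix, as in the earlier sketch.
data Matrix = Matrix { rows :: !Int, cols :: !Int, flat :: !(U.Vector Double) }

-- Hypothetical generate-based transpose: reads jump by `cols` elements
-- between consecutive writes, which is cache-unfriendly.
transposeGen :: Matrix -> Matrix
transposeGen (Matrix r c v) =
  Matrix c r $ U.generate (r * c) $ \k ->
    let (i, j) = k `quotRem` r    -- (row, col) in the transposed matrix
    in v U.! (j * c + i)          -- strided read from the original buffer
```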

Magalame commented 5 years ago

ping @ocramz I think I'll leave the transpose question for later; it might be more interesting to get this PR merged before moving on!

ocramz commented 5 years ago

Hi @Magalame, I'm a bit swamped at work these days, so I haven't had the time to look at this properly; give me a couple more days :)

Magalame commented 5 years ago

No problem, good luck!