Fusion can't always see through tuples

robeverest commented 9 years ago

The current fusion system doesn't always work very well in the presence of array level tuples. For example:

> import Data.Array.Accelerate as A
> let a = lift ( A.fill (index1 10) 1.0, A.fill (index1 10) 2.0 ) :: Acc (Vector Float, Vector Float)
> A.zipWith (+) (A.afst a) (A.asnd a)
let a0 = (generate (Z :. 10) (\x0 -> 1.0),generate (Z :. 10) (\x0 -> 2.0)) in
let a1 = #1 a0 in
let a2 = #0 a0
in generate (intersect (shape a1) (shape a2)) (\x0 -> (a1!x0) + (a2!x0))

We have two needlessly manifest arrays here. The current fusion system is not able to see that while a is referenced twice, each of its components are really only used once, so are able to be embedded and fused.

I think the simplest way to solve this, in a lot of cases anyway, is to have a pass before fusion that turns terms of the form

let t = (a, b, ...)
in ... #0 t ... #1 t ...

into

let t0 = a
    t1 = b
    ...
in ... t0 ... t1 ...

i.e.Let bound tuples are expanded into multiple bindings. I'm pretty certain this normal form has a name, but I can't think what it is (anyone know?).

This problem is of particular pain with the vectoriser, where we represent nested arrays as a tuple of segment descriptors and a vector of elements, so lots of stuff that could easily be fused isn't. For this reason, I was going to go ahead and implement this in the streaming branch, but I thought I should check with you @tmcdonell first because you might have had thoughts on this before?

(ping @fmma)

tmcdonell commented 9 years ago

I've been working on the simplifier recently, where similar problems happened at the scalar level. That looks to be working pretty well now though, so the ideas should (probably) transfer to the array level (although note that the simplifier is allowed to iterate, whereas the array level is single pass; this might be an issue).

I know what you are trying to do, but it is not a great idea. Consider:

let t = let expensive
        in (a, b ..)
in .. (#0 t) .. (#1 t) ..

Straightforwardly doing what you proposed above is going to duplicate the expensive part:

let t0 = let expensive in a
    t1 = let expensive in b
in .. t0 .. t1 ..

You can fix this by lifting out the expensive bit and sharing it, but note now the intermediate program is huge. BOOM. Anyway, eventually you could reduce down to a single spine of lets:

(By the way check the simplifier tick counts and iteration steps, which have increased drastically since the 0.15 branch. I am guessing it is because your changes to the substitution module prevented the rewrite rules, although that is admittedly fragile and shouldn't have been relied upon. Anyway, that's a different ticket on something else we should look at.)

let x = expensive
    t0 = a
    t1 = b

However, this is bad because it has implications for memory management. Previously, anything required by the expensive block to compute a and b was only alive in that small scope, but because we pulled it to the top level it is now alive forever. BOOM.

Anyway, yes, this is something that we should look at, but we have to be careful. Your proposed solution technically works, but is bad for various reasons.

tmcdonell commented 8 years ago

poke @robeverest so when is this going to land in master?

tmcdonell commented 7 years ago

We've talked about this in person a bit, but I'm not a fan of the current eltFlavour setup.

tmcdonell commented 6 years ago

I've been coping these patches into master (I don't know if you get notified if I tag you in the commit message @robeverest ?). See branch https://github.com/tmcdonell/accelerate/tree/wip/tuples.

Seems to be working well, all the examples are working. However there is currently a serious performance regression. The lulesh program with these patches (tmcdonell/accelerate@9b1cc6e4efe270eb948d4fae3fdc488c077d66ff):

[ 116.268] phase array-fusion: 116.238 s (wall), 368.978 s (cpu)
    121.9 GB allocated on the heap
    80.9 GB copied during GC (268 collections)
    MUT: 74.746 s (wall), 74.243 s (cpu)
    GC:  41.492 s (wall), 294.734 s (cpu)
[ 116.770] Total ticks: 35481

7574 Inline
  4267 prj/Tuple
  1870 prj/Var
  1271 prj/Let
  144 prj/Const
  22 Var
765 RuleFired
  154 shape/Z
  104 commutes *
  97 commutes +
  91 x+0
  88 mapD
  54 x*1
  40 generateD
  36 aletD/float
  22 aletD/bind
  18 -y+x
  14 aletD/eliminate
  11 intersect
  7 backpermuteD
  6 zipWithD
  5 transformD
  4 commutes ==
  4 commutes min
  4 x/1
  3 commutes max
  2 reshapeD
  1 elim/unzip
3 KnownBranch
  3 True
2410 BetaReduce
  1914 inline exp
  496 dead exp
23413 Substitution
  19040 weakenE
  2685 inline
  1280 shrink exp
  205 weaken
  123 compose
  38 constant fold
  20 replaceE/!
  19 replaceE/shape
  1 sink
  1 sink1
  1 substitute
1316 SimplifierDone
  1316

vs without (tmcdonell/accelerate@2571d72e7f987033f2bbddd4d24295f5a46257ce):

[   1.271] phase array-fusion: 1.175 s (wall), 1.218 s (cpu)
    1.4 GB allocated on the heap
    51.9 MB copied during GC (3 collections)
    MUT: 1.138 s (wall), 1.029 s (cpu)
    GC:  37.318 ms (wall), 189.097 ms (cpu)
[   1.388] Total ticks: 19312

1331 Inline
  658 prj/Var
  516 prj/Tuple
  144 prj/Const
  11 prj/Let
  2 Var
473 RuleFired
  104 commutes *
  97 commutes +
  91 x+0
  54 x*1
  21 aletD/float
  18 -y+x
  15 generateD
  14 unzipD
  8 aletD/bind
  7 mapD
  6 aletD/eliminate
  6 zipWithD
  5 intersect
  4 backpermuteD
  4 commutes ==
  4 commutes min
  4 shape/Z
  4 x/1
  3 commutes max
  2 reshapeD
  2 transformD
3 KnownBranch
  3 True
135 BetaReduce
  117 dead exp
  18 inline exp
17227 Substitution
  16875 weakenE
  137 inline
  128 shrink exp
  47 weaken
  17 compose
  14 constant fold
  4 replaceE/shape
  2 replaceE/!
  1 sink
  1 sink1
  1 substitute
143 SimplifierDone
  143

AccelerateHS / accelerate

Fusion can't always see through tuples #268