Open robeverest opened 9 years ago
I've been working on the simplifier recently, where similar problems happened at the scalar level. That looks to be working pretty well now though, so the ideas should (probably) transfer to the array level (although note that the simplifier is allowed to iterate, whereas the array level is single pass; this might be an issue).
I know what you are trying to do, but it is not a great idea. Consider:
let t = let expensive
in (a, b ..)
in .. (#0 t) .. (#1 t) ..
Straightforwardly doing what you proposed above is going to duplicate the expensive part:
let t0 = let expensive in a
t1 = let expensive in b
in .. t0 .. t1 ..
You can fix this by lifting out the expensive bit and sharing it, but note now the intermediate program is huge. BOOM. Anyway, eventually you could reduce down to a single spine of lets:
(By the way check the simplifier tick counts and iteration steps, which have increased drastically since the 0.15 branch. I am guessing it is because your changes to the substitution module prevented the rewrite rules, although that is admittedly fragile and shouldn't have been relied upon. Anyway, that's a different ticket on something else we should look at.)
let x = expensive
t0 = a
t1 = b
However, this is bad because it has implications for memory management. Previously, anything required by the expensive
block to compute a
and b
was only alive in that small scope, but because we pulled it to the top level it is now alive forever. BOOM.
Anyway, yes, this is something that we should look at, but we have to be careful. Your proposed solution technically works, but is bad for various reasons.
poke @robeverest so when is this going to land in master?
We've talked about this in person a bit, but I'm not a fan of the current eltFlavour
setup.
I've been coping these patches into master (I don't know if you get notified if I tag you in the commit message @robeverest ?). See branch https://github.com/tmcdonell/accelerate/tree/wip/tuples.
Seems to be working well, all the examples are working. However there is currently a serious performance regression. The lulesh program with these patches (tmcdonell/accelerate@9b1cc6e4efe270eb948d4fae3fdc488c077d66ff):
[ 116.268] phase array-fusion: 116.238 s (wall), 368.978 s (cpu)
121.9 GB allocated on the heap
80.9 GB copied during GC (268 collections)
MUT: 74.746 s (wall), 74.243 s (cpu)
GC: 41.492 s (wall), 294.734 s (cpu)
[ 116.770] Total ticks: 35481
7574 Inline
4267 prj/Tuple
1870 prj/Var
1271 prj/Let
144 prj/Const
22 Var
765 RuleFired
154 shape/Z
104 commutes *
97 commutes +
91 x+0
88 mapD
54 x*1
40 generateD
36 aletD/float
22 aletD/bind
18 -y+x
14 aletD/eliminate
11 intersect
7 backpermuteD
6 zipWithD
5 transformD
4 commutes ==
4 commutes min
4 x/1
3 commutes max
2 reshapeD
1 elim/unzip
3 KnownBranch
3 True
2410 BetaReduce
1914 inline exp
496 dead exp
23413 Substitution
19040 weakenE
2685 inline
1280 shrink exp
205 weaken
123 compose
38 constant fold
20 replaceE/!
19 replaceE/shape
1 sink
1 sink1
1 substitute
1316 SimplifierDone
1316
vs without (tmcdonell/accelerate@2571d72e7f987033f2bbddd4d24295f5a46257ce):
[ 1.271] phase array-fusion: 1.175 s (wall), 1.218 s (cpu)
1.4 GB allocated on the heap
51.9 MB copied during GC (3 collections)
MUT: 1.138 s (wall), 1.029 s (cpu)
GC: 37.318 ms (wall), 189.097 ms (cpu)
[ 1.388] Total ticks: 19312
1331 Inline
658 prj/Var
516 prj/Tuple
144 prj/Const
11 prj/Let
2 Var
473 RuleFired
104 commutes *
97 commutes +
91 x+0
54 x*1
21 aletD/float
18 -y+x
15 generateD
14 unzipD
8 aletD/bind
7 mapD
6 aletD/eliminate
6 zipWithD
5 intersect
4 backpermuteD
4 commutes ==
4 commutes min
4 shape/Z
4 x/1
3 commutes max
2 reshapeD
2 transformD
3 KnownBranch
3 True
135 BetaReduce
117 dead exp
18 inline exp
17227 Substitution
16875 weakenE
137 inline
128 shrink exp
47 weaken
17 compose
14 constant fold
4 replaceE/shape
2 replaceE/!
1 sink
1 sink1
1 substitute
143 SimplifierDone
143
The current fusion system doesn't always work very well in the presence of array level tuples. For example:
We have two needlessly manifest arrays here. The current fusion system is not able to see that while
a
is referenced twice, each of its components are really only used once, so are able to be embedded and fused.I think the simplest way to solve this, in a lot of cases anyway, is to have a pass before fusion that turns terms of the form
into
i.e.Let bound tuples are expanded into multiple bindings. I'm pretty certain this normal form has a name, but I can't think what it is (anyone know?).
This problem is of particular pain with the vectoriser, where we represent nested arrays as a tuple of segment descriptors and a vector of elements, so lots of stuff that could easily be fused isn't. For this reason, I was going to go ahead and implement this in the streaming branch, but I thought I should check with you @tmcdonell first because you might have had thoughts on this before?
(ping @fmma)