brian-team / brian2

Brian is a free, open source simulator for spiking neural networks.
http://briansimulator.org

Loop invariant optimisations #174

Closed thesamovar closed 9 years ago

thesamovar commented 10 years ago

We should add the ability to perform loop-invariant optimisations, since they can potentially give an enormous speed improvement for numpy. As long as we are smart about compiler flags (which means including -fno-math-errno), it doesn't make much difference for the other targets.

As an example, consider the expression that arises in the Jeffress example: a*sin(2.0*freq*pi*t) + b + v*exp(-dt/tau) + (-a*sin(2.0*freq*pi*t) - b)*exp(-dt/tau), where only a and v are arrays. Here's what the numpy-based state updaters are doing at the moment, essentially:

_v = a*sin(2.0*freq*pi*t) + b + v*exp(-dt/tau) + (-a*sin(2.0*freq*pi*t) - b)*exp(-dt/tau)
v[:] = _v

Compare that to this rearranged version, which removes the loop invariants in the smartest way I could think of using only numpy:

_sin_term = sin(2.0*freq*pi*t)   # scalar: computed once instead of once per element
_exp_term = exp(-dt/tau)         # scalar
_a_term = (_sin_term - _sin_term*_exp_term)
_v = v              # _v refers to the same array as v, so the in-place updates below write into v
_v *= _exp_term
_v += a*_a_term
_v += -b*_exp_term + b

Comparing the timings for these against cython and weave (see https://github.com/brian-team/brian2/blob/cython_codegen/dev/ideas/cython_playing_around.py) I get:

cython_modified_inline: 0.13
numpy: 3.11
numpy_smart: 0.70
weave_slow: 3.89
weave_fast: 0.14

So the 'smart' version of numpy is 4.4x faster than the version we have, and numpy goes from being 24x slower than weave/cython to being only 5x slower. The remaining slowness is probably largely due to the memory-bound operations, so trying numexpr seems like a good idea. Dumbly applying numexpr gives a slight speed improvement over numpy, but not as much as the smart pure-numpy version; using numexpr on the smart version gives a further speed improvement. Here's the smart numexpr code:

_sin_term = sin(2.0*freq*pi*t)
_exp_term = exp(-dt/tau)
_a_term = (_sin_term-_sin_term*_exp_term)
_const_term = -b*_exp_term + b
v[:] = numexpr.evaluate('a*_a_term+v*_exp_term+_const_term')
# it would be nice to do this instead, but it gives wrong results, although faster
#numexpr.evaluate('a*_a_term+v*_exp_term+_const_term', out=v)

Here are the times:

cython_modified_inline: 0.13
numpy: 3.13
numpy_smart: 0.70
numexpr: 1.31
numexpr_smart: 0.45
weave_slow: 3.88
weave_fast: 0.13

So now the smart version of numexpr is only 3.4x slower than weave/cython. If you use the out=v version (which gives wrong results, but maybe there is a way to make it work), then it is only 1.5x slower than weave/cython.

So my thoughts on how to implement this:

We already know which variables are arrays and which are values, so we can pull out any subexpressions involving only values and create temporary variables to replace them. We can also group things like x*a+x*b into x*(a+b) where x is an array. We might be able to use sympy to help with this?
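
For the grouping part, a minimal sketch of what sympy could do for us (the symbols here only stand in for an array x and scalar constants a and b):

import sympy

x, a, b = sympy.symbols('x a b')    # x stands for the array, a and b for scalar constants
expr = x*a + x*b                    # two vector multiplications plus a vector addition
grouped = sympy.collect(expr, x)    # -> x*(a + b): a single vector multiplication
print(grouped)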

thesamovar commented 10 years ago

I just upgraded to a version of numexpr in which the bug with out=v is fixed, and the smart version of numexpr is as fast as weave/cython! In other words, if we do this optimisation we can get state updaters that are as fast as weave without a compiler.
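
For reference, a minimal sketch of the out= form (the array sizes and scalar values here are made up for illustration):

import numpy as np
import numexpr

v = np.random.rand(1000000)
a = np.random.rand(1000000)
_a_term, _exp_term, _const_term = 0.05, 0.9, 0.01   # placeholder scalar values

# writing the result directly into v avoids allocating a temporary output array
numexpr.evaluate('a*_a_term + v*_exp_term + _const_term', out=v)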

mstimberg commented 10 years ago

Ok, looks like we have plenty of options for generating fast code. I'm not sure we should fiddle around with rearranging code a lot; we might introduce errors this way and get further away from a straightforward translation of abstract code into executed code. But maybe it is worth it. If we do it, I'd say we should let sympy do this job (maybe optionally for all state updaters, since it can be quite slow for complex expressions). E.g. for

v = a*sin(2.0*freq*pi*t) + b + v*exp(-dt/tau) + (-a*sin(2.0*freq*pi*t) - b)*exp(-dt/tau)

sympy's cse function suggests rewriting it as:

x0 = exp(-dt/tau)
x1 = a*sin(2.0*pi*freq*t)
x2 = b + x1
v = v*x0 - x0*x2 + x2

which looks reasonable, I guess.
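
For reference, a minimal sketch of how cse can be called here (the exact replacement list may differ between sympy versions):

import sympy
from sympy import exp, sin, pi

a, b, v, freq, t, dt, tau = sympy.symbols('a b v freq t dt tau')
expr = (a*sin(2.0*freq*pi*t) + b + v*exp(-dt/tau)
        + (-a*sin(2.0*freq*pi*t) - b)*exp(-dt/tau))
replacements, reduced = sympy.cse(expr)
# replacements is a list of (symbol, subexpression) pairs, e.g.
# [(x0, exp(-dt/tau)), (x1, a*sin(2.0*pi*freq*t)), (x2, b + x1)]
# reduced holds the rewritten expression(s), e.g. [v*x0 - x0*x2 + x2]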

Related to all the performance comparisons: I threw numba into the mix, simply using:

# _array_neurongroup_a, _array_neurongroup_v, N and the scalar constants are
# defined in the benchmark script (see cython_playing_around.py linked above)
@numba.autojit()
def timefunc_numba():
    for _idx in range(N):
        a = _array_neurongroup_a[_idx]
        v = _array_neurongroup_v[_idx]
        _v = a*sin(2.0*freq*pi*t) + b + v*exp(-dt/tau) + (-a*sin(2.0*freq*pi*t) - b)*exp(-dt/tau)
        v = _v
        _array_neurongroup_v[_idx] = v

and I get a very decent result:

Times
=====
cython_modified_inline: 0.23
numpy: 2.53
numpy_smart: 0.74
numpy_blocked: 0.81
numba: 0.33
numexpr: 1.30
numexpr_smart: 0.24
weave_slow: 5.26
weave_fast: 0.24

thesamovar commented 10 years ago

CSE is nice in that case, but doesn't take into account which operations are vector and which are single valued, which is the critical point for us. These things are optimisations so they're not the highest priority, but now that we know that it's possible to have pure numpy + numexpr running as fast as weave, it seems worth trying for.

Also very nice to see numba does well.

mstimberg commented 10 years ago

CSE is nice in that case, but doesn't take into account which operations are vector and which are single valued, which is the critical point for us.

Is it really the critical point? In the above examples, does it make any difference if freq is a per-neuron value or a scalar? It seems to me that the optimisation would be the same in both cases. It matters for weave, but there we can let the compiler do the work for us.

thesamovar commented 10 years ago

In this case it wouldn't make a difference, because the term is used twice and so it is pulled out as a common subexpression. But if it were only used once, it wouldn't be pulled out. For example, if a, b, c are constants and x is a vector and the user writes x*a*b*c, then this will give 3 vector operations in numpy, but it could be simplified to x*(a*b*c) to make it a single vector operation. CSE simplification won't do that because it doesn't care about vector operations, but we do.
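
To make this concrete, a minimal sketch (array length and constants are arbitrary):

import numpy as np

x = np.random.rand(1000000)   # the per-neuron vector
a, b, c = 1.1, 2.2, 3.3       # scalar constants

y_slow = x*a*b*c              # three full-length vector multiplications
y_fast = x*(a*b*c)            # scalars folded first: a single vector multiplication
assert np.allclose(y_slow, y_fast)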

mstimberg commented 10 years ago

Ok, I see. I still have the feeling that it shouldn't be Brian that does this kind of optimisation -- I guess numexpr/numba can handle this kind of optimisation?

thesamovar commented 10 years ago

From the times above, numba does it but numexpr doesn't. I agree it would be ideal if we didn't have to do this sort of thing in Brian, but on the other hand it would be nice to have it running fast without experimental packages like numba.

mstimberg commented 10 years ago

From the times above, numba does it but numexpr doesn't.

The above test also contains the common subexpressions, not only the associativity, though. My feeling is to stay with the straightforward, clear optimizations (e.g. pulling out constants, common subexpressions) that give a significant performance increase and are not too complicated to implement. And I'm not sure whether rewriting x*a*b*c fits into this category. Trying it with numexpr and a vector x of length 1000000, the difference is only 2.5ms vs. 2.4ms.
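
For reference, that comparison looks roughly like this (timing numbers will vary by machine):

import numpy as np
import numexpr
import timeit

x = np.random.rand(1000000)
params = {'x': x, 'a': 1.1, 'b': 2.2, 'c': 3.3}

# numexpr compiles the whole expression and makes a single pass over x either
# way, so folding the scalars barely changes the runtime
t_unfolded = timeit.timeit(lambda: numexpr.evaluate('x*a*b*c', local_dict=params), number=100)
t_folded = timeit.timeit(lambda: numexpr.evaluate('x*(a*b*c)', local_dict=params), number=100)
print(t_unfolded, t_folded)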

thesamovar commented 10 years ago

CSEs could actually slow things down with numexpr if they are vector-valued rather than scalar-valued, because they would imply additional memory reads/writes. But pulling out all function evaluations on scalar values and then applying numexpr might be enough?

thesamovar commented 10 years ago

I just added theano and it wasn't a great success. With freq, etc., as compile-time constants it manages to be about 2-3x slower than weave, but if they are declared as scalar values (as e.g. t would have to be), then the whole thing runs slower than naive numpy. Difficult to understand why/how.

mstimberg commented 10 years ago

I played around a bit with Theano and I'm seeing the same thing: it is very slow (on my machine, weave_slow is already twice as slow as numpy, and theano is even slower than that...). I wonder whether it has to do with unnecessary copying; I don't quite understand the details (http://deeplearning.net/software/theano/tutorial/aliasing.html), but using shared variables and sprinkling around some borrow=True did not help at all... Maybe it's time again for a mail to the mailing list to have some obvious mistake pointed out to us ;) ?

thesamovar commented 10 years ago

Sure - you want to write to them this time? ;)

mstimberg commented 10 years ago

Sure - you want to write to them this time? ;)

Hah, you gave this one away :) I reported it (https://groups.google.com/forum/#!topic/theano-users/Jj_CQT5J3og) and it turned out to be a genuine bug in theano... that was fixed and merged 5 hours after my report! With theano from master the run time is now halfway between weave_fast and numpy_smart, a quite reasonable time I think.

thesamovar commented 10 years ago

Impressively quick of them! In that case maybe it's worth looking into Theano a bit more, because you can do things like merge the computation of several lines. Did you try your modified versions with shared/borrowed memory etc. to reduce the time even further?

thesamovar commented 10 years ago

OK I just upgraded my gcc to 4.8 and my Theano to the latest git master and now I have:

cython_modified_inline: 0.13
numpy: 3.16
numpy_smart: 0.70
numpy_blocked: 0.82
numexpr: 1.03
numexpr_smart: 0.14
weave_slow: 3.88
weave_fast: 0.14
theano: 0.52

These times match what you said.

mstimberg commented 10 years ago

Did you try with your modified versions with shared/borrowed memory etc. to reduce the time even further?

I tried some things (a and v as shared variables instead of as parameters and using borrow=True) but they didn't have any noticeable effect on the runtime.

thesamovar commented 10 years ago

So I just started a bit of work on this in a new branch loop_invariant_optimisations and it turns out to be pretty simple. See brian2.parsing.loop_invariant_optimisations. I used the fact that at least to start with, we are not allowing writing to scalars, but it wouldn't be too hard to take this into account.

You just have a list of symbols that are either scalar variables or the names of functions which don't have state (so that given the same arguments they always return the same result). You go through the expression tree and any time you find a subtree that, when rendered, contains only scalar variables and stateless functions, you pull it out as a loop invariant. Done! There's also some very minimal common subexpression elimination, in that if you pull out two identical subtrees you only assign one new symbol to them. Done! :)
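
Roughly, the idea looks like this (a minimal sketch using Python's ast module, not the actual brian2.parsing.loop_invariant_optimisations code):

import ast

def is_invariant(node, scalars, stateless_funcs):
    # the subtree is loop-invariant if every name in it is a scalar variable
    # or the name of a stateless function
    for sub in ast.walk(node):
        if isinstance(sub, ast.Name) and sub.id not in scalars and sub.id not in stateless_funcs:
            return False
    return True

class LoopInvariantExtractor(ast.NodeTransformer):
    def __init__(self, scalars, stateless_funcs):
        self.scalars = scalars
        self.stateless_funcs = stateless_funcs
        self.invariants = {}   # rendered subtree -> temporary name (gives the minimal CSE)

    def visit(self, node):
        if isinstance(node, (ast.BinOp, ast.UnaryOp, ast.Call)) and \
                is_invariant(node, self.scalars, self.stateless_funcs):
            key = ast.dump(node)
            name = self.invariants.setdefault(key, '_lio_%d' % len(self.invariants))
            return ast.copy_location(ast.Name(id=name, ctx=ast.Load()), node)
        return self.generic_visit(node)

tree = ast.parse('a*sin(2.0*freq*pi*t) + v*exp(-dt/tau)', mode='eval')
extractor = LoopInvariantExtractor(scalars={'freq', 't', 'dt', 'tau', 'pi'},
                                   stateless_funcs={'sin', 'exp'})
tree = extractor.visit(tree)
# extractor.invariants now maps each hoisted subtree to a name like _lio_0;
# the caller would emit '_lio_0 = sin(2.0*freq*pi*t)' etc. before the loop.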

What still remains to be done is to integrate this into codegen. Getting the scalar variables is easy, but getting the list of stateless functions might require a few changes. I guess we should make it a new attribute of Function?

mstimberg commented 10 years ago

I guess we should make it a new attribute of Function?

This sounds reasonable, we would set it for all the predefined functions but for user-defined functions we'd default to "not stateless" to be on the safe side.
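
A hypothetical sketch of what that could look like (not the actual brian2 Function API, names are illustrative):

import numpy as np

class Function(object):
    # hypothetical attribute: only stateless functions may be hoisted out of the loop
    def __init__(self, pyfunc, stateless=False):
        self.pyfunc = pyfunc
        self.stateless = stateless

exp_func = Function(np.exp, stateless=True)   # predefined: same input -> same output
user_func = Function(lambda x: x)             # user-defined: assumed stateful by default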

mstimberg commented 10 years ago

So I just started a bit of work on this in a new branch loop_invariant_optimisations and it turns out to be pretty simple. See brian2.parsing.loop_invariant_optimisations. I used the fact that at least to start with, we are not allowing writing to scalars, but it wouldn't be too hard to take this into account.

I didn't look at your implementation yet, but if it is easier than expected, great :) I'd say I'll have a go at #205 again and find out how difficult this "two sections in templates" idea is. I now realized that my original idea that "a template that does not define a scalar block does not allow writing scalars" is not very useful, since we'll need a block for the loop-invariants nevertheless (e.g. in reset statements). Maybe simply have an ALLOWS_SCALAR_WRITE comment in the template?

From the abstract code side, we'd always pass a dictionary (something like code_lines_array, code_lines_scalar).

This all does sound very feasible and it should lead to quite understandable code. Since it will touch all the templates, I guess we should do it rather soon -- but maybe after an initial merge of #89 ...?

thesamovar commented 10 years ago

That all sounds good to me. If we're tying this in with #205, then maybe we should go ahead and release the next alpha sooner, since some of this stuff will take a little longer to implement. I think there's a lot of good stuff already done and it's worth having it released.

mstimberg commented 10 years ago

That all sounds good to me. If we're tying this in with #205, then maybe we should go ahead and release the next alpha sooner, since some of this stuff will take a little longer to implement.

I agree. Maybe we should slip in another alpha milestone :) One thing I'd be interested in fixing sooner rather than later (because it can actually lead to incorrect results) is #74 -- given that we already have code for the topological sorting of subexpressions in Equations._sort_static_equations, this might need only some refactoring to get working?

thesamovar commented 10 years ago

I'm happy to make another alpha. ;) You're right, though, that #74 is maybe more pressing and shouldn't be too difficult to fix, so it should be moved to the current alpha?

mstimberg commented 10 years ago

You're right, though, that #74 is maybe more pressing and shouldn't be too difficult to fix, so it should be moved to the current alpha?

Sounds reasonable, I moved it over.

mstimberg commented 9 years ago

We didn't do all the term rewriting you suggested earlier, but the bulk of this was pulled in via #392, so I'll close this issue. See #394, #395, #396 for possible future optimizations; everything else should be a new issue.

thesamovar commented 9 years ago

I think everything here was covered in other issues (theano, numba, etc.) except for numexpr, so I made a new issue for that. Now that we have the LIOs, numexpr could make numpy almost as fast as weave, so it could be worth putting this optimisation in.