brian-team / brian2

Brian is a free, open source simulator for spiking neural networks.
http://briansimulator.org

Improve state updater performance for numpy target #603

Open mstimberg opened 8 years ago

mstimberg commented 8 years ago

As can be seen in the benchmarks at https://github.com/brian-team/brian2/pull/602#issuecomment-160189920, exact integration with numpy is significantly slower in Brian 2 than in Brian 1. Quoting myself from that discussion:

I had a quick look at why the exact integration is so slow. There is of course still the issue that in Brian 1 we calculate everything with the numerical values of all constants, whereas in Brian 2 we do it symbolically (even a bit to the extreme: the generated code still contains mvolt, etc. -- this is nice for readability, but replacing it with the numerical values would probably be faster). But apart from that, we are generating very inefficient code for the state updater (it is about 25 times slower than the corresponding code in weave...). Here's what we do for updating v:

_v = (El + ((((_lio_3 * (((ge * (_lio_4 + exp(_lio_5 * int(not_refractory)))) * int(not_refractory)) * exp(_lio_6 * int(not_refractory)))) / (_lio_7 + (taue * int(not_refractory)))) + ((_lio_8 * (((gi * (_lio_9 + exp(_lio_5 * int(not_refractory)))) * int(not_refractory)) * exp(_lio_6 * int(not_refractory)))) / (_lio_7 + (taui * int(not_refractory))))) + (v * exp(_lio_6 * int(not_refractory))))) - (El * exp(_lio_6 * int(not_refractory)))

This evaluates not only int(not_refractory) but also more complex expressions like exp(_lio_6 * int(not_refractory)) several times (we don't have this issue in C, since we now handle boolean expressions differently). One relatively simple solution would be to use sympy's common subexpression elimination (CSE). I tried this out and it does not seem to be too computationally heavy (it adds 0.2s to the code generation process on my machine), and it speeds up the generated code considerably, by a factor of 2. The new code looks like this:

_cse_v0 = int(not_refractory)
_cse_v1 = exp(_cse_v0 * _lio_6)
_cse_v2 = exp(_cse_v0 * _lio_5)
_cse_v3 = _cse_v0 * _cse_v1
_v = (((((- El) * _cse_v1) + El) + (_cse_v1 * v)) + ((((_cse_v3 * _lio_3) * ge) * (_cse_v2 + _lio_4)) / ((_cse_v0 * taue) + _lio_7))) + ((((_cse_v3 * _lio_8) * gi) * (_cse_v2 + _lio_9)) / ((_cse_v0 * taui) + _lio_7))
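
For reference, a minimal sketch of what sympy's cse does; the expression below is a simplified stand-in, not the actual generated one:

from sympy import symbols, exp, cse

# simplified stand-in for the generated update expression above
v, El, ge, taue, _lio_6, _nr = symbols('v El ge taue _lio_6 _nr')
expr = El + v*exp(_lio_6*_nr) - El*exp(_lio_6*_nr) + ge*exp(_lio_6*_nr)/taue

# cse returns (replacements, reduced_exprs): repeated subexpressions are
# pulled out into fresh symbols (x0, x1, ...) and evaluated only once
replacements, reduced = cse(expr)
for sym, sub in replacements:
    print(sym, '=', sub)   # x0 = exp(_lio_6*_nr)
print(reduced[0])          # El - El*x0 + ge*x0/taue + v*x0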

Quoting Dan:

I think that's actually a regression in my code: when I improved LIO optimisation stuff in the cython performance branch, I also removed the optimisation that simplifies boolean expressions because it wasn't necessary in weave/cython - but that meant that numpy got slower. Could re-enable that, but CSE is also a good idea.
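
For illustration, a minimal sketch of that boolean simplification, assuming the flag only ever takes the values 0 and 1 (the names here are made up):

import numpy as np

# since b is boolean, exp(c*int(b)) takes only two values, so the
# elementwise exp can be replaced by selecting between two scalars
b = np.array([True, False, True])   # stand-in for not_refractory
c = -0.5                            # stand-in for _lio_6

naive = np.exp(c * b.astype(int))          # exp evaluated per element
simplified = np.where(b, np.exp(c), 1.0)   # exp evaluated once

assert np.allclose(naive, simplified)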

thesamovar commented 8 years ago

Another option is to do an if/then with numpy, i.e. instead of doing _v = int(not_refractory)*val1+(1-int(not_refractory))*val2 we could do _v=empty(n); _v[not_refractory]=...; _v[~not_refractory]=.... Also, it seems quite wasteful for numpy's int function to actually do anything when called on a boolean value/array, since it makes an unnecessary copy.
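
A minimal sketch of the two forms, with val1/val2 as stand-ins for the two update expressions (all names here are illustrative):

import numpy as np

n = 4
not_refractory = np.array([True, False, True, False])
val1 = np.full(n, 1.0)   # stand-in for the non-refractory update
val2 = np.full(n, 2.0)   # stand-in for the refractory update

# arithmetic form: both branches contribute to every element
v_arith = (not_refractory.astype(int)*val1
           + (1 - not_refractory.astype(int))*val2)

# masked form: each assignment only touches the elements it applies to
v = np.empty(n)
v[not_refractory] = val1[not_refractory]
v[~not_refractory] = val2[~not_refractory]

assert np.array_equal(v, v_arith)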

mstimberg commented 8 years ago

Another option is to do an if/then with numpy, i.e. instead of doing _v = int(not_refractory)*val1+(1-int(not_refractory))*val2 we could do _v=empty(n); _v[not_refractory]=...; _v[~not_refractory]=...

It sounds good, but I'd have to read through our full discussion at #156 to make sure this is not something we already thought about and discarded for a good reason :) Either way, I think this would be a bigger change, since we would no longer add the refractoriness information as *int(not_refractory) to the equations (doing that and then later pulling it out again seems a bit silly), so I'd rather say post-2.0?

Also, it seems quite wasteful for numpy's int function to actually do anything when called on a boolean value/array, since it makes an unnecessary copy.

That's true in most cases, but are we really sure that using True/False instead of 1/0 will never cause a problem? I can't see any right now (e.g. one potential problem is their use as indices, but for that we would have stored them in an integer array beforehand, which triggers the type cast), but we should be sure -- this could be a source of bugs that are difficult to track down.
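
As one concrete example of how booleans and 0/1 integers can diverge (a toy example, not from the generated code), the same values index arrays differently:

import numpy as np

a = np.array([10.0, 20.0, 30.0])
mask = np.array([True, False, True])

print(a[mask])              # boolean mask   -> [10. 30.]
print(a[mask.astype(int)])  # fancy indexing -> [20. 10. 20.]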

thesamovar commented 8 years ago

I tried replacing the int function with a function that just returns its argument, and it made no discernible difference to the speed, so I guess let's not do that.

Agreed that the if/then approach in numpy is probably better done post-2.0. Actually, maybe we shouldn't worry too much about numpy performance, since hopefully most people will be using a version with C++ compilation now. I might put back the boolean optimisation at least, since I think that would be quite easy to do.

thesamovar commented 8 years ago

We now get the following code in the state updater for CUBA:

_v = _numpy.where(not_refractory, _lio_11 + (((_lio_12 * ge) + (_lio_13 * gi)) + (_lio_14 * v)), _lio_10 + v)

So the CSE optimisation wouldn't make any difference any more, I think. Numpy's state updater is still very slow (1.4s) compared to B1, B2/weave (0.26s), or B2/cython (0.23s), but I think fixing the remaining inefficiencies will be complicated. Here's the full code generated for the CUBA state updater:

DEBUG    brian2.devices.device: neurongroup_stateupdater_codeobject* snippet (scalar):
    dt = _array_defaultclock_dt[0]
    _lio_1 = exp((- dt) / taue)
    _lio_2 = exp((- dt) / taui)
    _lio_3 = taue * exp((- dt) / taue)
    _lio_4 = - exp(dt / taue)
    _lio_5 = dt / taum
    _lio_6 = (- dt) / taum
    _lio_7 = 0.0 - taum
    _lio_8 = taui * exp((- dt) / taui)
    _lio_9 = - exp(dt / taui)
    _lio_10 = El - El
    _lio_11 = El - (El * exp(_lio_6))
    _lio_12 = ((_lio_3 * (_lio_4 + exp(_lio_5))) * exp(_lio_6)) / (_lio_7 + taue)
    _lio_13 = ((_lio_8 * (_lio_9 + exp(_lio_5))) * exp(_lio_6)) / (_lio_7 + taui)
    _lio_14 = exp(_lio_6)
DEBUG    brian2.devices.device: neurongroup_stateupdater_codeobject* snippet (vector):
    ge = _array_neurongroup_ge.copy()
    t = _array_defaultclock_t[0]
    v = _array_neurongroup_v.copy()
    lastspike = _array_neurongroup_lastspike
    gi = _array_neurongroup_gi.copy()
    not_refractory = _array_neurongroup_not_refractory.copy()
    not_refractory = (t - lastspike) > 0.005
    _ge = _lio_1 * ge
    _gi = _lio_2 * gi
    _v = _numpy.where(not_refractory, _lio_11 + (((_lio_12 * ge) + (_lio_13 * gi)) + (_lio_14 * v)), _lio_10 + v)
    ge = _ge
    gi = _gi
    v[not_refractory] = _v[not_refractory]
    _array_neurongroup_ge[:] = ge
    _array_neurongroup_not_refractory[:] = not_refractory
    _array_neurongroup_gi[:] = gi
    _array_neurongroup_v[:] = v

Shall we push this to post-2.0?

mstimberg commented 8 years ago

I'd be interested in how much of the difference is due to the symbolic vs. numerical approach, e.g. does the runtime change significantly when all the constants are replaced with their numerical values? We decided not to do this for weave/cython, so as not to trigger recompilation whenever a constant changes, but for numpy this is not an issue. In general, though, I think this is good enough for now; it is a fairly specific scenario, and most users should be using weave/cython/standalone anyway.
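
As a toy sketch of what that substitution would amount to (the symbol names and values here are illustrative, not taken from the actual code generation):

from sympy import symbols, exp

dt, taum, v = symbols('dt taum v')
expr = exp(-dt/taum) * v                  # symbolic update expression

# substituting numeric constants folds them at generation time, so the
# runtime code contains a literal instead of a name lookup
numeric = expr.subs({dt: 0.0001, taum: 0.02})
print(numeric)                            # 0.995012479192682*v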

thesamovar commented 8 years ago

I don't think it should make much difference. If you look at the final generated code, it does have a name lookup (e.g. _lio_12*ge rather than 0.34324*ge), but I would guess that this makes almost no difference here.

Ah -- or do you mean that the computation of the scalars on their own might be taking a substantial amount of time, not just the vector part? Maybe, actually. I'm curious enough that I want to find out now...

thesamovar commented 8 years ago

OK, so it takes 1.4s by default on my laptop for CUBA. Making it so that the LIO constants are only computed once reduces that to 1.0s. Removing the (in this case) unnecessary copies reduces it further to 0.93s. So it looks like it might be worth thinking, in the future, about evaluating the LIO constants only once where possible, but it doesn't look like this would lead to an orders-of-magnitude improvement.
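
A hypothetical sketch of what evaluating the LIO constants only once could look like (the cache and function names are made up, not Brian's actual internals):

import numpy as np

_lio_cache = {}

def lio_constants(dt, taum):
    # recompute the scalar constants only when dt or taum changes,
    # instead of re-evaluating the exp() calls on every time step
    key = (dt, taum)
    if key not in _lio_cache:
        _lio_cache[key] = {'_lio_6': -dt/taum,
                           '_lio_14': np.exp(-dt/taum)}
    return _lio_cache[key]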

thesamovar commented 8 years ago

Removed the 2.0 milestone btw - I think we were in agreement about this.

mstimberg commented 8 years ago

Yup, happy to look into this in more detail post 2.0.