denisalevi opened 8 years ago
Looking at some detailed benchmarking results, the STDP speed test with spikemonitor shows a weird performance drop at N = 10^5 neurons. And looking at the profiling measurements, the spikemonitor is taking up the most time. This is surely because the spikemonitor records all spikes from all those N neurons, but I still can't quite explain the drop.
The spikemonitor currently just loops through the spikespace and pushes spiking neuron IDs into a cudaVector. So maybe we just hit a point where we end up having to reallocate a huge cudaVector, and that is why we have that drop? Needs some more detailed profiling for N = 10^5 and maybe N = 5 x 10^5.
And then I just had a look at the spikemonitor and realized we are serializing the recording of spiking neuron IDs: the kernel is called with 1 thread and 1 block. Can't we just reserve enough memory in the cudaVectors (since we know the number of spiking neurons at each timestep anyways) and then just write the recorded variables in parallel (thread <-> spiking neuron)?
Profiling

Profiling the `IF_curve_LIF.py` and `IF_curve_Hodgkin_Huxley.py` examples from `brian2` shows that the performance bottleneck is the spikemonitor kernel. Below are profiling results (using `nvprof`) for three different implementation scenarios:

1. The unmodified spikemonitor (that assumes a parallel spikespace).
2. A quick and dirty change of the spikemonitor, adapted for the not-parallel spikespace (branch `issue10_sorted_spikemonitor`, 609005d).
3. The parallel spikespace (filled in parallel by the modified thresholder, using the `atomicAdd`s from issue #9) together with the spikemonitor from 1. (that assumes the parallel spikespace) (branch `issue9_spikespace`, eb215d4).

Some of the results using `N=10000` neurons in the `IF_curve_LIF.py` example (times are [average times per kernel call] and percentages are [time spent in that kernel (all kernel calls) / total time]):
That means:
My conclusions:
Let each thread write either `-1` (if its neuron didn't spike) or its `threadID` (if it spiked) to the spikespace at `idx = threadID`. This way the spikespace has a lot of `-1` values in between neuron ID values, but we don't need any atomics. Then use the `thrust::copy_if` function, which returns a vector of all non-`-1` values from the spikespace and the size of that vector (using some parallel GPU algorithm). And then just add that returned vector to the spikemonitor (which can then just be a `thrust::device_vector`). We should talk about this approach.

Below are the detailed profiling results:
Profiling of `IF_curve_Hodgkin_Huxley.py` for different numbers of neurons `N`, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):

Profiling of `IF_curve_LIF.py` for different numbers of neurons `N`, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):

After a quick and dirty change of the spikemonitor (adapted for the not-parallel spikespace) (2.):

Using the parallel spikespace (filled in parallel by the modified thresholder) and the parallel spikemonitor (3.):