brian-team / brian2cuda

A brian2 extension to simulate spiking neural networks on GPUs
https://brian2cuda.readthedocs.io/
GNU General Public License v3.0

Spikemonitor kernel is performance bottleneck for networks without synapses. #21

Open denisalevi opened 8 years ago

denisalevi commented 8 years ago

Profiling the IF_curve_LIF.py and IF_curve_Hodgkin_Huxley.py examples from brian2 shows that the performance bottleneck is the spikemonitor kernel.

Below are profiling results (using nvprof; an example invocation is sketched after the list) for three different implementation scenarios.

  1. The unchanged non-parallel spikespace, using the unchanged spikemonitor (which assumes a parallel spikespace) (dee9bf7)
  2. A quick and dirty modified spikemonitor that now assumes a non-parallel spikespace (and should now also return sorted data automatically) (branch issue10_sorted_spikemonitor, 609005d)
  3. The parallel spikespace (from the slower thresholder implementation with shared atomicAdds from issue #9), combined with the spikemonitor from 1. (which assumes a parallel spikespace) (branch issue9_spikespace, eb215d4)
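
For reference, a typical way to collect these numbers is to run nvprof on the generated standalone binary. The binary name main and the output directory are assumptions about the code-generation setup (they depend on the set_device() configuration) and may differ:

    # profile the generated standalone binary with nvprof
    # (output/main is an assumed path; adjust to your code-generation directory)
    cd output
    nvprof ./main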

Some of the results using N=10000 neurons in the IF_curve_LIF.py example:

Kernel               1.              2.              3.
run_spikemonitor     280 us (78%)    250 us (79%)    213 us (73%)
copy_spikemonitor    670 ms (19%)    540 ms (17%)    665 ms (23%)
thresholder          2.6 us (0.7%)   2 us (0.8%)     5.3 us (1.8%)

Times are average times per kernel call; percentages are the time spent in that kernel (summed over all its calls) as a fraction of the total GPU time.
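
As a sanity check of how these table values follow from the raw nvprof output, take run_spikemonitor in scenario 1 for LIF with N=10000 (full profile further below): it accumulates 2.79040 s over 10000 calls, out of roughly 3.57 s of total GPU time, so

$$ 2.79040\,\mathrm{s} \,/\, 10000\ \text{calls} \approx 279\,\mu\mathrm{s}\ \text{per call}, \qquad 2.79040\,\mathrm{s} \,/\, 3.57\,\mathrm{s} \approx 78\,\%. $$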

That means:

My conclusions:

Below are the detailed profiling results:

Profiling of IF_curve_Hodgkin_Huxley.py for different numbers of neurons N, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):

HODGKIN HUXLEY, N=100
==10548== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 52.48%  291.81ms     20000  14.590us  13.440us  17.024us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
 24.20%  134.56ms     20000  6.7270us  1.6640us  30.271ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 10.57%  58.748ms     60024     978ns     928ns  2.6880us  [CUDA memcpy HtoD]
  8.40%  46.715ms     20000  2.3350us  1.6960us  2.7840us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  4.16%  23.111ms         1  23.111ms  23.111ms  23.111ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  0.16%  878.66us         1  878.66us  878.66us  878.66us  _run_spikemonitor_codeobject_init(unsigned int)
  0.01%  80.896us         1  80.896us  80.896us  80.896us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.01%  77.632us        18  4.3120us  2.0800us  26.752us  [CUDA memcpy DtoH]
  0.00%  5.3120us         1  5.3120us  5.3120us  5.3120us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
  0.00%  3.8080us         1  3.8080us  3.8080us  3.8080us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, int*, double*)

HODGKIN HUXLEY, N=1000
==31099== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 57.69%  828.03ms     20000  41.401us  1.5680us  225.32ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 18.55%  266.23ms     20000  13.311us  13.088us  16.128us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, char*, double*, double*, double*, double*)
 16.26%  233.40ms         1  233.40ms  233.40ms  233.40ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  4.04%  57.938ms     60024     965ns     928ns  3.1680us  [CUDA memcpy HtoD]
  3.36%  48.259ms     20000  2.4120us  1.7280us  3.0720us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  0.06%  876.13us         1  876.13us  876.13us  876.13us  _run_spikemonitor_codeobject_init(unsigned int)
  0.03%  434.05us        18  24.113us  2.1120us  251.49us  [CUDA memcpy DtoH]
  0.01%  81.472us         1  81.472us  81.472us  81.472us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.00%  5.2480us         1  5.2480us  5.2480us  5.2480us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)

Profiling of IF_curve_LIF.py for different numbers of neurons N, using the unmodified spikemonitor (that assumes a parallel spikespace) (1.):

LIF, N=1000
==32636== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 58.81%  235.41ms     10000  23.540us  1.3760us  60.282ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 16.55%  66.264ms         1  66.264ms  66.264ms  66.264ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  8.37%  33.502ms     10000  3.3500us  3.1680us  4.3200us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  7.30%  29.222ms     30021     973ns     928ns  4.8640us  [CUDA memcpy HtoD]
  5.08%  20.352ms     10000  2.0350us  1.4400us  2.6240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  3.60%  14.400ms     10000  1.4400us  1.3120us  1.9200us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.22%  879.91us         1  879.91us  879.91us  879.91us  _run_spikemonitor_codeobject_init(unsigned int)
  0.04%  151.84us        15  10.122us  2.1120us  73.184us  [CUDA memcpy DtoH]
  0.02%  81.056us         1  81.056us  81.056us  81.056us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.00%  5.5360us         1  5.5360us  5.5360us  5.5360us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
  0.00%  3.3920us         1  3.3920us  3.3920us  3.3920us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)

LIF, N=10000
==1277== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 78.12%  2.79040s     10000  279.04us  1.2800us  901.28ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 18.71%  668.47ms         1  668.47ms  668.47ms  668.47ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  1.11%  39.806ms     10000  3.9800us  3.9040us  5.4400us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  0.82%  29.177ms     30021     971ns     928ns  28.000us  [CUDA memcpy HtoD]
  0.72%  25.781ms     10000  2.5780us  1.7600us  3.4240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  0.44%  15.810ms     10000  1.5810us  1.5040us  2.2400us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.05%  1.7333ms        15  115.56us  2.1120us  1.2212ms  [CUDA memcpy DtoH]
  0.02%  881.92us         1  881.92us  881.92us  881.92us  _run_spikemonitor_codeobject_init(unsigned int)
  0.00%  81.280us         1  81.280us  81.280us  81.280us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)

After a quick and dirty change of the spikemonitor (adapted to the non-parallel spikespace) (2.):

LIF, N=1000, SERIALIZED SPIKEMONITOR
==15351== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 58.65%  214.83ms     10000  21.482us  1.5360us  55.489ms  _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 14.67%  53.725ms         1  53.725ms  53.725ms  53.725ms  _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
  9.15%  33.501ms     10000  3.3500us  3.1360us  4.3520us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  7.98%  29.222ms     30021     973ns     928ns  7.1680us  [CUDA memcpy HtoD]
  5.56%  20.374ms     10000  2.0370us  1.4400us  2.6240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  3.91%  14.317ms     10000  1.4310us  1.3440us  1.9520us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.04%  151.36us        15  10.090us  2.1120us  73.280us  [CUDA memcpy DtoH]
  0.02%  77.632us         1  77.632us  77.632us  77.632us  _run_debugmsg_spikemonitor_codeobject_kernel(void)
  0.01%  53.280us         1  53.280us  53.280us  53.280us  _run_spikemonitor_codeobject_init(void)
  0.00%  3.2960us         1  3.2960us  3.2960us  3.2960us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)

LIF, N=10000, SERIALIZED SPIKEMONITOR
==11854== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 79.31%  2.51796s     10000  251.80us  1.5680us  827.75ms  _run_spikemonitor_codeobject_kernel(unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 17.06%  541.50ms         1  541.50ms  541.50ms  541.50ms  _copy_spikemonitor_codeobject_kernel(int*, double*, bool)
  1.28%  40.739ms     10000  4.0730us  3.9680us  5.3760us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  0.92%  29.247ms     30021     974ns     928ns  28.032us  [CUDA memcpy HtoD]
  0.81%  25.818ms     10000  2.5810us  1.6640us  3.4240us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  0.56%  17.714ms     10000  1.7710us  1.6000us  2.1440us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.05%  1.7398ms        15  115.99us  2.1120us  1.2275ms  [CUDA memcpy DtoH]
  0.00%  76.928us         1  76.928us  76.928us  76.928us  _run_debugmsg_spikemonitor_codeobject_kernel(void)
  0.00%  53.568us         1  53.568us  53.568us  53.568us  _run_spikemonitor_codeobject_init(void)

Using the parallel spikespace (filled in parallel by the modified thresholder) and the parallel spikemonitor (3.):

LIF, N=1000, PARALLEL THRESHOLDER
==16310== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 60.58%  268.46ms     10000  26.846us  1.3760us  15.050ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 15.26%  67.602ms         1  67.602ms  67.602ms  67.602ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  8.03%  35.578ms     10000  3.5570us  3.2960us  4.4160us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  6.64%  29.410ms     30021     979ns     928ns  3.5200us  [CUDA memcpy HtoD]
  5.62%  24.903ms     10000  2.4900us  1.7280us  5.1520us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  3.62%  16.056ms     10000  1.6050us  1.4080us  1.8880us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.20%  876.83us         1  876.83us  876.83us  876.83us  _run_spikemonitor_codeobject_init(unsigned int)
  0.03%  151.74us        15  10.116us  2.1120us  73.248us  [CUDA memcpy DtoH]
  0.02%  81.664us         1  81.664us  81.664us  81.664us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
  0.00%  5.6640us         1  5.6640us  5.6640us  5.6640us  _count_spikemonitor_codeobject_kernel(double*, int*, int*, int*, int, int*, double*, int, int*, int*, unsigned int, unsigned int*)
  0.00%  3.2640us         1  3.2640us  3.2640us  3.2640us  _kernel_neurongroup_group_variable_set_conditional_codeobject(unsigned int, unsigned int, double*, int*)

LIF, N=10000, PARALLEL THRESHOLDER
==16906== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 72.62%  2.13823s     10000  213.82us  1.3760us  112.15ms  _run_spikemonitor_codeobject_kernel(unsigned int, unsigned int, unsigned int, int*, double*, int*, int*, int*, int, int*, double*, int, int*, int*)
 22.60%  665.34ms         1  665.34ms  665.34ms  665.34ms  _copy_spikemonitor_codeobject_kernel(int*, double*, unsigned int)
  1.80%  53.087ms     10000  5.3080us  2.0160us  18.304us  kernel_neurongroup_thresholder_codeobject(unsigned int, int*, double*, double*, double*, char*)
  1.36%  39.906ms     10000  3.9900us  3.8720us  5.1520us  kernel_neurongroup_stateupdater_codeobject(unsigned int, double*, double*, double*, double*, double*, char*)
  0.99%  29.177ms     30021     971ns     928ns  28.128us  [CUDA memcpy HtoD]
  0.54%  15.887ms     10000  1.5880us  1.5360us  1.9840us  kernel_neurongroup_resetter_codeobject(unsigned int, int*, char*, double*)
  0.06%  1.7550ms        15  117.00us  2.1120us  1.2425ms  [CUDA memcpy DtoH]
  0.03%  879.14us         1  879.14us  879.14us  879.14us  _run_spikemonitor_codeobject_init(unsigned int)
  0.00%  81.632us         1  81.632us  81.632us  81.632us  _run_debugmsg_spikemonitor_codeobject_kernel(unsigned int)
denisalevi commented 7 years ago

Looking at some detailed benchmarking results, the STDP speed test with spikemonitor shows a weird performance drop at N = 10^5 neurons, and the profiling measurements show that the spikemonitor takes up most of the time. That is surely because the spikemonitor records all spikes from all N neurons, but I still can't quite explain the drop itself.

The spikemonitor currently just loops through the spikespace and pushes the spiking neuron IDs into a cudaVector. So maybe we just hit a point where we end up having to reallocate a huge cudaVector, and that is why we see the drop? This needs some more detailed profiling for N = 10^5 and maybe N = 5 x 10^5.
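
To illustrate what I mean by hitting a reallocation point, here is a toy sketch of a grow-by-doubling device vector (not the actual cudaVector code in brian2cuda): the push that crosses the current capacity has to allocate a new buffer and copy everything recorded so far, so that single timestep suddenly pays a cost proportional to the total number of recorded spikes.

    // Toy sketch (CUDA C++), not the real cudaVector: a grow-by-doubling vector
    // pushed from a single thread, as the current spikemonitor kernel runs with
    // 1 block / 1 thread. The push that triggers the reallocation copies the
    // whole buffer, which could explain an isolated, very expensive kernel call
    // (and thus a drop in the N = 10^5 benchmark).
    struct toy_device_vector
    {
        int* data;
        unsigned int size;
        unsigned int capacity;

        __device__ void push(int value)
        {
            if (size == capacity)
            {
                unsigned int new_capacity = (capacity == 0) ? 16 : 2 * capacity;
                // device-side heap allocation (malloc/free are available in device code)
                int* new_data = (int*)malloc(new_capacity * sizeof(int));
                for (unsigned int i = 0; i < size; i++)
                    new_data[i] = data[i];   // copy everything recorded so far
                free(data);
                data = new_data;
                capacity = new_capacity;
            }
            data[size++] = value;
        }
    };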

And then I just had a look at the spikemonitor and realized that we are serializing the recording of spiking neuron IDs: the kernel is called with 1 block and 1 thread. Can't we just reserve enough memory in the cudaVectors (since we know the number of spiking neurons at each timestep anyway) and then write the recorded variables in parallel (one thread per spiking neuron)?
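
Roughly what I have in mind is sketched below. This is not the actual generated code: the kernel name, parameter names, and the assumption that the per-timestep spike count and a compacted list of spiking neuron IDs are already available are all made up for illustration.

    // Sketch of a parallelized recording step: one thread per spiking neuron.
    // Assumes record_i / record_t already have room for num_spikes more entries
    // starting at offset (e.g. after resizing the cudaVectors beforehand), and
    // that spikespace[0 .. num_spikes-1] holds the IDs of the neurons that
    // spiked in this timestep.
    __global__ void record_spikes_kernel(
        int* spikespace,        // compacted spiking neuron IDs for this timestep
        unsigned int num_spikes,
        int* record_i,          // recorded neuron indices
        double* record_t,       // recorded spike times
        unsigned int offset,    // number of entries recorded in earlier timesteps
        double t)               // current simulation time
    {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= num_spikes)
            return;
        // thread <-> spiking neuron: each thread writes one (i, t) pair
        record_i[offset + tid] = spikespace[tid];
        record_t[offset + tid] = t;
    }

    // launched e.g. as: record_spikes_kernel<<<(num_spikes + 127) / 128, 128>>>(...);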