brian-team / brian2cuda

A brian2 extension to simulate spiking neural networks on GPUs
https://brian2cuda.readthedocs.io/
GNU General Public License v3.0

Investigate occupancy limitation / calculation on MX150 GPU. #208

Open denisalevi opened 3 years ago

denisalevi commented 3 years ago

For the following example, the stateupdater doesn't achieve full occupancy on my laptop GPU (MX150). Why? Is this a GPU resource limitation, or is something going wrong in the occupancy calculation?

from brian2 import *

import brian2cuda                # These two lines suffice
set_device('cuda_standalone')    # to run brian2 on a GPU

# Parameters
N = 5000         ; duration = 0.1*second   ; V_r = 10*mV
theta = 20*mV    ; tau = 20*ms             ; delta = 2*ms
tau_ref = 2*ms   ; C = 1000                ; J = 0.1*mV
mu_ext = 25*mV   ; sigma_ext = 1*mV

# Network of N noise-driven leaky integrate-and-fire neurons
model = """
dV/dt = (-V + mu_ext) / tau + sigma_ext / sqrt(tau) * xi : volt
"""
neurons = NeuronGroup(N,
                      model,
                      threshold='V>theta',
                      reset='V=V_r',
                      refractory=tau_ref,
                      method='euler')

# Initialize membrane potential
neurons.V = V_r

run(duration)

This gives

INFO kernel_neurongroup_stateupdater_codeobject
        7 blocks
        768 threads
        36 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        0.750 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_thresholder_codeobject
        5 blocks
        1024 threads
        16 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_resetter_codeobject
        5 blocks
        1024 threads
        14 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)

Why do we use 7 blocks for the stateupdater? How do we get 100% occupancy with only 5 blocks for the thresholder and resetter if the occupancy calculation says that we need 6 blocks?

To get the (need 6 blocks for 1.000), I printed the min_num_threads variables (which should be called min_num_blocks...).

denisalevi commented 2 years ago

See my explanations in #266. The stateupdater kernel uses 36 registers per thread, so we can't run 2048 threads per SM because of the registers-per-SM limit (that would require at most 32 registers per thread). Hence we use fewer than 1024 threads per block, leading to lower theoretical occupancy.
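A rough sketch of that arithmetic, for illustration only. The hardware limits used as defaults (65536 registers per SM, 2048 resident threads per SM, warp size 32, registers allocated per warp in units of 256) are assumptions for the MX150's compute capability and are not taken from the brian2cuda output; the function name `theoretical_occupancy` is hypothetical, and the model ignores shared memory and the maximum number of resident blocks per SM.

```python
import math

def theoretical_occupancy(threads_per_block, regs_per_thread,
                          regs_per_sm=65536,        # assumed hardware limit
                          max_threads_per_sm=2048,  # assumed hardware limit
                          warp_size=32,
                          reg_alloc_unit=256):      # assumed allocation granularity
    """Return (resident blocks per SM, theoretical occupancy per SM)."""
    warps_per_block = math.ceil(threads_per_block / warp_size)
    # Registers are assumed to be allocated per warp, rounded up to the unit size.
    regs_per_warp = math.ceil(regs_per_thread * warp_size / reg_alloc_unit) * reg_alloc_unit
    regs_per_block = warps_per_block * regs_per_warp
    # The tighter of the register limit and the thread limit decides.
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_threads = max_threads_per_sm // threads_per_block
    blocks_per_sm = min(blocks_by_regs, blocks_by_threads)
    max_warps_per_sm = max_threads_per_sm // warp_size
    return blocks_per_sm, blocks_per_sm * warps_per_block / max_warps_per_sm

print(theoretical_occupancy(1024, 36))  # stateupdater at 1024 threads: (1, 0.5)
print(theoretical_occupancy(768, 36))   # stateupdater at 768 threads:  (2, 0.75)
print(theoretical_occupancy(1024, 16))  # thresholder:                  (2, 1.0)
```

Under these assumptions, 36 registers per thread allow only one 1024-thread block per SM (occupancy 0.5), whereas two 768-thread blocks fit (occupancy 0.75), which matches the reported value.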

The occupancy value is a theoretical occupancy per SM, so it is entirely independent of the number of blocks. But to actually keep all SMs fully busy, one would need 6 blocks here (since the MX150 has 3 SMs that can each run 2 blocks).
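The block counts in the log can be reproduced with a small sketch, assuming one thread per neuron (an assumption about how the kernels are launched, not something confirmed in this thread). The helper names are hypothetical.

```python
import math

def blocks_launched(num_neurons, threads_per_block):
    # Enough blocks to cover every neuron with one thread each (assumed mapping).
    return math.ceil(num_neurons / threads_per_block)

def blocks_to_fill_gpu(num_sms, blocks_per_sm):
    # Blocks that must be resident at once so that every SM is fully loaded.
    return num_sms * blocks_per_sm

N = 5000
print(blocks_launched(N, 768))    # stateupdater: 7
print(blocks_launched(N, 1024))   # thresholder and resetter: 5
print(blocks_to_fill_gpu(3, 2))   # MX150 with 3 SMs, 2 blocks each: 6
```

So the 5 blocks of the thresholder and resetter reach 100% theoretical occupancy per SM, but they cannot load all three SMs with two blocks each.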

TODO: Modify the info message to say "theoretical occupancy per SM", to make this distinction clearer.