Investigate occupancy limitation / calculation on MX150 GPU.

For the following example, the stateupdater kernel doesn't reach full theoretical occupancy on my laptop GPU (MX150). Why? Is this a GPU resource limitation, or is something going wrong in the occupancy calculation?

from brian2 import *

import brian2cuda                # These two lines suffice
set_device('cuda_standalone')    # to run brian2 on a GPU

# Parameters
N = 5000         ; duration = 0.1*second   ; V_r = 10*mV
theta = 20*mV    ; tau = 20*ms             ; delta = 2*ms
tau_ref = 2*ms   ; C = 1000                ; J = 0.1*mV
mu_ext = 25*mV   ; sigma_ext = 1*mV

# Network of N noise-driven leaky integrate-and-fire neurons
model = """
dV/dt = (-V + mu_ext) / tau + sigma_ext / sqrt(tau) * xi : volt
"""
neurons = NeuronGroup(N,
                      model,
                      threshold='V>theta',
                      reset='V=V_r',
                      refractory=tau_ref,
                      method='euler')

# Initialize membrane potential
neurons.V = V_r

run(duration)

This gives

INFO kernel_neurongroup_stateupdater_codeobject
        7 blocks
        768 threads
        36 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        0.750 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_thresholder_codeobject
        5 blocks
        1024 threads
        16 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_resetter_codeobject
        5 blocks
        1024 threads
        14 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)

Why do we use 7 blocks for the stateupdater? And how do we get 100% occupancy with only 5 blocks for the thresholder and resetter if the occupancy calculation says that we need 6 blocks?
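As a rough cross-check of the numbers in the log, here is a minimal sketch of the block-count and per-SM occupancy arithmetic. The per-SM limits (2048 threads, 64 warps, 65536 registers, 256-register allocation granularity per warp) are assumptions for a compute capability 6.1 device and are not taken from the log. The register figure printed in the log is treated here as registers per thread, which is also an assumption. The helper names are hypothetical, and the model is simplified compared to the CUDA occupancy calculator (it ignores shared memory and the per-SM block limit).

```python
import math

# Values taken from the log above.
N = 5000
KERNELS = {
    "stateupdater": {"threads": 768, "regs": 36},
    "thresholder": {"threads": 1024, "regs": 16},
    "resetter": {"threads": 1024, "regs": 14},
}

# ASSUMED per-SM limits for compute capability 6.1 (not from the log).
MAX_THREADS_PER_SM = 2048
MAX_WARPS_PER_SM = 64
REGS_PER_SM = 65536
WARP_SIZE = 32
REG_ALLOC_UNIT = 256  # assumed register allocation granularity per warp


def launched_blocks(n, threads_per_block):
    """Blocks needed so that every one of the n neurons gets a thread."""
    return math.ceil(n / threads_per_block)


def resident_blocks_per_sm(threads_per_block, regs_per_thread):
    """Blocks that fit on one SM, limited by threads and by registers."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Registers are assumed to be allocated per warp, rounded up to the unit.
    regs_per_warp = (
        math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    )
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    by_regs = REGS_PER_SM // (regs_per_warp * warps_per_block)
    return min(by_threads, by_regs)


def theoretical_occupancy(threads_per_block, regs_per_thread):
    """Resident warps per SM divided by the maximum warps per SM."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    blocks = resident_blocks_per_sm(threads_per_block, regs_per_thread)
    return blocks * warps_per_block / MAX_WARPS_PER_SM


for name, k in KERNELS.items():
    print(
        name,
        launched_blocks(N, k["threads"]),
        theoretical_occupancy(k["threads"], k["regs"]),
    )

# Largest register count per thread that still allows 2048 resident threads.
print("max regs/thread for full occupancy:", REGS_PER_SM // MAX_THREADS_PER_SM)
```

Under these assumptions the sketch reproduces the 7/5/5 blocks and the 0.750/1.000/1.000 occupancies from the log. It also suggests a register limit: 36 registers per thread is above the 32 that would allow 2048 resident threads per SM, so a smaller block size (768) is chosen and only two such blocks fit on one SM.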

To get the "(need 6 blocks for 1.000)" part of the output, I printed the min_num_threads variables (which should really be called min_num_blocks, since they hold a number of blocks, not threads).
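One possible reading of the "need 6 blocks" value is sketched below. It is the number of blocks needed to keep every SM of the device fully populated, which is a different quantity from the theoretical occupancy (a per-SM figure that depends on block size and resource usage, not on how many blocks are launched). The SM count of 3 for the MX150 and the limit of 2048 resident threads per SM are assumptions. Whether min_num_threads is actually computed this way in brian2cuda is not confirmed here.

```python
import math

# ASSUMED device properties (not taken from the log).
NUM_SMS = 3                 # assumed SM count of the MX150
MAX_THREADS_PER_SM = 2048   # assumed limit for compute capability 6.1

# Values taken from the log above.
N = 5000
THREADS_PER_BLOCK = {"stateupdater": 768, "thresholder": 1024, "resetter": 1024}


def blocks_to_fill_device(threads_per_block):
    """Blocks needed so that each SM holds as many blocks as the thread limit allows."""
    blocks_per_sm = MAX_THREADS_PER_SM // threads_per_block
    return NUM_SMS * blocks_per_sm


for name, threads in THREADS_PER_BLOCK.items():
    launched = math.ceil(N / threads)
    needed = blocks_to_fill_device(threads)
    status = "fills" if launched >= needed else "does not fill"
    print(f"{name}: launches {launched} blocks, {needed} needed, {status} all SMs")
```

With these assumed numbers, both block sizes give 6 blocks. That would explain how the thresholder and resetter can report 1.000 theoretical occupancy while launching only 5 blocks: the kernel configuration allows full occupancy per SM, but N = 5000 is too small to supply enough blocks to populate all SMs.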

brian-team / brian2cuda

Investigate occupancy limitation / calculation on MX150 GPU. #208