brian-team / brian2cuda

A brian2 extension to simulate spiking neural networks on GPUs
https://brian2cuda.readthedocs.io/
GNU General Public License v3.0

Summed variables: Calculate partial sums in parallel in shared memory instead of using global atomicAdd #197

Open denisalevi opened 3 years ago

denisalevi commented 3 years ago

PR #186 implements summed variables by parallelizing over synapses (one thread per synapse in a Synapses object) and computes the summed variable using global atomicAdd on the postsynaptic variable. This creates write conflicts between synapses that target the same postsynaptic neuron, and the global memory reads/writes are not coalesced.
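For reference, the PR #186 strategy boils down to something like the kernel below. This is a simplified sketch with made-up identifiers, not the actual generated brian2cuda code:

```cuda
// Sketch of the current approach: one thread per synapse, accumulation via
// global atomicAdd. All names here are illustrative.
__global__ void summed_variable_atomic(const double* syn_value,  // per-synapse summand
                                       const int*    post_id,    // postsynaptic target of each synapse
                                       double*       post_sum,   // per-neuron summed variable, pre-zeroed
                                       int           n_synapses)
{
    int syn = blockIdx.x * blockDim.x + threadIdx.x;
    if (syn < n_synapses)
    {
        // Synapses targeting the same postsynaptic neuron serialize on this
        // atomic, and because synapses are not sorted by post_id the
        // post_sum accesses are scattered (uncoalesced).
        atomicAdd(&post_sum[post_id[syn]], syn_value[syn]);
    }
}
```

(Note that atomicAdd on double requires compute capability 6.0 or higher; older devices need a CAS-based fallback.)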

Alternatively, we could do this:

An additional, more sophisticated alternative algorithm would be parallelizing over synapses, where synapses are guaranteed to be ordered by post_id in the first place, e.g. one block per post_id with an "outer loop" -- not far from the algorithm you described in the cartoon. This would allow defining a synapse start and end index for each post_id and calculating the summed variable within shared memory. Furthermore, the post_id does not need to be loaded from memory, as it is implicitly known from the block id. Couldn't we generally order synapses by postsynaptic indices after creation, if that is not always done already? (I assume the synapse ID is nowhere explicitly used in any modeling step, or is this wrong?) Originally posted by @moritzaugustin in https://github.com/brian-team/brian2cuda/issues/49#issuecomment-687567432

For which we should consider this:

I like this idea. But I think this might easily result in many very small blocks, which could kill performance due to the maximum number of blocks that can run on an SM at the same time. E.g. if 1e4 pre neurons connect to 1e4 post neurons with a connection probability of 0.01, we have ~ 1e4 * 1e4 * 0.01 = 1e6 synapses, which is only 1e6 / 1e4 = 100 synapses per post neuron. That would mean launching 1e4 blocks with 100 threads each. One could get around that by not limiting a block to one post neuron, but instead filling each block with as many post neurons as fit (without spreading a post neuron across multiple blocks). Originally posted by @denisalevi in https://github.com/brian-team/brian2cuda/issues/49#issuecomment-687619779

For the full discussion, see #49.
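To make the proposed alternative concrete, here is a minimal sketch assuming synapses are sorted by post_id and that per-neuron start/end indices are available. All names are placeholders, and the packing of several post neurons into one block (discussed above) is not shown:

```cuda
// One block per postsynaptic neuron; partial sums are reduced in shared
// memory, so only a single non-atomic global write per neuron remains.
// Launch e.g. as: summed_variable_shared<<<n_post, 128, 128 * sizeof(double)>>>(...)
// (blockDim.x must be a power of two for this simple tree reduction).
__global__ void summed_variable_shared(const double* syn_value,  // per-synapse summand, sorted by post_id
                                       const int*    syn_start,  // first synapse index of each post neuron
                                       const int*    syn_end,    // one-past-last synapse index
                                       double*       post_sum)
{
    extern __shared__ double partial[];       // one slot per thread

    const int post  = blockIdx.x;             // post_id is implicit in the block index
    const int start = syn_start[post];
    const int end   = syn_end[post];

    // Strided loop over this neuron's synapses; since they are contiguous
    // after sorting, the reads are coalesced.
    double local = 0.0;
    for (int s = start + threadIdx.x; s < end; s += blockDim.x)
        local += syn_value[s];
    partial[threadIdx.x] = local;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        post_sum[post] = partial[0];
}
```

With 1e4 post neurons this launches 1e4 blocks of roughly 100 synapses each, which is exactly the small-block concern raised above; a production version would pack multiple post neurons into one block.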

denisalevi commented 3 years ago

In PR #186 I mentioned benchmarking three different implementations (mostly cudaMemset vs. thrust::fill for the variable resets). Not sure if that is worth it, but if the current implementation turns out to suffer from the cudaMemset, one could profile those implementations against each other.
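For completeness, the two reset variants look roughly like the following (buffer name and size are placeholders): cudaMemset writes a byte pattern, which is only suitable for zeroing, while thrust::fill launches a small kernel and works for arbitrary fill values.

```cuda
#include <cuda_runtime.h>
#include <thrust/fill.h>
#include <thrust/execution_policy.h>

// Reset the per-neuron summed-variable buffer before accumulation.
// In practice one would use either variant, not both.
void reset_summed_variable(double* d_post_sum, int n_post)
{
    // Variant 1: byte-wise zeroing; all-zero bytes are 0.0 for IEEE doubles.
    cudaMemset(d_post_sum, 0, n_post * sizeof(double));

    // Variant 2: thrust::fill with an explicit value.
    thrust::fill(thrust::device, d_post_sum, d_post_sum + n_post, 0.0);
}
```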