Consider partitioning eventspaces when using `Subgroup`s

denisalevi commented 2 years ago

Using Subgroups in Brian2CUDA can be rather costly because we need to check for each neuron in the eventspace if it belongs to the subgroup during spike propagation / synaptic effect application / spike and rate monitoring (see #283). I'm wondering if we could get around all those issues by partitioning eventspaces based on subgroups with additional counters per subgroup. This could e.g. be achieved by splitting CUDA blocks in the thresholder across subgroups. Synapse computations could then be performed only based a subgroup's partition. Spike recorders could directly copy the spikes from a single partition, and rate recorders would have access to the number of spikes in their partition.

This should probably be optional per NeuronGroup, something like a partition_by_subgroups flag. Because while it might make subgroup computations faster, it would likely slow down computations on the full NeuronGroup:

For no or homogeneous delays, we call the effect application kernel with as many blocks as there are spiking neurons and each block reads one spiking neuron ID. If we partition the evenstpaces, we would have to read all counters to determine the location of a blocks neuron ID. Would require some benchmarking to find out if this makes sense. The memory reads would be broadcast to all threads, but still for many subgroups one would have to perform as many global memory reads.
We wouldn't be able to speed up the resetter as suggested in #272
Spikemonitors would have to first get rid of all the -1 values in the eventspace (or we would have to perform costly copying around of the eventspace during thresholding to remove all -1 then).

Alternatively, one could just encourage using multiple NeuronGroups instead of Subgroups in Brian2CUDA. If concurrent kernel execution is implemented, that might be just the easier way and would just produce exactly the partitioning I'm talking about. For too many NeuronGroups though, this might increase compile times significantly as long as we keep one source file per codeobject. This would be especially relevant for denisalevi/brian2-network-multiplier.

mstimberg commented 2 years ago

Is this really much of a problem in practice? Subgroups only affect SpikeMonitor and PopulationRateMonitor, things like spike propagation / synaptic effect application etc. are unaffected. Subgroups are a convenience tool for the user, the stored indices are always the indices of the underlying full group. The synaptic update code that runs in the examples below would be exactly the same, for example:

G1 = NeuronGroup(10, 'v : 1', threshold='True')
S1 = Synapses(G1, G1, on_pre='v_post += 1', name='no_subgroups')
S1.connect(i=[3, 4, 5], j=[3, 4, 5])

G2 = NeuronGroup(10, 'v : 1', threshold='True')
S2 = Synapses(G2[3:6], G2[3:6], on_pre='v_post += 1', name='subgroups')
S2.connect(i=[0, 1, 2], j=[0, 1, 2])

The only exception would be code that directly refers to the indices i and j – these will have an additional addition/subtraction when using subgroups that do not start at index 0.

denisalevi commented 2 years ago

It also effect spike propagation for heterogeneous delays and effect application for no or homogeneous delays in Brian2CUDA. We have one CUDA block per spiking neuron (and per connectivity matrix partition), because everything happens in parallel over all spiking neurons. Therefore, the number of CUDA blocks in the corresponding kernel depend on the number of spikes in the source group, which we read from the eventspace before launching the kernel. For subgroups, we call as many blocks (times connectivity matrix partitions) as there are spikes in the entire eventspace, and each CUDA block that got a spiking neuron that is not part of the subgroup does nothing.

This and the monitors can definitely have an effect on performance. But you are right, I haven't benchmarked this in detail. But seeing that the monitors on non-subgroups are already relatively expensive in Brian2CUDA, it might be worth keeping this in mind.

I came across this mostly in the context of brian2-network-multiplier, where depending on the implementation, I would expect this to become relevant. See https://github.com/denisalevi/brian2-network-multiplier/issues/2

mstimberg commented 2 years ago

Maybe I am missing something here, but I still do not get how this relates to subgroups. If you have 1 million neurons in a group and connect only, say, the last 100 of them to other neurons, why does it matter if you do this via a Subgroup of size 100 or by connecting the indices 999899... of the full group? At least in C++ standalone mode, the exact same code will be run in both cases.

denisalevi commented 2 years ago

Ah yes, now I understand. You are right. As it is right now, there would be no difference between the two cases you show above. In fact, for both cases, one would first assume that all source neurons are required for synapse computations, call the corresponding number of CUDA blocks, and then only proceed for the ones that actually have synapses (for S1) or are in the subgroup and have synapses (for S2).

So, in both cases Brian2CUDA's implementation is pretty bad if you have 1 million neurons and only 100 of them have synapses, because you will launch ~10.000 times as many CUDA blocks as you actually need. Partitioning the eventspace would then allow to only call as many CUDA blocks as there are spiking neurons in the subgroup. This would make connecting subgroups in Brian2CUDA more efficient than connecting subsets of indices from the full neurongroup. But maybe this needs some benchmarking first to see if it is as bad for performance as it sounds. Maybe its not worth it. Or maybe there are other ways to fix that, e.g. some dynamic parallelism approach with as many threads as spiking neurons in the neurongroup, which spawn new kernels only for neurons that are in a subgroup and have synapses.

brian-team / brian2cuda

Consider partitioning eventspaces when using `Subgroup`s #284