brian-team / brian2cuda

A brian2 extension to simulate spiking neural networks on GPUs
https://brian2cuda.readthedocs.io/
GNU General Public License v3.0

atomics in `vector_code` for more parallelisation #78

Closed denisalevi closed 6 years ago

denisalevi commented 6 years ago

Currently, synaptic effect application is serialized whenever race conditions are possible. But since in many cases memory write conflicts occur only rarely, we could retain parallelisation by using atomics.

In general, synaptic effect application looks like this:

for(int j=threadIdx.x; j<num_synapses; j+=THREADS_PER_BLOCK)
{
    // _idx is the synapse id
    int _idx = propagating_synapses[j];

    {{vector_code}}
}

where propagating_synapses are either the synapses corresponding to the spiking neurons in the eventspace (for homogeneous delays) or the synapses in the current spikequeue (for heterogeneous delays), and THREADS_PER_BLOCK is either 1 (full serialization) or the maximum allowed number of threads per block (full parallelisation, one thread per synapse).

Now the {{vector_code}} block for a simple synapse model, e.g.

Synapses(neurongroup, neurongroup, on_pre='v += 1')

with neurongroup variable v, would look like this:

int _postsynaptic_idx = _ptr_array_synapses__synaptic_post[_idx];
double v = _ptr_array_neurongroup_v[_postsynaptic_idx];
v += 1;
_ptr_array_neurongroup_v[_postsynaptic_idx] = v;

As long as each _idx (synapse ID) maps to a unique _postsynaptic_idx, we can fully parallelise (one thread per synapse). Otherwise, the last line can create a race condition and we need to fully serialise (one thread loops through all synapses).

But by changing the vector_code to

int _postsynaptic_idx = _ptr_array_synapses__synaptic_post[_idx];
atomicAdd(&_ptr_array_neurongroup_v[_postsynaptic_idx], 1);

we could fully parallelize even if we don't have a one-to-one mapping from _idx (synapse ID) to _postsynaptic_idx.
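Putting the pieces together, the fully parallel effect application would look something like the following sketch (pointer and variable names follow the generated-code examples above, but this is not the actual generated code):

```cuda
// Sketch: fully parallel synaptic effect application with atomics.
// One thread per synapse; atomicAdd serialises only the colliding
// writes, so threads with distinct targets still run in parallel.
for (int j = threadIdx.x; j < num_synapses; j += THREADS_PER_BLOCK)
{
    int _idx = propagating_synapses[j];  // synapse ID
    int _postsynaptic_idx = _ptr_array_synapses__synaptic_post[_idx];
    // the read-modify-write of v happens as one indivisible operation
    atomicAdd(&_ptr_array_neurongroup_v[_postsynaptic_idx], 1.0);
}
```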

Question is:

In this simple example this works fine, but what happens with other, more complicated models, e.g. where synaptic and postsynaptic variables are updated depending on each other? Can we always translate those into some kind of atomic operation? And how is this compatible with brian2?

Cases where we could achieve performance gain from this:

heterogeneous delays

Since a single spikequeue is filled with synapses from source neurons having spiked at different time steps (due to different delays), multiple of these synapses might apply effects to the same target neurons and therefore can't be applied in parallel. If we could use atomics for synaptic effect application, we could parallelize.

homogeneous delays with multiple synapses for the same source/target neuron pair

We currently need to serialize since different synapses apply effects to the same target neuron.

homogeneous delays without multiple synapses for the same source/target neuron

Currently, for the application of synaptic effects we loop through the eventspace globally, and for each spiking neuron we parallelise over all synapses of that neuron (since each synapse applies effects to a unique target neuron). With atomics we could also parallelise over spiking neurons (by launching more blocks in parallel).

denisalevi commented 6 years ago

Note: there is no atomicAdd() for double-precision floating-point numbers on GPUs with compute capability < 6.0, but it can be implemented via atomicCAS() as shown in the CUDA programming guide.
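For reference, the workaround from the CUDA C++ Programming Guide emulates double-precision atomicAdd with a 64-bit compare-and-swap loop (only needed for architectures below 6.0):

```cuda
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
// double-precision atomicAdd emulated via a 64-bit atomicCAS loop,
// as given in the CUDA C++ Programming Guide
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
        (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // swap in (assumed + val) only if *address still holds
        // `assumed`; otherwise retry with the value another thread wrote
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(
                            val + __longlong_as_double(assumed)));
    } while (assumed != old);  // compares bit patterns, so NaN-safe
    return __longlong_as_double(old);
}
#endif
```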

denisalevi commented 6 years ago

Update after talking to @mstimberg :

It's generally not difficult to add atomics to the {{vector_code}}. This should be done in the CUDACodeGenerator. The NumpyCodeGenerator already does something similar by using ufunc.at when multiple synapses modify the same variables. Check out NumpyCodeGenerator.vectorise_code() and translate_one_statement_sequence in CUDACodeGenerator.

It's more a question of when we can actually use atomics and when not. The numpy code generation target already checks for cases where parallelisation is not possible and falls back to Python loops (as far as I understood). For the corresponding pull request and discussion see brian-team/brian2#531.

Some preliminary thoughts for brian2cuda:

  1. In cases where we serialise only because a target variable is modified by multiple threads, e.g. a synapse with
    V_post += const

    and heterogeneous delays (as in BrunelHakimHeterogeneousDelays), we can just use atomics as explained above.

  2. If the right hand side of the modification depends on a previously modified variable, e.g.
    A_post += const 
    V_post += A_post 

    we could parallelise the first statement with atomicAdd as in 1., but the second statement would need to know the total effect on A_post from all threads. Therefore we would either need to synchronise between the two statements or serialise. If we kept the post_neuron block structure we currently have, we could use __syncthreads() in the target-mode case, since each block works on its own set of post_neurons (though the block structure might be unnecessary in case 1.). For compute capability 6.x and higher, cooperative groups could be used instead.

  3. For other cases we could either just print a warning and serialise, or use the parallelisation approach described in #15.
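A minimal sketch of case 2, assuming the current block structure in which each block owns a disjoint set of postsynaptic neurons (all names here are hypothetical, not the actual generated code):

```cuda
// Hypothetical sketch for case 2: A_post += c; V_post += A_post.
// Assumes each block processes only synapses whose targets belong
// to that block, so a block-level barrier is sufficient.
__global__ void apply_effects(const int* post_idx, double* A, double* V,
                              int num_synapses, double c)
{
    // statement 1: every synapse adds its contribution atomically
    for (int j = threadIdx.x; j < num_synapses; j += blockDim.x)
        atomicAdd(&A[post_idx[j]], c);

    // barrier: statement 2 reads A_post, so all of statement 1's
    // writes within this block must be visible before proceeding
    __syncthreads();

    // statement 2: executed once per synapse, as in the vector code
    for (int j = threadIdx.x; j < num_synapses; j += blockDim.x)
        atomicAdd(&V[post_idx[j]], A[post_idx[j]]);
}
```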

To identify when we can use atomics and when we have to, e.g., use synchronisation between statements, we should check what brian2genn does (it uses atomics, and the kinds of synaptic modifications it allows are quite restricted; this might be exactly the set of equations for which we can use atomics too), check the NumpyCodeGenerator linked above, and also check out the will_write and will_read sets defined in make_statements.

moritzaugustin commented 6 years ago

We should add a user preference (useful at least for our speedtests): one value enabling atomics (falling back to non-atomic code where not applicable), the other forcing non-atomic code.

denisalevi commented 6 years ago

Atomics are implemented. We have a user preference prefs['codegen.generators.cuda.use_atomics'] to turn off atomics usage (default True). If atomics are used, we parallelise

  1. synaptic effect application in all cases where we previously serialised because of potential race conditions (specifically for heterogeneous delays in target mode)
  2. the effect application over spiking neurons in the homogeneous delay case
  3. cases where we have multiple pre/post neuron connections in target mode.

See PR #139 for implementation discussion and details. Closing.