genn-team / ml_genn

A library for deep learning with Spiking Neural Networks (SNN).
https://ml-genn.readthedocs.io

Performance investigation #112

Open neworderofjamie opened 2 months ago

neworderofjamie commented 2 months ago

Because EventProp is really fast, Amdahl's law once again strikes and CPU-side overheads start to become problematic, especially when training on large datasets like SSC. Training one batch takes approximately 25 ms, but there are ~2 ms 'gaps' between batches. With 2359 batches in the training set, this corresponds to about 1 minute of actual training computation and about 5 s spent between batches per epoch. Examining this period in Nsight Systems shows the following (the memcpy coming in from the left is the readout; it only appears massive because it was added to the command queue long before it ran, and the actual copy time is the tiny purple bar):

[screenshot: Nsight Systems timeline of the gap between training batches]
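For reference, a quick back-of-the-envelope check (plain Python, numbers taken from the measurements above) of how much of each epoch the gaps account for:

```python
# Per-epoch cost of the inter-batch gaps, using the figures quoted
# above: 25 ms per batch, ~2 ms gap, 2359 batches in the SSC training set.
BATCHES_PER_EPOCH = 2359
BATCH_TIME_S = 25e-3   # GPU compute time per batch
GAP_TIME_S = 2e-3      # CPU-side gap between batches

compute_s = BATCHES_PER_EPOCH * BATCH_TIME_S   # ~59 s of actual training
overhead_s = BATCHES_PER_EPOCH * GAP_TIME_S    # ~4.7 s of idle gaps
print(f"compute: {compute_s:.1f} s, gaps: {overhead_s:.1f} s "
      f"({100 * overhead_s / (compute_s + overhead_s):.1f}% of epoch)")
```

So the gaps waste roughly 7% of every epoch, which only gets worse as the per-batch compute time shrinks further.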

Biggest blocks of GPU time are:

Biggest blocks of CPU time (i.e. GPU idle time) are:

Possible ways to improve these overheads include:

I think that, when balancing performance against maintaining backward compatibility, adding support for copying multiple batches to the GPU at a time, while keeping the current data structure, would probably be the best option; a sketch of the idea follows.
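To illustrate, here is a minimal sketch of that option. Everything in it (the `upload` stub, the `COPY_BLOCK` size, the buffer layout) is hypothetical and not ml_genn's actual API; it just shows how several batches could be gathered contiguously and copied in one transfer while the host-side per-batch data structure stays unchanged, which is what preserves backward compatibility.

```python
import numpy as np

COPY_BLOCK = 8  # hypothetical number of batches staged per transfer

def upload(host_array):
    """Stand-in for a single host->device memcpy (e.g. cudaMemcpy)."""
    pass  # a real implementation would enqueue one transfer here

def train_epoch(batches, batch_size, num_features):
    # Staging buffer large enough for COPY_BLOCK batches; the existing
    # host-side structure (a list of per-batch arrays) is untouched.
    staging = np.empty((COPY_BLOCK, batch_size, num_features),
                       dtype=np.float32)
    for start in range(0, len(batches), COPY_BLOCK):
        block = batches[start:start + COPY_BLOCK]
        for i, b in enumerate(block):
            staging[i] = b            # gather batches contiguously
        # One large copy instead of len(block) small ones, amortising
        # the fixed per-transfer launch latency across the block
        upload(staging[:len(block)])
        for i in range(len(block)):
            pass  # launch training kernels on the i-th staged batch
```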