Our synapse generation uses the exact same algorithm as `cpp_standalone`, all running on the host. Since synapse creation happens only once, before the network is run, this doesn't matter much for most networks. But for large networks, synapse creation can take some time. The cleaner solution would be to generate synapses directly on the device, parallelize the algorithm (if that is possible; the implementation needs to be checked) and at the same time avoid some of the unnecessary host/device/host copying that currently happens.
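For probabilistic connectivity (`connect(p=...)`), device-side generation could in principle be embarrassingly parallel. A minimal sketch of the idea, purely hypothetical and not the actual brian2cuda or `cpp_standalone` algorithm: one thread per `(pre, post)` pair draws a Bernoulli sample with the cuRAND device API and atomically appends accepted pairs to the synapse arrays.

```cuda
#include <curand_kernel.h>

// Hypothetical sketch: each thread decides whether the (pre, post)
// pair it owns becomes a synapse. Random numbers are generated
// on-the-fly on the device, so nothing crosses the host boundary.
__global__ void generate_synapses(float p, int n_pre, int n_post,
                                  unsigned long long seed,
                                  int* pre_ids, int* post_ids,
                                  int* num_synapses)  // global counter
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pre * n_post) return;

    // independent subsequence per thread for reproducible streams
    curandState state;
    curand_init(seed, i, 0, &state);

    if (curand_uniform(&state) < p) {
        // atomically reserve a slot in the (pre-allocated) arrays
        int idx = atomicAdd(num_synapses, 1);
        pre_ids[idx]  = i / n_post;
        post_ids[idx] = i % n_post;
    }
}
```

Note the trade-offs such a scheme would have to address: the synapse order becomes nondeterministic (the `atomicAdd` slot assignment depends on thread scheduling), and arbitrary `connect` conditions beyond a fixed `p` would need the condition evaluated per pair on the device.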
See also #88 for another slow synapse generation issue.
If we used device-side synapse generation, we could get rid of the host-side random number buffer (in `brianlib/curand_buffer.h`) and either use a device-side buffer (which we could then also use for our binomial function implementation) or use the cuRAND device-side API for on-the-fly RNG.