brian-team / brian2

Brian is a free, open source simulator for spiking neural networks.
http://briansimulator.org

Consider using OpenMP in code generation #89

Closed - thesamovar closed this 9 years ago

yger commented 10 years ago

Hi all

I'm not sure this is so 'easy', but it's worth a try, so I created a local branch to implement it. A naive #pragma omp parallel for in front of all the loops doesn't speed anything up; if anything, it slows everything down. I had never used OpenMP before, but as far as I understand, each time you do that you create threads that are destroyed after the loop, then recreated, and so on at every time step. The overhead is enormous. So the solution I'm aiming for is a single parallel context where threads are created only once, at the beginning of the simulation, and then kept alive the whole time. I've started drafting this, and will test it more carefully to see what speedup one can observe in standalone code.
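To illustrate the difference (a minimal sketch, not the actual generated code; f, v and dt are placeholders for the generated state update):

#include <vector>

// stand-in for a generated per-neuron state update
static inline double f(double v) { return -0.1 * v; }

void run_naive(std::vector<double>& v, double dt, int n_steps) {
    // naive: a thread team is forked and joined at every single time step
    for (int t = 0; t < n_steps; t++) {
        #pragma omp parallel for
        for (int i = 0; i < (int)v.size(); i++)
            v[i] += dt * f(v[i]);
    }
}

void run_persistent(std::vector<double>& v, double dt, int n_steps) {
    // persistent: one parallel region wraps the whole run; the "omp for"
    // work-sharing construct reuses the existing team, and its implicit
    // barrier keeps all threads on the same time step
    #pragma omp parallel
    for (int t = 0; t < n_steps; t++) {
        #pragma omp for
        for (int i = 0; i < (int)v.size(); i++)
            v[i] += dt * f(v[i]);
    }
}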

We'll see

Pierre

thesamovar commented 10 years ago

Hey Pierre, thanks for looking at this. It would be very cool if you could make this efficient. My understanding was that thread creation - unlike process creation - was very lightweight and efficient so that the overheads weren't too bad, but I didn't investigate this very thoroughly. Another possibility is that by using one thread per index you lose the SSE instructions capable of doing 2 or 4 indices simultaneously. This could be handled by parallelising in groups of 2 or 4 maybe. Another possibility is that since the operation is largely memory bound rather than compute bound, we shouldn't expect much speed improvement.
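For reference, one way to express the "groups of 2 or 4" idea is a chunked static schedule (a sketch with placeholder arrays, not Brian code): each thread gets contiguous runs of indices, so the compiler is still free to vectorise within a chunk.

void update(double* v, const double* dv, double dt, int N) {
    // chunks of 4 contiguous indices per thread; SSE can process
    // 2 doubles (or 4 floats) at a time within each chunk
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < N; i++)
        v[i] += dt * dv[i];
}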

yger commented 10 years ago

Hi all

I had a quick look this weekend, and I have something working. However, so far I wouldn't call it "efficient". I still have to benchmark it properly, but for small networks I can't really notice any improvement: run time is about the same with one or multiple threads (which is already better than just adding #pragma omp parallel for everywhere). Note that I focused only on the runtime and did not optimize the construction of the network. I'll double-check it and see if there is anything interesting for larger numbers of neurons and/or synapses, and also that results really are what they are supposed to be. I'll let you know, because maybe you're right about the memory bandwidth. So far, as far as I understand, the clock-triggered operations that cannot be multi-threaded are spike propagation and monitors (structures using DynamicArrays).

Best

Pierre

yger commented 10 years ago

I'll try to create a bunch of tests/benchmarks for my implementation, but a few preliminary tests on larger networks are promising. One second of a classic CUBA network with 10000 cells and 2% connectivity is simulated in 81s with only one thread, and 30s with 4 threads (a sublinear speedup, but still). Connectivity and number of spikes are similar at the end, but this has to be quantified more thoroughly.

thesamovar commented 10 years ago

Very nice work Pierre! :)

Spike propagation and so forth will be more tricky to parallelise (they're not embarrassingly parallel) but there may be ways to make use of multiple threads there too with a bit of work.

yger commented 10 years ago

Thanks :-) Happy to try to bring something to your nice project! So I have a working branch with an OpenMP implementation up and running. Attached are some tests made with two toy networks: the STDP_standalone file, and a CUBA file that I made based on the example from Brian1, but with 10000 cells. Note that the times given are only a global indication, because they also include the compilation time: we should only look at relative changes. Objects that I cannot simply make parallel are StateMonitors, SpikeMonitors, Thresholds, and SpikePropagation. Gains are not always present (a weak effect for the small STDP example, see the attached figures), but can be stronger for larger networks (CUBA). My bet is that this depends heavily on the number of cells/synapses, and deeper investigation is needed. However, results seem to be consistent across the number of threads (random numbers are not controlled between runs, as this is quite preliminary work), and there is a gain! Long live Brian2!

[Attached benchmark figures: stdp_openmp, cuba_openmp]

thesamovar commented 10 years ago

Hey Pierre, I took a look at your code and it seems not too complicated (which is great!). If we use OpenMP with one thread, is it equivalent to not using OpenMP? We should maybe add an option to switch OpenMP support on/off.

yger commented 10 years ago

Hi all

Yes, the code is fairly simple; the parallel environment for now is just created around the run calls. The problem is that when adding a new template, one needs to make it OpenMP-compatible manually, but I guess this cannot really be simplified... I'm now trying to see if object creation could also make use of multiple threads (especially synapse creation). I did not benchmark one-thread OpenMP versus no OpenMP, and I agree that we should have an option to turn it off. The way it is currently implemented in the build() function of device.py is a bit dirty; I just made it that way to test it myself. Maybe we could have an argument to this build() command that defaults to openmp=False; if True, we let OpenMP decide the number of threads, and if an int is provided, that particular number of threads is used. What do you think? Otherwise we would need two variables, one for turning OpenMP on/off and a second for the number of threads, so I was thinking we could merge the two, even if it mixes variable types. So up to you.

I'll polish the code a bit and try some tests with object creation. Will let you know!

mstimberg commented 10 years ago

The problem is that when adding a new template, one needs to make it OpenMP-compatible manually, but I guess this cannot really be simplified...

As we briefly discussed, I think it would be good to add an indicator to templates to show that they are "OpenMP-aware" -- this way we could raise an error if we come across a template that does not have this flag. On the other hand, if we want this as an integral part of standalone we could maybe simply require all templates to be written with it in mind.

Maybe we could have an argument to this build() command that defaults to openmp=False; if True, we let OpenMP decide the number of threads, and if an int is provided, that particular number of threads is used. What do you think? Otherwise we would need two variables, one for turning OpenMP on/off and a second for the number of threads, so I was thinking we could merge the two, even if it mixes variable types. So up to you.

I think we could either have a Brian preference or a keyword for build. The preference would have the advantage that we can re-use it for weave code (but since we cannot have persistent threads in this case, maybe it is not worth it in the first place?). And in this case I would opt for two separate options. If we have a keyword, I would rather not mix boolean with int (except if True==1 and False==0), maybe openmp_threads as a keyword with -1 for an automatic number of threads, 0 for not using it and >0 for a fixed number of threads? This has the advantage that openmp_threads=False would also switch it off.
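In the generated C++ this could boil down to something like the following (a sketch of the convention proposed above; openmp_threads is a hypothetical constant written out by build(), not the actual code):

#include <omp.h>

const int openmp_threads = 4;  // hypothetical value emitted by build()

void configure_openmp() {
    if (openmp_threads > 0)
        omp_set_num_threads(openmp_threads);  // fixed number of threads
    // openmp_threads == -1: leave the choice to the OpenMP runtime
    // openmp_threads == 0: OpenMP off (ideally no pragmas generated at all)
}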

thesamovar commented 10 years ago

As we briefly discussed, I think it would be good to add an indicator to templates to show that they are "OpenMP-aware"

I haven't looked at the code in detail, but could we have some parts that are and some parts that are not OpenMP-aware? I think the decision here should depend on how much complexity adding OpenMP-awareness to a template takes. If it's simple to make a template OpenMP-aware (at least without necessarily doing anything clever with threads), then let's make all templates aware of it.

I think we could either have a Brian preference or a keyword for build.

Agreed! For the number of threads, a convention I've used in the past is that ncpu>0 means use precisely that number of cores (including ncpu=1 for a single process/thread), and ncpu<0 means max_num_cpus+ncpu (so ncpu=-1 leaves one core free). The only weird thing is that using the maximum number of CPUs requires you to specify ncpu=0. What do you think?

mstimberg commented 10 years ago

I haven't looked at the code in detail, but could we have some parts that are and some parts that are not OpenMP aware?

The problem is that non-OpenMP aware code might lead to crashes (or worse, corrupted data) if it e.g. pushes things to a vector because the separation into threads is done globally. I therefore thought that we should rather err on the side of not running some code if it is not "guaranteed" to work. But since it probably only means adding one #pragma omp single to code where we are not certain about its thread-safety, it's probably fine to just take it into account everywhere.
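A minimal sketch of that pattern (monitor_values stands in for, e.g., a monitor's DynamicArray):

#include <vector>

void step(std::vector<double>& monitor_values, double value) {
    #pragma omp parallel
    {
        // ... work-sharing constructs for the thread-safe parts ...

        // code whose thread-safety we cannot guarantee runs on one thread;
        // the implicit barrier at the end of "single" holds the others back
        #pragma omp single
        {
            monitor_values.push_back(value);
        }
    }
}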

Agreed! For the number of threads, a convention I've used in the past is ncpu>0 means use precisely that number of cores (including ncpu=1 for single process/thread), and ncpu<0 means max_num_cpus-ncpu). The only weird thing is that using the maximum number of CPUs requires you specify ncpu=0. What do you think?

Using negative numbers in this way is quite nice; on a machine you use for other tasks as well, you will probably use -1 or -2 quite often. The 0 is not very intuitive, but I think it's ok. Is the "automatically decided number" that OpenMP chooses the same as the full number of cores? If not, we'd need yet another way to choose this option... Otherwise I'd be fine with it; we could use None for switching it off completely, I guess.

thesamovar commented 10 years ago

Agreed on all of that, not sure about the automatically decided number of OpenMP threads - @yger?

yger commented 10 years ago

I'm away at a conference (actually I just arrived, jetlagged, in San Diego), so I won't work much on this topic over the next few days, and I'll be brief :-) I agree with all that has been said. Best would be for me to discuss those "OpenMP-aware templates" with Marcel when I'm back in the lab, because even if I see the point, I don't really know how to code it properly within the brian2 framework. Same for the build() option versus the preference. For now, I'll leave the code as it is, and as you said, it is really not "too complicated". For most of the devices, it is just a matter of adding #pragma omp ... before loops that can safely be made parallel, and #pragma omp single when you're not sure (some loops accessing structures such as vectors can't easily be multi-threaded, and then decisions have to be made). Otherwise, I quite like the idea of having -1 for auto mode (we let OpenMP decide how many threads as a function of the load), 0 for disabling OpenMP, and ncpu > 0 for a given number of threads. Actually, there will be other questions I'll need to ask you guys when I'm back, because even if this is kind-of-working right now, Marcel is right that thread-safety should be tested better, especially for synapses. So I'll design tests addressing those issues.

mstimberg commented 10 years ago

Just discussed something with @yger and writing it down here for future reference: the biggest problem for parallel execution is synaptic code that updates pre-/postsynaptic variables. If it doesn't, it can be executed in parallel without problems. The check whether it does or doesn't can be done in the template (as in the numpy template), by checking that all variables only use the _idx index (some more optimization is possible, though: only reading from pre-/postsynaptic variables is fine). For the more complicated use case (writing to neuronal variables), an easy workaround for better performance might be to use the summed variable mechanism, i.e. replace:

G = NeuronGroup(N, '...g : siemens', ...)
S = Synapses(G, G, '...', pre='g+=w')

by

G = NeuronGroup(N, '...g : siemens', ...)
S = Synapses(G, G, '''g : siemens...
                      g_post = g (summed)''', pre='g+=w')

This would then allow for the synaptic update to be done in parallel with an additional (non-parallel) summation step afterwards. Pierre will run some benchmarks on this.
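In the generated C++ the two phases might look roughly like this (a sketch with hypothetical names, assuming a synaptic conductance g that is summed into the postsynaptic neurons):

#include <algorithm>
#include <vector>

void two_phase_update(std::vector<double>& g_syn,        // per-synapse state
                      const std::vector<double>& w,      // per-synapse weight
                      const std::vector<int>& post_idx,  // synapse -> neuron
                      std::vector<double>& g_neuron)     // per-neuron total
{
    // phase 1: touches only synaptic indices, so it is embarrassingly parallel
    #pragma omp parallel for
    for (int s = 0; s < (int)g_syn.size(); s++)
        g_syn[s] += w[s];

    // phase 2: the summed-variable step runs serially, so no two threads
    // ever write to the same postsynaptic entry
    std::fill(g_neuron.begin(), g_neuron.end(), 0.0);
    for (int s = 0; s < (int)g_syn.size(); s++)
        g_neuron[post_idx[s]] += g_syn[s];
}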

thesamovar commented 10 years ago

Right, but most synaptic code will update pre/postsynaptic variables or it's pretty pointless. ;)

Will be interested to see how breaking it into two phases works though. If it works well it might be something we could consider for GPU implementations too.

mstimberg commented 10 years ago

Right, but most synaptic code will update pre/postsynaptic variables or its pretty pointless. ;)

Sure, but in the standard STDP model, for example, only the "pre" pathway does, "post" only updates synaptic variables.

Will be interested to see how breaking it into two phases works though. If it works well it might be something we could consider for GPU implementations too.

Indeed, I don't have a good intuition about whether this really is a good idea. Another thing we might consider is detecting cases where parallelization is not a problem because each target has at most one incoming synapse (e.g. when using a one-to-one pattern).

Either way, I think the "two phases approach" is a good start since it is straightforward to use and doesn't require changes to code generation. We can always improve on it later.

yger commented 10 years ago

Hi all

So I implemented what Marcel and I spoke about, i.e. a template-based solution that detects whether synaptic updates touch pre-/postsynaptic variables. If they don't, those updates are handled in parallel, split over multiple cores. If they do, only one thread does the job, in a safe manner. Concerning the summed variable, I also benchmarked a 1000-cell all-to-all network with plastic synapses and controlled randomness, in order to be sure that results were consistent across different numbers of threads, and to compare a "summed variable"-like implementation against a naive implementation. The good news is that the results are the same regardless of the implementation or the number of threads, so that is already a good point :-) The OpenMP implementation is therefore consistent. Regarding speed, multi-threading does speed up the simulation, although not drastically for this simple current-based neuron model: the gain is at most a 2x speedup. The implementation with summed variables is also slower, but I'll make more plots to quantify that.

Best,

Pierre

yger commented 10 years ago

I'm also now trying to profile the C++ standalone code in order to see where most of the time is spent, and whether some more optimizations could be achieved with (or without) OpenMP. I'll let you know!

Pierre

thesamovar commented 10 years ago

Sounds good, I'd be interested to see the plots. C++ standalone could surely have some optimisations since we didn't really work on that at all yet.

Another possibility is that we might be able to do some parallelisations on multiple CPUs that aren't particularly efficient on GPU because the memory model is different. For example, I have no idea how slow atomic operations are on CPU but maybe they are fast enough that doing atomic operations on postsynaptic neurons would be efficient?

yger commented 10 years ago

By the way, with OpenMP, atomic operations are something like 25x slower than "real" operations, and more complex sections of code that require a lock (the critical keyword) are said to be around 200x slower. So starting to use those operations destroys all the advantages of parallelism. However, I don't know the cost of such operations on GPU.
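For the record, these are the two constructs being compared (illustrative only; the costs quoted above are order-of-magnitude measurements, and the names below are placeholders):

#include <queue>

void examples(double* v, int post, double w, std::queue<int>& pending, int i) {
    // atomic: a single hardware read-modify-write on one memory location
    #pragma omp atomic
    v[post] += w;

    // critical: a full lock around an arbitrary block, needed for data
    // structures (queues, vectors) that a single atomic cannot protect
    #pragma omp critical
    {
        pending.push(i);
    }
}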

thesamovar commented 10 years ago

Is that true in all cases or only where two threads want to write to the same bit of memory at the same time? This will often be a rare event.

mstimberg commented 10 years ago

I don't think in the current framework atomic operations are what we need. The critical bit of synaptic code (e.g. for v+=w) looks like:

int _postsynaptic_idx = _array_neurongroup_i[_idx];
double v = _array_neurongroup_v[_postsynaptic_idx];
double w = _dynamicarray_synapses_w[_idx];
v += w;
_array_neurongroup_v[_postsynaptic_idx] = v;

Even if the last write were atomic, we'd write v based on a wrong value if another thread updated the same neuron's v in between. And I think if we surround everything with critical, then every benefit of parallelization will completely disappear.

thesamovar commented 10 years ago

Good point, it would have to be a more fundamental rewrite. I feel like this might be an easier problem than doing it on GPU because the memory access patterns aren't as critical, but I don't know much about multi-CPU programming so I'm not sure what would be optimal.

yger commented 10 years ago

Fully agree. This is why I gave up on those atomic operations. Actually, I'm also new to OpenMP, so this is an exploratory implementation :-). Attached you'll find figures for two implementations of a 1000-cell network with all-to-all connectivity and plastic synapses, with the current status of OpenMP. The first implementation uses summed variables, the second doesn't (and is multi-threaded only where no pre/post variables are involved). The details of the panels (vm traces, w traces, spikes) are not crucial; what matters most is that the different colors, for the different numbers of threads, are superimposed. The key point to look at is the speed, bottom right panel. Again, what matters is relative change, because I'm not properly separating building and running time. But note the difference in scale (the non-summed code takes about 5s to run, the summed code about 35s). And the results are the same :-)

[Attached figures: summed_variables, not_summed_variables]

mstimberg commented 10 years ago

Just to make sure, I understand correctly: Using #pragma omp single for "problematic" synaptic code is much faster than doing it in parallel with an additional (non-parallel) summing step afterwards. Correct?

yger commented 10 years ago

That I did not benchmark properly. It depends on the block of code you have to execute, and whether that block involves numerous operations or not. My intuition would be that #pragma omp single is faster for small networks and simple models, but that this could change for more complex cell types or larger numbers of synapses... a useless answer, then...


mstimberg commented 10 years ago

That I did not benchmark properly.

Sorry, but now I'm confused -- I'm only asking about the difference between the two simulations you are showing: the first one is slow and uses the "summed variable trick", i.e. it does the synaptic update in parallel (because it only involves synaptic variables) but then has an additional, non-parallel summed-variable update. The second one is fast and uses the "standard" formulation without summed variables, which means it does the synaptic update non-parallel.

So in sum (just for this one example): parallel update + summed variable update is slower than non-parallel update (with no need for an additional summation step).

yger commented 10 years ago

Sorry, I thought your question was more generic. Yes, you're right. Except that the non-summed one in this particular example is "half" parallel: pre updates are parallel (because this is feasible based on template detection) but post updates are not (because of the g updates). Sorry if that's not clear, I'm writing from my phone... We can discuss it tomorrow. Running speeds are already very different even with one thread, however.


yger commented 10 years ago

Ok, so I think I've ended up with a nice, as-optimized-as-it-can-be OpenMP implementation, and I'll make a (temporary) stop. Because I'm not a template expert, and I'm not sure I properly understand how compilation flags should be handled, there is currently no way to turn OpenMP off completely and remove the #pragma statements from the code (the -openmp flag in the C++ compilation is always there, and if no thread count is specified, only 1 thread is used by default). A bit annoying, but we'll fix that later. The current status of this optimization is that it almost never slows the code down; it can only improve it. The speedup will depend on the problem and the complexity of the model. I multi-threaded synapse creation, threshold updates, state updates, synaptic updates when no pre/post variables are modified, and summed variables. The things I neither optimized nor changed are the synaptic queue and the monitors. Benchmarks showed that results are consistent across different numbers of threads.

thesamovar commented 10 years ago

Thanks Pierre! I guess we should merge this as soon as possible. I'll tag it for the next alpha release. When I get a chance to go through it in more detail, I might have a few questions about what you've done.

yger commented 10 years ago

Hi all (again)

So, thanks to Marcel, I also made the spike queues multi-threaded, and again some speed can be gained here, without loss of flexibility. I had to modify the device.py file a bit in order to provide the number of threads to the constructor of the synaptic pathway, and that's it. Note that in all this work I did not attach much importance to code polishing or to integration without OpenMP, so that still needs to be done. But no hurry. The last thing that is a bit annoying, and that could maybe be optimized (I'm starting to get picky, because there is almost nothing left outside the multi-threaded loops), is the synaptic update when pre-/postsynaptic indices are used. We could parallelize that by sorting synapses as a function of their pre-/postsynaptic targets. So far, if I understood correctly, synapses are sorted by presynaptic source (which is a bit sad, because I guess concurrent postsynaptic accesses are more likely to occur). But maybe we could sort them as a function of postsynaptic targets, and then parallelize. Not sure it will change much, but since this is starting to be sort of a game, I'll give it a try :-)

thesamovar commented 10 years ago

Sounds good. Resorting indices shouldn't be a problem - I don't think we make any assumptions on their order anywhere.

mstimberg commented 10 years ago

To quickly recap what I discussed with Pierre: I think in general (i.e. also in runtime mode) it would be a good idea to have a consistent, reproducible ordering of synapses. I'm afraid sooner or later users will do something like S.w = array_of_values, assuming that the order they see will always be like that. While it may be reasonable to assume that S.connect([0, 1, 2], [3, 4, 5]) will create the synapses exactly in the order given, it is not at all clear what the "natural" order of synapses is for S.connect(True), for example, and it might well differ between numpy, weave and standalone. My suggestion would therefore be to always sort synapses according to the postsynaptic target, because updating postsynaptic variables will be by far the most common use case (even without parallelization it might be beneficial for memory access). This is not entirely trivial, since after every synapse creation we potentially have to sort not only the synaptic indices but also all existing synaptic state variables (weights, delays, ...), because they might have been assigned values earlier. It's certainly feasible: we have references to all concerned variables in the Synapses._registered_variables attribute, and we already take care of resizing them after every creation of synapses.

To summarize: I think the best place to do the sorting would be at the end of the synapses_create template, making sure that all dependent synaptic variables are accordingly resized and rearranged.
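A minimal sketch of that resorting step (names hypothetical, not the actual template code): compute the permutation that orders synapses by postsynaptic target, then apply it to the index arrays and to every registered per-synapse state variable.

#include <algorithm>
#include <numeric>
#include <vector>

void sort_by_post(std::vector<int>& pre, std::vector<int>& post,
                  std::vector<std::vector<double>*>& state_vars) {
    // argsort by postsynaptic index (stable, so creation order breaks ties)
    std::vector<size_t> order(post.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](size_t a, size_t b) { return post[a] < post[b]; });

    // apply the permutation to the index arrays ...
    auto permute_ints = [&](std::vector<int>& x) {
        std::vector<int> tmp(x.size());
        for (size_t k = 0; k < order.size(); ++k) tmp[k] = x[order[k]];
        x.swap(tmp);
    };
    permute_ints(pre);
    permute_ints(post);

    // ... and to every registered state variable (weights, delays, ...)
    for (std::vector<double>* var : state_vars) {
        std::vector<double> tmp(var->size());
        for (size_t k = 0; k < order.size(); ++k) tmp[k] = (*var)[order[k]];
        var->swap(tmp);
    }
}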

thesamovar commented 10 years ago

I feel like insisting on a certain order could be a guarantee that ends up being expensive, and we may not make the right choice for computational efficiency in one situation or another. I'd be happier insisting that users do not make any assumptions about the order of synapses. With this approach, we could comfortably resort variables at the beginning of a run, so that if the user has called the synapses_create template multiple times they only get the cost of the resort once.

mstimberg commented 10 years ago

I'm not sure. I don't think it's necessarily expensive, since sorting is fast and we only do it after a synaptic create. We can also be clever about it and only sort if we have to (i.e. make sure that all string-expression-based statements already generate sorted synapses, and in the end check that the first new synapse is >= the last old synapse). In general I like the "the user should not make any assumptions about sorting order" approach, but at some point users will want to set synaptic variables based on an array of values, and I think the S.w = array syntax is the most natural syntax for it. If the sort order is clear, there's no ambiguity. Sorting before a run also means that a simple w_before_training = S.w[:]; run(...); w_after_training = S.w[:] will get odd results. Of course there's always a way to do this properly, but from Brian 1 experience it can get very confusing if the network changes just because of a run (even if it is a run(0*ms)). Especially for debugging, it is helpful to see synaptic state variables as an array whose order may be arbitrary but does not change.

About computational efficiency: I think sorting on the postsynaptic index is optimal for the typical "update postsynaptic variable on an incoming spike" usecase. It might not be optimal for other use cases but I don't see us doing anything more fancy in the near future, anyway.

thesamovar commented 10 years ago

Sorting is reasonably fast, although still N log N. In addition to the sort, you'll need to do an array copy for each variable each time you resort, which is N*M for M variables. What about the case where the user constructs synapses in a loop because they can't get the string syntax to do what they want? This could end up meaning that the sorting has to be done num_neurons times. Also, keeping things in sorted order after each operation requires us to ensure that any operation that creates synapses leaves them in the right order - I guess there is more than one way to create synapses, right? Or does it all come back to the same template? - and that means making sure in our code that each way of creating synapses does correctly leave them sorted, which is a maintenance burden and a potential source of bugs.

I'm willing to be convinced but I think it's a big decision and one we need to think through very carefully before we make it. Maybe worth opening a new issue for it?

mstimberg commented 10 years ago

Just very quickly because I have to go (and yes, better open an issue and discuss it thoroughly):

Sorting is reasonably fast, although still N log N. In addition to the sort, you'll need to do an array copy for each variable each time you resort, which is N*M for M variables.

I think it's rather one or the other: if we use a comparison sort it is N log N, but then we can do the sorting in place (though we'll also need space to store the indices, to resort the other variables). Or we can use something like bucket sort (since the values we sort are indices into the array), but then we need additional memory.

yger commented 10 years ago

Ok, maybe all of this won't really be necessary. I did preliminary tests/benchmarks, and I'm not sure it is worth changing everything just for the standalone case, because there is not much of a gain. I'll investigate more, but I agree that sorting synapses could become very confusing, and I would rather have it hidden from the user's point of view. If we do not gain a lot by imposing such a hard constraint, then I don't think it is worth doing.

mstimberg commented 10 years ago

If it doesn't gain much, then indeed we should not come up with a complicated solution :) In that case, I think I'd vote for the following approach:

  1. we never re-sort synapses, i.e. the order of synapses always stays the same.
  2. synapses created explicitly with indices are created in the order given
  3. we make some guarantee for the order of synapses created by string expressions (e.g. ordered by postsynaptic target) -- but this guarantee only applies to a single connect, the synapses from multiple connects are just concatenated.

This gives users all the information they need to set weights in explicit ways (e.g. with a flattened weight matrix) and should never lead to any unexpected changes in the order of synapses. If at some point it turns out that sorting by postsynaptic target does indeed have a significant benefit (e.g. because of cache issues) then we could also consider just checking for the order of synapses and raising a warning if they are not correctly sorted. I think in many cases it should be possible for the user to rewrite the connect calls to get the correct order.

thesamovar commented 10 years ago

That works for me. We could also have an explicit sort_synapses method that the user could call optionally.

mstimberg commented 10 years ago

We could also have an explicit sort_synapses method that the user could call optionally.

Good idea, the warning could then explicitly mention "consider running S.sort_synapses() for better performance" or something like that.

thesamovar commented 10 years ago

Yeah, and in some situations (e.g. GPU) where a particular order was required, we might raise an error rather than a warning.

mstimberg commented 10 years ago

I rebased @yger's branch on current master and merged all the commits into one (this makes it easier for later rebasing). I did not check it thoroughly, but it seems to work. Now what is missing before merging it into master is mostly how to configure it and how to switch it off completely (i.e. to make the code not contain any #pragma statements if openmp is switched off and no -fopenmp in the compiler flags).

thesamovar commented 9 years ago

This is all done now, right? Closing assuming it is done - reopen if not.