shamoons opened 9 years ago

This is more of a feature request than anything, but perhaps this can be set up to do parallel processing?

It is possible: neurons in the same layer are independent from each other, so they could be computed in a parallel fashion (WebCL maybe?), but I haven't found the time yet to explore this approach. This could increase the speed dramatically, since layers are currently computed sequentially, so computing a 1024-neuron layer takes 1024 iterations...
I was thinking of the cluster model. I'm happy to work on this and submit a PR. Can you guide me on where to get started?
What about something like https://github.com/adambom/parallel.js
Yes, that would be great (:
Well, the parts of the network that can be parallelized, as I said before, are the layers. The network is organized as a set of layers that are computed in a fixed order. The output of one layer becomes the input of the next one (not always, though; depending on the architecture it can get more complicated, as in recurrent NNs or LSTMs), so you need to fully compute previous layers before starting to compute the current one. But the neurons within a layer can be activated/computed in any order; they are independent from each other, so this is the part of the work that could be split among different workers/CPUs, the GPU, or any model you may want to implement. You could start by checking the source code of the Layer class.
I don't know if the overhead of spawning a worker would take longer than the computation of the neuron itself. I think the only way parallelizing this would make it faster is if hundreds or thousands of neurons got computed at the same time, which would be possible using the GPU, since the computation of a single neuron is actually very simple (just some sums and a few multiplications). This would also make it possible to have huge layers (right now layers are limited to a couple dozen neurons before getting too laggy).
If you plan to give this a try and have any questions let me know and I'd be happy to help you out if I can.
At the propagate loop (https://github.com/cazala/synaptic/blob/master/src/layer.js#L46), instead of using a for loop to iterate, what about splitting it up into an array of ids? Then use the parallel.js map to do it across multiple cores. This way, we wouldn't spin up a new worker for each calculation, but rather run a batch of calculations per core. What do you think?
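Something like this, perhaps (a rough sketch; it assumes each neuron's activation can be expressed as a plain, serializable snapshot of its inputs, weights and bias, which is not how Neuron objects are stored today):

var Parallel = require('paralleljs');

// one entry per neuron in the layer (hypothetical data layout)
var neurons = [
  { inputs: [0, 1], weights: [0.3, -0.2], bias: 0.1 },
  { inputs: [0, 1], weights: [0.5, 0.4], bias: -0.3 }
];

new Parallel(neurons).map(function (n) {
  // weighted sum + logistic squash, computed inside a worker
  var sum = n.bias;
  for (var i = 0; i < n.inputs.length; i++) sum += n.inputs[i] * n.weights[i];
  return 1 / (1 + Math.exp(-sum));
}).then(function (activations) {
  console.log(activations); // activation of every neuron in the layer
});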
Yes, you could replace the for loops in the propagate and activate functions, but then in order to get this code executed you would need to set the network to be unoptimized, myNetwork.setOptimize(false), and then the network becomes several times slower...
Check this part of the wiki.
You will note that if you print to the console the activate or propagate functions of a network before and after activating the network for the first time, the output will look like this:
Before the first activation: console.log(myNetwork.activate);
function (t){if(this.optimized===!1){this.layers.input.activate(t);for(var e in this.layers.hidden)this.layers.hidden[e].activate();return this.layers.output.activate()}return null==this.optimized&&this.optimize(),this.optimized.activate(t)}
After the first activation: myNetwork.activate([0,0]); console.log(myNetwork.activate);
function (input){
F[1] = input[0];
F[2] = input[1];
F[4] = F[5];
F[5] = F[6];
F[5] += F[1] * F[7];
F[5] += F[2] * F[8];
F[3] = (1 / (1 + Math.exp(-F[5])));
F[9] = F[3] * (1 - F[3]);
F[10] = F[1];
F[11] = F[2];
F[16] = F[17];
F[17] = F[18];
F[17] += F[1] * F[19];
F[17] += F[2] * F[20];
F[15] = (1 / (1 + Math.exp(-F[17])));
F[21] = F[15] * (1 - F[15]);
F[22] = F[1];
F[23] = F[2];
F[27] = F[28];
F[28] = F[29];
F[28] += F[1] * F[30];
F[28] += F[2] * F[31];
F[26] = (1 / (1 + Math.exp(-F[28])));
F[32] = F[26] * (1 - F[26]);
F[33] = F[1];
F[34] = F[2];
F[38] = F[39];
F[39] = F[40];
F[39] += F[3] * F[13];
F[39] += F[15] * F[25];
F[39] += F[26] * F[36];
F[37] = (1 / (1 + Math.exp(-F[39])));
F[41] = F[37] * (1 - F[37]);
F[42] = F[3];
F[43] = F[15];
F[44] = F[26];
var output = [];
output[0] = F[37];
return output;
}
This optimized code is built using Neuron.optimize(): the Network object collects this code from all its neurons and concatenates it together to make the final optimized code that you are seeing above. So in order to parallelize this I guess you will need to modify the Network.optimize() method to split the optimized code of the neurons among workers, instead of just concatenating it all together.
I don't know how this optimize works. Can you explain the rationale behind it?
@shamoons: Neuron.optimize() hardcodes the behaviour of the specific neuron into the minimum number of operations. For example, this is how the state of a neuron is computed at each step:
this.state = this.selfconnection.gain * this.selfconnection.weight * this.state + this.bias;
If a connection is not gated, the gain of that connection is always 1, and if a neuron is not self-connected, its self-connection weight is always 0, so for most neurons we'd be doing:
this.state = 1 * 0 * this.state + this.bias;
which could be translated just as:
this.state = this.bias
That kind of situation happens in many parts of the code, especially where connections are not gated and/or neurons are not self-connected (most cases).
And then Network.optimize() takes the optimized chunks of code from all the neurons in the network, concatenates them together into one big single function (without for-loops or function calls inside), and replaces all the variables with references into a Float32Array called F (that's why you see F[#] all over the code) to make access to the vars and the computation faster (it's always quicker to access a given index in a typed array like Float32Array than to access this.some.nested.property[index]).
This feature makes the network several hundred times faster; you can test it by creating a network, cloning it with Network.clone(), and then calling Network.setOptimize(false) on the clone.
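For example, a quick comparison could look like this (a sketch, assuming the synaptic globals are in scope; it just times the test described above):

// clone the network and disable optimization on the clone
var net = new Architect.Perceptron(2, 3, 1);
var unoptimized = net.clone();
unoptimized.setOptimize(false);

console.time('optimized');
for (var i = 0; i < 10000; i++) net.activate([0, 1]);
console.timeEnd('optimized');

console.time('unoptimized');
for (var i = 0; i < 10000; i++) unoptimized.activate([0, 1]);
console.timeEnd('unoptimized');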
@cazala: If the optimized function is what's running at each step, can that be made parallel somehow?
@shamoons: Yes, I think the best approach would be to modify Network.optimize() to keep the hardcoded code separated by layers, instead of concatenating it all together.
Let's say you create a simple perceptron: var net = new Architect.Perceptron(2,3,1);
The neurons within the same layer can be computed independently, so the 3 units in the hidden layer can be parallelized. After optimization, the activation function of the perceptron described above would look like the one shown earlier in this thread.
You can see at the beginning neurons 1 and 2 (they are the neurons from the input layer, so their activation value comes from the environment; that's why their value is not computed but comes from the arguments of the function); neurons 3, 4, and 5 are in the hidden layer, and 6 is the only neuron in the output layer.
The computation of neurons 3, 4, and 5 could be parallelized, since their output only depends on the values of the input layer. After computing the whole hidden layer (3 neurons), we could do the same for the output layer (in this case there's only one neuron, but if there were more they could be computed in a parallel way).
On a simple perceptron like this one it might not improve the performance so drastically, but if the network has hundreds or thousands of neurons per layer and multiple hidden layers, it would make a difference. At some point I was planning to implement a model of a parallel layer that would execute on the GPU and process 1024 neurons all at once, but I never got the time to actually play around with WebCL.
I'm really digging parallel.js - it seems like a pretty easy way to take advantage of multiple cores. Eventually GPUs would be nice, but I think taking full advantage of multicore CPUs is a good start.
If you could guide me a bit, I'd be happy to take a crack at it.
@shamoons okay, I'll try to modify the code of Network.optimize() and Neuron.optimize() a little bit to provide the chunks of code separated by Layers and Neurons instead of all together in one huge array.
Right now Network.optimize() generates something like this:
sentences = [
  "F[1] = input[0];",
  "F[2] = input[1];",
  "F[4] = F[5];",
  ...
]
I could organize it by layers, and within each layer, by neurons:
sentences = {
  0: {
    0: ["F[1] = input[0];", "F[2] = input[1];", "F[4] = F[5];"],
    1: [ ... ],
  },
  ...
}
This way, you know which chunks of code can be executed in parallel. The layers have to be executed in order, but the neurons within each layer can be executed in any order, or in parallel. Say you could execute sentences[0][0], sentences[0][1], sentences[0][2], sentences[1][0] or sentences[0][2], sentences[0][0], sentences[0][1], sentences[1][0], but you can't do sentences[0][0], sentences[1][0], sentences[0][1], sentences[0][2].
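To illustrate the ordering constraint, something along these lines could drive the execution (a sketch; runInWorker is a hypothetical helper that executes one neuron's sentences off the main thread and returns a Promise):

function activateInOrder(sentences, runInWorker) {
  // layers must run sequentially...
  return Object.keys(sentences).reduce(function (previous, layer) {
    return previous.then(function () {
      // ...but the neurons within one layer can run concurrently
      return Promise.all(Object.keys(sentences[layer]).map(function (neuron) {
        return runInWorker(sentences[layer][neuron]);
      }));
    });
  }, Promise.resolve());
}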
@cazala that would be great. Then I can split the layers up across multiple cores. Presumably there will be a lot of neurons per layer (depending on your network), so we can batch up neurons / numCPU per core.
@shamoons I just pushed an update to master. Now inside Network.optimize() you have access to three arrays containing all the hardcode: activation_sentences, trace_sentences and propagation_sentences. Those arrays are multidimensional and contain all the hardcoded sentences organized by the following indexes: array[layer][neuron]. So, for example, to get all the activation sentences for the first neuron of the third layer you would do activation_sentences[2][0], and you would get an array with all the sentences as strings.
To build the activation function of a neuron you have to concat all its activation sentences followed by all its trace sentences. Here is an example of how the activation function of the whole network, Network.activate(), is built (all the neurons concatenated):
for (var currentLayer in optimized.activation_sentences) {
  if (optimized.activation_sentences[currentLayer].length > 0) {
    for (var currentNeuron in optimized.activation_sentences[currentLayer]) {
      hardcode += optimized.activation_sentences[currentLayer][currentNeuron].join(" ");
      hardcode += optimized.trace_sentences[currentLayer][currentNeuron].join(" ");
    }
  }
}
Instead of concatenating them all together into one function, you could create separate ones and compute them in parallel.
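For instance, a per-layer version could look roughly like this (a sketch; the variable names follow the snippet above, and layerActivations is made up):

// build one activation function per layer instead of one big one
var layerActivations = {};
for (var currentLayer in optimized.activation_sentences) {
  var body = '';
  for (var currentNeuron in optimized.activation_sentences[currentLayer]) {
    body += optimized.activation_sentences[currentLayer][currentNeuron].join(' ');
    body += optimized.trace_sentences[currentLayer][currentNeuron].join(' ');
  }
  // F (the shared typed array) is passed in as a parameter instead of
  // living in the closure of one giant function
  layerActivations[currentLayer] = new Function('F', body);
}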
In order to build the propagation function, Network.propagate(), you just have to concat all the propagation sentences, like:
for (var currentLayer in optimized.propagation_sentences)
  for (var currentNeuron in optimized.propagation_sentences[currentLayer])
    hardcode += optimized.propagation_sentences[currentLayer][currentNeuron].join(" ") + " ";
You can check this part of the source code to see how it works, from line 138 to line 172.
One other issue is that we would need to make it either callback- or promise-based, because currently it is synchronous and callers expect a synchronous return value.
I will start to work on some changes and push my changes to https://github.com/shamoons/synaptic/tree/parallel/multi-core-processing
Basically, we should have as many functions as layers, is that correct? So instead of var constructor = new Function(hardcode); which puts it all together, we should have a new Function per layer that runs activations and propagations for that particular layer.
@cazala this may not be possible: https://github.com/adambom/parallel.js/issues/97#issuecomment-80540243
@shamoons mmh, this is probably really bad from a performance standpoint so it's probably not a solution, but if you can't send the Function object you could still send it as a string, right? Like:
var a = function(){ console.log('hi'); } // the function
var b = a.toString().split('{')[1].split('}')[0]; // pass this string to the worker
var c = new Function(b); // recreate the function within the worker
c(); // hi
Just saying, but doing that probably takes more time than what we save with the parallel execution.
@cazala interesting strategy. Perhaps, instead of defining the function once (https://github.com/cazala/synaptic/blob/master/src/network.js#L174), the string can be passed to the workers and they can create their own functions? This way, we don't have to decompose the function and then recompose it.
@shamoons yes, actually that makes a lot more sense, since we already have all the sentences as arrays of strings; they just need to be split and distributed among their respective workers, which can then build and store their own function(s).
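For example, the worker side could be as simple as this (a sketch; the message format is made up):

// inside a Web Worker (hypothetical message format)
self.onmessage = function (e) {
  if (e.data.sentences) {
    // build the function once from the hardcoded sentence strings...
    self.activateChunk = new Function('F', e.data.sentences.join(' '));
  } else if (e.data.F) {
    // ...then reuse it on every activation, sending the updated memory back
    self.activateChunk(e.data.F);
    self.postMessage(e.data.F);
  }
};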
@cazala how about something like this?
var hardcode_layers = {};
var hardcode = "";
hardcode += "var F = Float64Array ? new Float64Array(" + optimized.memory + ") : []; ";
for (var i in optimized.variables)
  hardcode += "F[" + optimized.variables[i].id + "] = " + (optimized.variables[i].value || 0) + "; ";
hardcode += "var activate = function(input){\n";
for (var i in optimized.inputs)
  hardcode += "F[" + optimized.inputs[i] + "] = input[" + i + "]; ";
for (var currentLayer in optimized.activation_sentences) {
  // each layer's function starts from the shared preamble built above
  hardcode_layers[currentLayer] = hardcode;
  if (optimized.activation_sentences[currentLayer].length > 0) {
    for (var currentNeuron in optimized.activation_sentences[currentLayer]) {
      hardcode_layers[currentLayer] += optimized.activation_sentences[currentLayer][currentNeuron].join(" ");
      hardcode_layers[currentLayer] += optimized.trace_sentences[currentLayer][currentNeuron].join(" ");
    }
  }
}
@shamoons that looks about right, but there you are only building the activation function; you will still need to hardcode an individual propagation function for each layer as well.
I will do the unoptimized version first. I think that happens in layer.js: https://github.com/cazala/synaptic/blob/master/src/layer.js#L46 and https://github.com/cazala/synaptic/blob/master/src/layer.js#L22. Is that right?
Yes, that's right.
@cazala you can see my work so far at https://github.com/shamoons/synaptic/tree/multi/worker-architecture
I am introducing Promises, specifically via bluebird. Any objections or thoughts on that? Or would you rather it be callback-based?
Okay, another issue I'm having. I managed to get the function into a worker: https://github.com/shamoons/synaptic/blob/multi/worker-architecture/src/layer.js#L74. The problem is passing the Neuron itself to the worker. Since a Neuron has selfconnections, if I simply JSON.stringify it, I get circular reference errors. Any thoughts on alleviating this?
+1 for bluebird and Promise-based. For the circular-JSON issue, there are workarounds, but I've never used them and I don't know an easy way to use them from within the Worker, other than including the source code in the Worker scope itself...
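One generic workaround is a stringify replacer that skips anything already visited (a sketch; note it breaks cycles by dropping repeated references, so the worker would need to rebuild selfconnections on its side):

function stringifySafe(obj) {
  var seen = [];
  return JSON.stringify(obj, function (key, value) {
    if (typeof value === 'object' && value !== null) {
      // skip objects we've already serialized (this is what breaks the cycles)
      if (seen.indexOf(value) !== -1) return undefined;
      seen.push(value);
    }
    return value;
  });
}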
@cazala hmmmm, the circular-JSON issue still isn't really resolved with that package. I really think that multi-threading would make this so much better. I keep running into problems!
One possibility is to require a file (http://adambom.github.io/parallel.js/) or some business with eval. I'll keep searching for a good solution.
Hey guys, I briefly scanned this thread and thought I might suggest trying out a library I've been working on for parallelizing JS calculations like this: https://github.com/awlange/mathworkers
I've got it working for both node.js cluster and HTML5 WebWorkers. And I've been able to confirm it does speed up some common linear algebra operations. Let me know what you think. There's a chance it could be useful. Maybe, maybe not. Just thought I'd suggest it.
@cazala looks like we can use this to create the optimized version. @awlange has done some awesome work, it seems! Question for you @awlange: what sort of data can be passed to a worker?
Thanks for taking a look at my project!
You can send any JSON-serializable data between the coordinator and its workers via the sendDataToWorkers() and sendDataToCoordinator() functions (see the JSDocs at http://www.mathworkersjs.org/static/doc/index.html for the Coordinator and MathWorker classes). Those methods can be used for passing parameters, small objects, and such, but I don't recommend passing large arrays of numbers that way: JSON serialization costs can kill the performance gains of the parallel computation. So, for passing numerical data, I recommend adapting the data to use the Vector and/or Matrix classes, which ultimately just wrap a Float64Array object. With the Float64Array, MathWorkers uses WebWorkers context switching for fast communication, avoiding JSON serialization. And for node.js, I convert the ArrayBuffer into a string and pass the string, which, even though it might not seem like it, is much faster than JSON serialization, too.
It seems like most of the data you would want Synaptic to compute in parallel is indeed data that can be stored in a Float64Array, which means MathWorkers can handle that communication fairly well.
I'll try digging into the code of Synaptic a bit more over the next week or so and see how/where/if MathWorkers might be able to help. Let me know if you have more questions, too. I'm totally open to adding or changing features to MathWorkers. It's been a while since I've touched that code, and I've been hoping to get back into it with a real application in mind!
Sounds promising. I guess it wouldn't be that hard to write a new type of layer (ParallelLayer or something like that) with the same interface as Layer, using MathWorkers Vectors instead of an array of Neurons, but using the same algorithm as the activate and propagate methods. Then tweak the Network class a little bit to handle this kind of layer asynchronously. It's doable.
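A skeleton might look like this (a sketch only; the actual parallel dispatch, e.g. via MathWorkers, is faked with Promise.resolve here, and the names are made up):

// hypothetical ParallelLayer with the same interface as Layer, but async
function ParallelLayer(size) {
  this.size = size;
  this.weights = []; // per-neuron weight arrays
  this.biases = [];  // per-neuron biases
}

// same math as Layer.activate, but returns a Promise so workers can slot in
ParallelLayer.prototype.activate = function (input) {
  var layer = this;
  return Promise.resolve().then(function () {
    return layer.weights.map(function (w, i) {
      var sum = layer.biases[i] || 0;
      for (var j = 0; j < input.length; j++) sum += input[j] * (w[j] || 0);
      return 1 / (1 + Math.exp(-sum)); // logistic squash
    });
  });
};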
But how do we pass the neuron propagate functions through?
Not sure if it's applicable here, but maybe some inspiration can be taken from https://github.com/totemstech/neuraln as it supports multi-threaded training. It is written in C++, though.
Any progress on that? Maybe I can help? I wanted to use NeuralN since it's written in C++, but it seems like it isn't maintained anymore.
I use parallel.js myself a lot, but I think MathWorkers might work well too.
I wouldn't recommend using WebCL for this kind of task, as the installation process is pretty involved compared to installing one node.js module (or depending on it). If you need any help, let me know. I will use your module in a current project which will have about 50 input neurons and 3 output neurons (hidden layers and neurons not defined yet), and the sample size will be about 18,000,000, so parallelization would be awesome.
+1 for the idea of introducing a new Layer instead of rewriting the old sync Layer. I guess this would also help performance, since checking whether parallelization is wanted would not be needed on each iteration.
I've been working for the past few weeks on investigating the state of the art for doing GPU computations in a browser (in a way that's accessible to most, without requiring complex WebCL installs). I'm thinking about pulling together all the pieces on this topic into a standalone library for accelerating deep learning and other machine learning applications in the browser, starting with a basic GEMM function.
My hope is to provide a base for libraries like this one to build on, and (hopefully) get performance gains that approach those found in Caffe.
Do people think this could be useful?
Regarding the previous comment, and the earlier comment on WebCL: note that running OpenCL in a browser environment is not exactly straightforward. If you are really interested in the nitty-gritty details, I suggest checking out Intel's River Trail work, which they ended up turning into a Firefox extension to work around some of the issues. It's a really interesting read, because they're basically compiling a subset of JavaScript to OpenCL dynamically, by providing high-level map/reduce and filter equivalents: http://composition.al/blog/2015/02/24/to-opencl-from-javascript-via-js-ctypes-or-how-we-rewrote-the-river-trail-firefox-extension/
It seems that one of the most portable ways to use OpenCL from JavaScript is arrayfire-js: https://github.com/arrayfire/arrayfire-js
For this to be adopted, all array handling in synaptic would need to be moved to a single helper function/module, so that (if available) a different back-end can be used for all computations.
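Something like a minimal pluggable backend, perhaps (a sketch; the names are made up):

// default pure-JS backend; an accelerated build could swap in an
// arrayfire-js implementation behind the same interface
var backend = {
  alloc: function (n) { return new Float64Array(n); },
  dot: function (a, b) {
    var sum = 0;
    for (var i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
};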
I think Synaptic is more of a technology demonstration than a would-be multi-threaded power tool. Instead of bending the limitations of JavaScript and Node around it, I would suggest rewriting this wonderful project in C. The Intel C compiler (free for students) is amazing at parallelizing loops without manual touches. Pure C code is also easier to convert to CUDA.
You may want to take a look at the links I posted: note that arrayfire will help you do just that, and it does in fact support more than just pure C kernels, i.e. using OpenCL (GPUs/FPGAs), including CUDA specifically. All the performance-critical code would be JIT-compiled into the corresponding kernels.
Frankly, compared to adopting an OpenCL-backend like Arrayfire, rewriting the whole thing from scratch in C would be a huge step backwards.
The kind of parallelization you are talking about is OpenMP stuff (i.e. SIMD), which does not support back-ends other than CPUs (GPUs, FPGAs), whereas arrayfire is just another (optional) dependency that lets array handling and vectorization be handled by machine code where feasible.
I'm also looking into libraries that are like putting JavaScript on steroids.
From my understanding, the data structure in synaptic.js is more like a linked-list implementation. If we re-model everything into matrices, it may lose flexibility in configuring the architecture.
Note that CUDA is NVIDIA-specific (and specific to using node, i.e. it basically excludes the browser scenario), while GPU access via WebGL is unlikely to be what you want in a standalone node setup.
There are three kinds of implementable concurrency: the node.js one (where you can just call C++ code or CUDA), Web Worker concurrency, which does not freeze the UI, and the WebCL one, which is nearly as fast as CUDA.
We should probably implement all of them eventually.