cazala / synaptic

architecture-free neural network library for node.js and the browser
http://caza.la/synaptic

Set this up to be parallelized #12

Open shamoons opened 9 years ago

shamoons commented 9 years ago

This is more of a feature request than anything, but perhaps this can be set up to do parallel processing?

cazala commented 9 years ago

It is possible: neurons in the same layer are independent of each other, so they could be computed in a parallel fashion (WebCL maybe?), but I haven't found the time yet to explore this approach. This could increase the speed dramatically, since layers are currently computed sequentially, so computing a 1024-neuron layer takes 1024 iterations...

shamoons commented 9 years ago

I was thinking of the cluster model. I'm happy to work on this and submit a PR. Can you perhaps guide me on where to get started?

shamoons commented 9 years ago

What about something like https://github.com/adambom/parallel.js

cazala commented 9 years ago

Yes, that would be great (:

Well, the parts of the network that can be parallelized, as I said before, are the layers. The network is organized as a set of layers that are computed in a fixed order. The output of one layer becomes the input of the next one (not always, though; depending on the architecture it can get more complicated, like in recurrent NNs or LSTMs), so you need to fully compute the previous layers before starting to compute the current one. But the neurons within a layer can be activated/computed in any order; they are independent of each other, so this is the part of the work that could be split among different workers/CPUs, the GPU, or any model that you may want to implement. You could start by checking the source code of the Layer class.

I don't know if the overhead of spawning a worker would take longer than the computation of the neuron itself. I think the only way parallelizing this would make it faster is if hundreds or thousands of neurons get computed at the same time, which would be possible using the GPU, since the computation of a single neuron is actually very simple (just some sums and a few multiplications). This would also make it possible to have huge layers (right now layers are limited to a couple dozen neurons before getting too laggy).
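
Roughly, activating a single neuron boils down to something like this (a simplified sketch; the real code also handles gates, traces and self-connections):

function activateNeuron(inputs, weights, bias) {
  var state = bias; // plus self-connection and gating terms in the general case
  for (var i = 0; i < inputs.length; i++)
    state += inputs[i] * weights[i]; // a few multiplications and sums
  return 1 / (1 + Math.exp(-state)); // squashing function (sigmoid)
}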

If you plan to give this a try and have any questions let me know and I'd be happy to help you out if I can.

shamoons commented 9 years ago

At the propagate function: https://github.com/cazala/synaptic/blob/master/src/layer.js#L46

I am thinking: instead of a for loop to iterate, what about splitting it up into an array of ids, then using the parallel.js map to run it across multiple cores? This way we wouldn't spin up a new worker for each calculation, but rather a batch of calculations per core. What do you think?
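
Something along these lines, as a rough sketch (activateNeuronById is a hypothetical worker-side helper; parallel.js serializes the mapped callback, so closures from this scope aren't available inside it):

var Parallel = require('paralleljs');

var numCores = require('os').cpus().length;
var ids = neuronIds; // assumed: array of neuron ids for one layer
var batchSize = Math.ceil(ids.length / numCores);
var batches = [];
for (var i = 0; i < numCores; i++)
  batches.push(ids.slice(i * batchSize, (i + 1) * batchSize));

new Parallel(batches)
  .map(function (batch) {
    // one worker per batch instead of one per neuron
    return batch.map(function (id) {
      return activateNeuronById(id); // hypothetical, defined in the worker
    });
  })
  .then(function (results) {
    // results[i] holds the outputs for batch i
  });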

cazala commented 9 years ago

Yes, you could replace the for loops in the propagate and activate functions, but then, in order to get that code executed, you would need to set the network to be unoptimized (myNetwork.setOptimize(false)), and then the network becomes several times slower...

Check this part of the wiki.

You will note that if you print on the console the activate or propagate functions of a network before and after activating the network for the first time, the output will look like this:

Before the first activation: console.log(myNetwork.activate);

function (t){if(this.optimized===!1){this.layers.input.activate(t);for(var e in this.layers.hidden)this.layers.hidden[e].activate();return this.layers.output.activate()}return null==this.optimized&&this.optimize(),this.optimized.activate(t)}

After the first activation: myNetwork.activate([0,0]); console.log(myNetwork.activate);

function (input){
  F[1] = input[0];
  F[2] = input[1];
  F[4] = F[5];
  F[5] = F[6];
  F[5] += F[1] * F[7];
  F[5] += F[2] * F[8];
  F[3] = (1 / (1 + Math.exp(-F[5])));
  F[9] = F[3] * (1 - F[3]);
  F[10] = F[1];
  F[11] = F[2];
  F[16] = F[17];
  F[17] = F[18];
  F[17] += F[1] * F[19];
  F[17] += F[2] * F[20];
  F[15] = (1 / (1 + Math.exp(-F[17])));
  F[21] = F[15] * (1 - F[15]);
  F[22] = F[1];
  F[23] = F[2];
  F[27] = F[28];
  F[28] = F[29];
  F[28] += F[1] * F[30];
  F[28] += F[2] * F[31];
  F[26] = (1 / (1 + Math.exp(-F[28])));
  F[32] = F[26] * (1 - F[26]);
  F[33] = F[1];
  F[34] = F[2];
  F[38] = F[39];
  F[39] = F[40];
  F[39] += F[3] * F[13];
  F[39] += F[15] * F[25];
  F[39] += F[26] * F[36];
  F[37] = (1 / (1 + Math.exp(-F[39])));
  F[41] = F[37] * (1 - F[37]);
  F[42] = F[3];
  F[43] = F[15];
  F[44] = F[26];
  var output = [];
  output[0] = F[37];
  return output;
}

This optimized code is built using Neuron.optimize(): the Network object collects this code from all its neurons and concatenates it together to make the final optimized code you see above. So in order to parallelize this, I guess you will need to modify the Network.optimize() method to split the optimized code of the neurons among workers, instead of just concatenating it all together.

shamoons commented 9 years ago

I don't know how this optimization works. Can you explain the rationale behind it?

cazala commented 9 years ago

@shamoons: Neuron.optimize() hardcodes the behaviour of the specific neuron into the minimum number of operations. For example, this is how the state of a neuron is computed at each step:

this.state = this.selfconnection.gain * this.selfconnection.weight * this.state + this.bias;

If a connection is not gated, the gain of that connection is always 1, and if a neuron is not self-connected, its self-connection weight is always 0, so for most neurons we'd be doing:

this.state = 1 * 0 * this.state + this.bias;

which could be translated just as:

this.state = this.bias;

That kind of situation happens in many parts of the code, especially where connections are not gated and/or neurons are not self-connected (most of the cases).

And then Network.optimize() takes the optimized chunks of code from all the neurons in the network, concatenates them all together into one big single function (without for loops or function calls inside), and replaces all the variables with references into a Float32Array called F (that's why you see F[#] all over the code) to make variable access and computation faster (it's always quicker to access a certain index in a typed array like Float32Array than to access this.some.nested.property[index]).

This feature makes the network several hundred times faster; you can test it by creating a network, cloning it with Network.clone() and then calling Network.setOptimize(false) on the clone.
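
For example, a quick way to see the difference (a sketch; absolute timings will vary by machine):

var Architect = require('synaptic').Architect;

var net = new Architect.Perceptron(2, 3, 1);
var clone = net.clone();
clone.setOptimize(false); // the clone runs the generic, unoptimized code

console.time('optimized');
for (var i = 0; i < 10000; i++) net.activate([0, 1]);
console.timeEnd('optimized');

console.time('unoptimized');
for (var j = 0; j < 10000; j++) clone.activate([0, 1]);
console.timeEnd('unoptimized');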

shamoons commented 9 years ago

@cazala: If the optimized function is what's running at each step, can it be made parallel somehow?

cazala commented 9 years ago

@shamoons: Yes, I think the best approach would be to modify Network.optimize() to separate the hardcoded code by layer, instead of concatenating it all together.

Let's say you create a simple perceptron: var net = new Architect.Perceptron(2,3,1);

The neurons within the same layer can be computed independently, so the 3 units in the hidden layer can be parallelized. After optimization, the activation function of the perceptron described above would look like this:

(screenshot of the optimized activation function, similar to the dump shown earlier in this thread)

You can see at the beginning neurons 1 and 2 (they are the neurons from the input layer, so their activation value comes from the environment; that's why their value is not computed but rather comes from the arguments of the function), neurons 3, 4, and 5 are in the hidden layer, and 6 is the only neuron in the output layer.

The computation of neurons 3, 4, and 5 could be parallelized, since their output only depends on the values of the input layer. After computing the whole hidden layer (3 neurons) we could do the same for the output layer (in this case there's only 1 neuron, but if there were more they could be computed in parallel).

On a simple perceptron like this one it might not improve performance so drastically, but if the network has hundreds or thousands of neurons per layer and multiple hidden layers, it would make a difference. At some point I was planning to implement a model of a parallel layer that would execute on the GPU and process 1024 neurons all at once, but I never got the time to actually play around with WebCL.

shamoons commented 9 years ago

I'm really digging parallel.js - it seems like a pretty easy way to take advantage of multiple cores. Eventually, GPUs would be nice, but I think taking full advantage of multicore CPUs is a good start.

If you could guide me a bit, I'd be happy to take a crack at it.

cazala commented 9 years ago

@shamoons okay, I'll try to modify the code of Network.optimize() and Neuron.optimize() a little bit to provide the chunks of code separated by layers and neurons, instead of all together in one huge array.

Right now Network.optimize() gets something like this:

sentences = [
  "F[1] = input[0];",
  "F[2] = input[1];",
  "F[4] = F[5];",
  ...
]

I could organize it by layers, and within each layer, by neurons:

sentences = {
  0: {
    0: ["F[1] = input[0];", "F[2] = input[1];", "F[4] = F[5];"],
    1: [ ... ],
  },
  ...
}

This way, you know which chunks of code can be executed in parallel. The layers have to be executed in order, but the neurons within each layer can be executed in any order, or in parallel. Say, you could execute sentences[0][0], sentences[0][1], sentences[0][2], sentences[1][0]

or sentences[0][2], sentences[0][0], sentences[0][1], sentences[1][0]

but you can't do sentences[0][0], sentences[1][0], sentences[0][1], sentences[0][2]
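
In code, the constraint looks something like this (a sketch; runInWorker is a hypothetical helper that evaluates one neuron's chunk of sentences on a worker):

function activateInOrder(sentences, runInWorker) {
  var chain = Promise.resolve();
  Object.keys(sentences).forEach(function (layer) {
    chain = chain.then(function () {
      // every neuron of this layer may run concurrently...
      return Promise.all(Object.keys(sentences[layer]).map(function (neuron) {
        return runInWorker(sentences[layer][neuron].join(' '));
      }));
    }); // ...but the next layer only starts once they are all done
  });
  return chain;
}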

shamoons commented 9 years ago

@cazala that would be great. Then I can split the layers up across multiple cores. Presumably there will be a lot of neurons per layer (depending on your network), so we can batch up neurons / numCPUs per core.

cazala commented 9 years ago

@shamoons I just pushed an update to master. Now inside Network.optimize() you have access to three arrays containing all the hardcode: activation_sentences, trace_sentences and propagation_sentences. Those arrays are multidimensional and they contain all the hardcoded sentences organized by the following indexes: array[layer][neuron]. So, for example, to get all the activation sentences for the first neuron of the third layer you would do activation_sentences[2][0], and you would get an array with all the sentences as strings.

To build the activation function of a neuron you have to concatenate all its activation sentences followed by all its trace sentences. Here is an example of how the activation function of the whole network, Network.activate(), is built (all the neurons concatenated):

for (var currentLayer in optimized.activation_sentences) {
  if (optimized.activation_sentences[currentLayer].length > 0) {
    for (var currentNeuron in optimized.activation_sentences[currentLayer]) {
      hardcode += optimized.activation_sentences[currentLayer][currentNeuron].join(" ");
      hardcode += optimized.trace_sentences[currentLayer][currentNeuron].join(" ");
    }
  }
}

Instead of concatenating them all together into one function, you could create separate ones and compute them in parallel.

In order to build the propagation function Network.propagate() you just have to concatenate all the propagation sentences, like:

for (var currentLayer in optimized.propagation_sentences)
  for (var currentNeuron in optimized.propagation_sentences[currentLayer])
    hardcode += optimized.propagation_sentences[currentLayer][currentNeuron].join(" ") + " ";

You can check this part of the source code to see how it works, from line 138 to line 172.

shamoons commented 9 years ago

One other issue is that we would need to make it either callback- or promise-based, because currently it is synchronous and callers expect a return value.

I will start to work on some changes and push my changes to https://github.com/shamoons/synaptic/tree/parallel/multi-core-processing

shamoons commented 9 years ago

Basically, we should have as many functions as layers, is that correct? So instead of var constructor = new Function(hardcode); which puts it all together, we should have a new Function per layer that runs the activations and propagations for that particular layer.
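
For the activation part, something like this (a sketch; it reuses the loop from above and assumes the per-layer functions share the same memory array F):

var layerActivations = [];
for (var currentLayer in optimized.activation_sentences) {
  var layerCode = '';
  for (var currentNeuron in optimized.activation_sentences[currentLayer]) {
    layerCode += optimized.activation_sentences[currentLayer][currentNeuron].join(' ');
    layerCode += optimized.trace_sentences[currentLayer][currentNeuron].join(' ');
  }
  // one function per layer instead of one function for the whole network
  layerActivations.push(new Function('F', layerCode));
}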

shamoons commented 9 years ago

@cazala this may not be possible: https://github.com/adambom/parallel.js/issues/97#issuecomment-80540243

cazala commented 9 years ago

@shamoons mmh, this is probably really bad from a performance standpoint so it's probably not a solution, but if you can't send the Function object you could still send it as a string, right? Like:

var a = function(){ console.log('hi'); } // the function
var b = a.toString().split('{')[1].split('}')[0]; // pass this string to the worker
var c = new Function(b); // recreate the function within the worker
c(); // hi

Just saying; doing that probably takes more time than what we save with the parallel execution.

shamoons commented 9 years ago

@cazala interesting strategy. Perhaps, instead of defining the function once (https://github.com/cazala/synaptic/blob/master/src/network.js#L174), the string can be passed to the workers and they can create their own functions? This way we don't have to decompose the function and then recompose it.

cazala commented 9 years ago

@shamoons yes, actually that makes a lot more sense, since we already have all the sentences as arrays of strings; they just need to be split and distributed among their respective workers, which can then build and store their own function(s).
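
On the worker side that could look something like this (a sketch using the plain Web Worker API; the message format here is made up):

var activateChunk = null;
onmessage = function (e) {
  if (e.data.sentences) {
    // build this worker's chunk of the optimized code a single time
    activateChunk = new Function('F', e.data.sentences.join(' '));
  } else if (activateChunk) {
    activateChunk(e.data.F); // run it against the memory snapshot
    postMessage(e.data.F); // send the updated values back
  }
};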

shamoons commented 9 years ago

@cazala how about something like this?

var hardcode_layers = {};
var hardcode = "";
hardcode += "var F = Float64Array ? new Float64Array(" + optimized.memory + ") : []; ";
for (var i in optimized.variables)
  hardcode += "F[" + optimized.variables[i].id + "] = " + (optimized.variables[i].value || 0) + "; ";
hardcode += "var activate = function(input){\n";
for (var i in optimized.inputs)
  hardcode += "F[" + optimized.inputs[i] + "] = input[" + i + "]; ";
for (var currentLayer in optimized.activation_sentences) {
  // each layer's code starts from the shared preamble built above
  hardcode_layers[currentLayer] = hardcode;
  if (optimized.activation_sentences[currentLayer].length > 0) {
    for (var currentNeuron in optimized.activation_sentences[currentLayer]) {
      hardcode_layers[currentLayer] += optimized.activation_sentences[currentLayer][currentNeuron].join(" ");
      hardcode_layers[currentLayer] += optimized.trace_sentences[currentLayer][currentNeuron].join(" ");
    }
  }
}

cazala commented 9 years ago

@shamoons that looks about right, but there you are only building the activation function; you will still need to hardcode an individual propagation function for each layer as well.

shamoons commented 9 years ago

I will do the unoptimized version first. So I think that happens in layer.js: https://github.com/cazala/synaptic/blob/master/src/layer.js#L46 and https://github.com/cazala/synaptic/blob/master/src/layer.js#L22

Is that right?

cazala commented 9 years ago

yes that's right

shamoons commented 9 years ago

@cazala you can see my work so far at https://github.com/shamoons/synaptic/tree/multi/worker-architecture

I am introducing Promises, specifically via bluebird. Any objections or thoughts on that? Or would you rather it be callback-based?

shamoons commented 9 years ago

Okay - another issue I'm having. I managed to get the function into a worker: https://github.com/shamoons/synaptic/blob/multi/worker-architecture/src/layer.js#L74. The issue is passing the Neuron itself to the worker. Since a Neuron has self-connections, if I simply JSON.stringify it, I get circular reference errors. Any thoughts on alleviating this?

cazala commented 9 years ago

+1 for bluebird and promise-based. For the circular-JSON issue, there are workarounds, but I've never used them and I don't know an easy way to use them from within the Worker, other than including the source code into the Worker scope itself...
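
The usual workaround looks something like this (a generic sketch, not specific to synaptic; note it also drops repeated non-circular references):

function safeStringify(obj) {
  var seen = [];
  return JSON.stringify(obj, function (key, value) {
    if (typeof value === 'object' && value !== null) {
      if (seen.indexOf(value) !== -1) return undefined; // break the cycle
      seen.push(value);
    }
    return value;
  });
}

var payload = safeStringify(neuron); // now safe to post to the worker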

shamoons commented 9 years ago

@cazala hmmmm - the circular-JSON issue still isn't really resolved with that package. I really think that multi-threading would make this so much better. I keep running into problems!

One possibility is to require a file (http://adambom.github.io/parallel.js/) or some business with eval. I'll keep searching around for a good solution.

awlange commented 9 years ago

Hey guys, I briefly scanned this thread and thought I might suggest trying out a library I've been working on for parallelizing JS calculations like this: https://github.com/awlange/mathworkers

I've got it working for both node.js cluster and HTML5 WebWorkers. And I've been able to confirm it does speed up some common linear algebra operations. Let me know what you think. There's a chance it could be useful. Maybe, maybe not. Just thought I'd suggest it.

shamoons commented 9 years ago

@cazala looks like we can use this to create the optimized version. @awlange has done some awesome work, it seems! Question for you, @awlange: what sort of data can be passed to a worker?

awlange commented 9 years ago

Thanks for taking a look at my project!

You can send any JSON-serializable data between the coordinator and its workers via the sendDataToWorkers() and sendDataToCoordinator() functions (see the JSDocs http://www.mathworkersjs.org/static/doc/index.html for the Coordinator and MathWorker classes). Those methods can be used for passing parameters, small objects, and such, but I don't recommend passing large arrays of numbers that way: JSON serialization costs can kill the performance gains of the parallel computation. So, for passing numerical data, I recommend adapting the data to use the Vector and/or Matrix classes, which ultimately just wrap a Float64Array object. With the Float64Array, MathWorkers uses WebWorkers context switching for fast communication, avoiding JSON serialization. And for node.js, I convert the ArrayBuffer into a string and pass the string, which, even though it might not seem like it, is much faster than JSON serialization, too.

It seems like most of the data you would want Synaptic to compute in parallel is indeed data that can be stored in a Float64Array, which means MathWorkers can handle that communication fairly well.

I'll try digging into the code of Synaptic a bit more over the next week or so and see how/where/if MathWorkers might be able to help. Let me know if you have more questions, too. I'm totally open to adding or changing features to MathWorkers. It's been a while since I've touched that code, and I've been hoping to get back into it with a real application in mind!

cazala commented 9 years ago

Sounds promising. I guess it wouldn't be that hard to write a new type of layer (ParallelLayer or something like that) with the same interface as Layer, using MathWorkers Vectors instead of an array of Neurons, but using the same algorithm as the activate and propagate methods. Then tweak the Network class a little bit to handle this kind of layer asynchronously. It's doable.
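
As a skeleton it could look like this (a sketch only; the MathWorkers wiring is left out, and the distribute* helpers are hypothetical):

function ParallelLayer(size) {
  this.size = size;
  // weights would live in MathWorkers Vectors/Matrices instead of Neuron objects
}

ParallelLayer.prototype.activate = function (input) {
  // returns a promise instead of a value, since the math happens on workers
  return distributeActivation(this, input);
};

ParallelLayer.prototype.propagate = function (rate, target) {
  return distributePropagation(this, rate, target);
};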

shamoons commented 9 years ago

But how do we pass the neuron propagate functions through?

JohnnySheffield commented 9 years ago

Not sure if it's applicable here, but maybe some inspiration can be taken from https://github.com/totemstech/neuraln as it supports multi-threaded training. It is written in C++, though.

ghost commented 9 years ago

Any progress on this? Maybe I can help? I wanted to use NeuralN since it's written in C++, but it seems like it isn't maintained anymore.

I use parallel.js myself a lot, but I think Mathworkers might work well, too.

I wouldn't recommend using WebCL for this kind of task, as the installation process is pretty involved compared to installing one node.js module (or depending on it). If you need any help, let me know; I will use your module in a current project which will have about 50 input neurons and 3 output neurons (hidden layers and neurons not defined yet), and the sample size will be about 18,000,000. So parallelization would be awesome.

+1 for the idea of introducing a new Layer instead of rewriting the old sync Layer. I guess this would also help performance, since checking whether parallelization is desired wouldn't be needed on each iteration.

waylonflinn commented 8 years ago

I've been working for the past few weeks on investigating the state of the art for doing GPU computations in a browser (in a way that's accessible to most, without requiring complex WebCL installs). I'm thinking about pulling together all the pieces on this topic into a standalone library for accelerating deep learning and other machine learning applications in the browser, starting with a basic GEMM function.

My hope is to provide a base for libraries like this one to build on, and (hopefully) get performance gains that approach those found in Caffe.

Do people think this could be useful?

UniqueFool commented 8 years ago

Regarding the previous comment, and the earlier comment on WebCL: note that running OpenCL in a browser environment is not exactly straightforward. If you are really interested in the nitty-gritty details, I suggest checking out Intel's River Trail work, which they ended up turning into a Firefox extension to work around some of the issues. It's a really interesting read, because they're basically compiling a subset of JavaScript to OpenCL dynamically, by providing high-level map/reduce and filter equivalents: http://composition.al/blog/2015/02/24/to-opencl-from-javascript-via-js-ctypes-or-how-we-rewrote-the-river-trail-firefox-extension/

UniqueFool commented 8 years ago

It seems that one of the most portable ways to use OpenCL from JavaScript is arrayfire-js: https://github.com/arrayfire/arrayfire-js

For this to be adopted, all array handling in synaptic would need to be moved to a single helper function/module, so that (if available) a different back-end could be used for all computations.

ghost commented 8 years ago

I think Synaptic is more a technology demonstration than a would-be multi-threaded power tool. Instead of bending around the limitations of JavaScript and Node, I would suggest rewriting this wonderful project in C. The Intel C compiler (free for students) is amazing at parallelizing loops without manual touches. Pure C code is also easier to convert to CUDA.

UniqueFool commented 8 years ago

You may want to take a look at the links I posted: note that arrayfire will help you do just that, and it does in fact support more than just pure C kernels, i.e. using OpenCL (GPUs/FPGAs), including CUDA specifically. All the performance-critical code would be JIT-compiled into the corresponding kernels.

Frankly, compared to adopting an OpenCL back-end like arrayfire, rewriting the whole thing from scratch in C would be a huge step backwards.

The kind of parallelization you are talking about is OpenMP stuff (i.e. SIMD), which does not support back-ends other than CPUs (no GPUs or FPGAs), whereas arrayfire is just another (optional) dependency that lets array handling and vectorization be handled by machine code where feasible.

andymakhk commented 8 years ago

I'm also looking into these libraries, which are like putting JavaScript on steroids:

From my understanding, the data structure in synaptic.js is more like a linked-list implementation. If we re-model everything into matrices, it may lose flexibility in configuring the architecture.
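
For a plain fully-connected layer the matrix view would look something like this (a sketch; gated connections or self-connections wouldn't map onto this shape as naturally):

// one matrix-vector product replaces the per-neuron object traversal
// (W is a layerSize x inputSize weight matrix, flattened row by row)
function activateLayer(W, bias, input, layerSize) {
  var output = new Float64Array(layerSize);
  for (var i = 0; i < layerSize; i++) {
    var state = bias[i];
    for (var j = 0; j < input.length; j++)
      state += W[i * input.length + j] * input[j];
    output[i] = 1 / (1 + Math.exp(-state)); // sigmoid squash
  }
  return output;
}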

UniqueFool commented 8 years ago

Note that CUDA is nvidia-specific (and specific to using node, i.e. it basically excludes the browser scenario), while GPU access via WebGL is unlikely to be what you want in a standalone node setup.

Jabher commented 8 years ago

There are 3 kinds of implementable concurrency: the node.js kind (where you can just call C++ code or CUDA), Web Worker concurrency, which avoids freezing the UI, and the WebCL kind, which is nearly as fast as CUDA.

We should probably implement all of them at some point.