pavanky opened this issue 9 years ago
@pavanky : It doesn't necessarily have to be a classifier, though. Neural networks can easily do regression-type problems as well. The only difference would be the loss function used, i.e. L2 loss (regression) vs. cross entropy + softmax (classification).
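For concreteness, here is a minimal sketch of the two losses over ArrayFire arrays; the function names and the samples-as-rows layout are assumptions, not part of any proposal:

#include <arrayfire.h>

// mean squared error, the usual choice for regression
float l2_loss(const af::array& pred, const af::array& target) {
    af::array diff = pred - target;
    return af::sum<float>(diff * diff) / pred.elements();
}

// softmax + cross entropy, the usual choice for classification
float cross_entropy(const af::array& logits, const af::array& targets) {
    // shift by the row max for numerical stability, then softmax over dim 1
    af::array e = af::exp(logits - af::tile(af::max(logits, 1), 1, logits.dims(1)));
    af::array p = e / af::tile(af::sum(e, 1), 1, e.dims(1));
    return -af::sum<float>(targets * af::log(p)) / logits.dims(0);
}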
I think the following would be helpful from an API standpoint:
struct Model {
    int add(Layer layer_type);
    int compile(Optimizer opt, Loss loss, int max_iter = 200, bool early_stop = true);
    // Batch training up to max_iter, or until early stopping triggers if enabled.
    // Accepts either explicit cross-validation data or a ratio for splitting the given data.
    float fit(DataSet train_data, DataSet target_data,
              std::pair<DataSet, DataSet>* validation = nullptr,
              float validation_split = 0.0f);
    float train(DataSet train_data, DataSet target_data); // single step for online methods (can be called from fit)
    DataSet predict(DataSet test_data); // for evaluating new data
};
All the ints in the above are return codes.
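For illustration, hypothetical usage might look like the following; Dense, Adam and MSE are placeholder names, not part of the proposal:

Model m;
m.add(Dense(128));
m.add(Dense(1));
m.compile(Adam(), MSE(), /* max_iter = */ 200, /* early_stop = */ true);
float val_err = m.fit(x_train, y_train, nullptr, /* validation_split = */ 0.2f);
DataSet preds = m.predict(x_test);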
This will give maximum flexibility.
The Layer class, as you mentioned, should do the following:
struct Layer {
    int connect(Layer prev); // connect to the previous layer in a deep net
    DataSet derivative(DataSet x);
    DataSet forwardPass(DataSet x);
    DataSet input();   // merely returns the input that was last used (or the output of the previous layer)
    Weights weights(); // returns just the weights
    Bias bias();       // returns the bias (or stack of biases) if any (otherwise nullptr, maybe?)
    std::map<std::string, std::string> conf(); // getter returning the layer's own configuration
};
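A hypothetical usage sketch; LinearLayer is an assumed concrete Layer, not part of the interface above:

LinearLayer l1(784, 128), l2(128, 10);
l2.connect(l1);                // l2 consumes l1's output
DataSet h = l1.forwardPass(x);
DataSet y = l2.forwardPass(h);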
According to much recent research, the most powerful neural networks are no longer stacked layers but rather arbitrarily complex graphs, e.g. a variety of advanced recursive neural networks, Facebook AI Research's Memory Networks, Google DeepMind's Neural Turing Machine, etc. So node is a more general name than layer.
connect can't reflect the direction of connections between the nodes. Except for the nodes that connect to the input data, each node's input is the output of its predecessor nodes. Caffe uses Google Protocol Buffers to define and serialize the network. Apache Thrift, which was open sourced by Facebook, supports many more languages.
The following API is inspired by Caffe's Blob, Layer and Net.
typedef shared_ptr<array> ArrayPtr;
class Data {
public:
    explicit Data(const vector<int>& size);
    int nDimension() const;
    vector<int> size() const;
    // Caffe exposes the raw CPU/GPU pointers for use in BLAS functions.
    // array already has a high-level API, so there's no need to do so.
    ArrayPtr weights() const;
    ArrayPtr gradients() const;
};
typedef shared_ptr<Data> DataPtr;
typedef vector<DataPtr> DataPtrVec;
class Node {
public:
    explicit Node(NodeParam& nodeParam);
    virtual ~Node();

    // Calls initNode, which subclasses can override
    void init();

    // Input and output are more general than Caffe's top and bottom
    virtual void forward(const DataPtrVec& input,
                         const DataPtrVec& output);

    // propagate_back is more general than Caffe's propagate_down
    virtual void backward(const DataPtrVec& input,
                          const vector<bool>& propagate_back,
                          const DataPtrVec& output);

    // The model is a DAG (directed acyclic graph);
    // it's more intuitive for the predecessor to add the successor
    void addSuccessor(Node& node);
    void addSuccessors(vector<Node>& nodes);

protected:
    virtual void initNode();
};
// Dtype is float or double
template <typename Dtype>
class Graph {
public:
    explicit Graph(GraphParam& graphParam);
    virtual ~Graph();

    virtual void forward(const DataPtrVec& inputs, DataPtrVec* outputs,
                         Dtype* loss = NULL);

    /**
     * (Caffe) The network backward should take no input and output, since it solely
     * computes the gradient w.r.t. the parameters, and the data has already been
     * provided during the forward pass.
     */
    virtual void backward();

    Dtype forwardBackward(const DataPtrVec& inputs) {
        Dtype loss;
        DataPtrVec outputs;
        forward(inputs, &outputs, &loss);
        backward();
        return loss;
    }
};
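A minimal training-loop sketch against this API; graphParam, batchInputs, max_iter and the external solver step are assumptions:

Graph<float> net(graphParam);
for (int iter = 0; iter < max_iter; ++iter) {
    // gradients are left in the Data blobs after the backward pass
    float loss = net.forwardBackward(batchInputs);
    // an external solver would consume Data::gradients() to update the weights here
}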
Microsoft Research's paper "Computational Networks: A Generalization of Deep Learning Models" presented its open source deep learning framework CNTK.
I like the generalizations. A few notes:
void addSuccessor(Node& node);
seems redundant with:
void addSuccessors(vector<Node>& nodes);
since a node can have subnodes as well, right? Why not just have a ptr to the next and previous nodes: a basic doubly-linked list.
One very important item that your schema is missing is some form of accuracy measurement. 99% of the time you will need some form of data splitting and verification against a cross-validation data set. It would be a good idea to incorporate this right off the bat.
@jramapuram
If I am reading correctly, the following is for cases when a single node is connected to multiple successors (like the CN with shared params diagram).
void addSuccessors(vector<Node>& nodes);
I am not sure a linked list is the solution here.
@futurely Thanks for the great feedback! The proposed API looks solid. However, the one issue I see is that going the Node / Graph route means that the amount of parallelism will be decreased in networks that use more traditional layers. I wonder if we can specialize a Layer class as well that sits on top of Node to achieve this.
@futurely Sorry for jumping the gun. I re-read the entire discussion again. It looks like the proposed Node is just a generalized Layer.
@futurely @jramapuram Would it be possible to continue the discussion over here: https://gitter.im/arrayfire/arrayfire_ml ?
Suggestions from @alcinos: instead of storing successors in each Node, it may be better to have an adjacency list in the Network class. Here is a proposition including the modifications: https://gist.github.com/alcinos/3bedb2f7c4518fa93220
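A minimal sketch of the adjacency-list idea, with hypothetical Network/Node/addEdge names: connectivity is owned by the Network, not the nodes.

#include <map>
#include <vector>

class Node; // carries data/weights only; topology lives in the Network

class Network {
    std::map<Node*, std::vector<Node*>> adjacency; // node -> its successors
public:
    void addEdge(Node* from, Node* to) { adjacency[from].push_back(to); }
    const std::vector<Node*>& successors(Node* n) { return adjacency[n]; }
};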
The Graph or Network class is not needed at all. Here's a simple illustration:
structure node
    [list of nodes] neighbors
    [data]
end

cost(X, Y) := if (X.neighbors contains Y) return X.neighbors[Y]
              else "Not possible"

list nodes;
@futurely The point is that the net structure should be independent from the nodes. The same node can be involved in different topologies depending on the context. For example, in stacked auto encoders, the training is done layer by layer, which requires a different topology to train each level.
@alcinos Can you explain how having an adjacency list helps in the situations you mentioned? I still think a centralized location for the representation is the better option; however, I do not see it solving the problems of greedy layer-by-layer training.
@pavanky Well, let's say we have a 3-layer stacked auto-encoder. I will denote by "O" an output layer, "I" an input layer, "E" an encoding layer and "D" a decoding layer. The first step is to greedily train the first layer. This training is performed on the network: I -> E1 -> D1 -> O. After some number of iterations (or once the reconstruction error goes below a given threshold), we can train the second layer. This time, the network is I -> E1 -> E2 -> D2 -> O (we want the second layer to reconstruct the encoding of the first one). And so on for the other layers. The last step is a fine-tuning of the weights (by gradient descent), performed on the net: I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O
Eventually, depending on the application, it is likely that the interesting part of the trained net is the output of E3 (high-level features of the input). Once trained, we'll thus only use the first part of the net: I -> E1 -> E2 -> E3
The point is that in all those training steps, the architecture of the net is different, hence it makes more sense to store this architecture independently of the nodes. Moreover, several architectures can be used more or less concurrently for the same nodes: for example, we can use the net I -> E1 -> E2 -> E3 as a feature generator for some problem (control, classification, ...), while constantly improving the reconstruction error (of the full net I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O) given new samples that come from experience (the training set of the net is not always fully available from the beginning).
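To make that concrete, reusing the hypothetical Network/addEdge names from the sketch above, the same Node objects can participate in two pre-training topologies:

Node I, E1, D1, E2, D2, O;

Network pretrain1;            // greedy training of the first layer
pretrain1.addEdge(&I,  &E1);
pretrain1.addEdge(&E1, &D1);
pretrain1.addEdge(&D1, &O);

Network pretrain2;            // second layer reconstructs the first encoding
pretrain2.addEdge(&I,  &E1);
pretrain2.addEdge(&E1, &E2);
pretrain2.addEdge(&E2, &D2);
pretrain2.addEdge(&D2, &O);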
I understand what autoencoders are doing. My question was more about implementation. What you are suggesting requires updating the adjacency list after each step or creating a new network after each step. Am I correct in this observation?
Absolutely!
In that case I think we can have a specialized class for auto encoders (and RBMs) that will do the greedy training and return the encoding part of the network to be included into another network on request.
With that said, I think the other representation of the network (each node having list of links) can also be made to work in this situation (may be even a bit more cleanly) if the edges are labeled properly.
In the setting where each node has a list of links, how would you deal with the same node belonging to several nets (simultaneously)?
An inverted multi-leaf tree structure would make sense for this (i.e. nodes becoming inputs to multiple nodes) and would then degenerate to a list. Decoupling a model from the node/layer structure does make sense.
@alcinos Each node can be linked to any number of other nodes. So you begin by adding all the needed links. You then label which networks each link belongs to. When training a particular network, you just traverse the relevant links and remove them when you are done.
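A rough sketch of the labeled-links idea, with all names illustrative; it assumes the graph is a DAG so the traversal terminates:

#include <set>
#include <vector>

struct Node;

struct Link {
    Node* target;
    std::set<int> network_ids; // networks this edge participates in
};

struct Node {
    std::vector<Link> links;
    void traverse(int net_id) {
        for (auto& l : links)
            if (l.network_ids.count(net_id))
                l.target->traverse(net_id); // follow only this network's edges
    }
};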
@pavanky mmh, it seems that it could rapidly become a mess, especially because each entity that wants to use a given node in a net has to resolve label conflicts that may arise on its own (which is not good in terms of encapsulation), and we also have to trust it to properly clean up the links when the net is no longer needed (which is not good either).
Indeed. Just putting the option out there. I think I have most of the information I need to push something out in the next couple of days.
Cool, I'm looking forward to taking a look at it, and contributing if I have the opportunity.
Examples of an RBM implemented with cudamat's high-level math API are as follows. https://github.com/cudamat/cudamat/blob/master/examples/rbm_cudamat.py https://github.com/cudamat/cudamat/blob/master/examples/rbm_numpy.py
@futurely The code you are showing doesn't feature any encapsulation. This is a problem.
Caffe creator's plan on "Improving Caffe: Some Refactoring".
Minerva's DAG- and operator-based implementation.
@futurely @alcinos @jramapuram I pushed some of the basic classes out. I do not have any convolution nodes yet, and there are no network classes yet. I am going to push some sample network classes out later today to demonstrate how the structure can be stored.
The network class will also extend Node so that it can be used as a super node in a larger network.
I understand that this is not much right now, but feedback is welcome.
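As a rough sketch of the super-node idea, building on the Node interface proposed earlier in this thread (the member names are assumptions, not the pushed code):

class Network : public Node {
    vector<shared_ptr<Node>> nodes; // contained nodes, topologically ordered
public:
    void forward(const DataPtrVec& input, const DataPtrVec& output) override {
        // run each contained node in order, wiring outputs to inputs
    }
    void backward(const DataPtrVec& input, const vector<bool>& propagate_back,
                  const DataPtrVec& output) override {
        // reverse pass through the contained nodes
    }
};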
Looks great so far, @pavanky! A few queries/comments:
I will take a deeper look a little later, but good job!
@pavanky @jramapuram I believe forward and backward computations have to be refactored from the base interface too, because I can see no way to support forward phase gradient computations with the current Node class. I mean RTLR for example.
@unbornchikken : Agreed. I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class. That way you can unwind recurrences if need be (or keep a truncated moving average of the weights as in RTRL)
But for calculating RTLR you'll need a totally different kind of network traversal logic other than forward or backward. I meant that if the traversal logic is refactored out of the Node interface into separate classes (ForwardPhase, BackwardPhase, RTLRPhase, etc.), it's easy to define separate ones for every possible algorithm AFML wants to support in the future and after the future.
I think we might be talking about two separate things. Are you talking about Real-Time-Recurrent-Learning: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.9724&rep=rep1&type=pdf ?
@jramapuram Exactly.
Exactly in that it is the same? Or different? I didn't find any mention online of RTLR. RTRL involves just computing the gradients of time n+1 using gradients and activations of time n, in addition to accumulating the gradients from 0 --> T_current. It is fully online. My solution of having virtual forward/backward functions should easily solve this problem.
@jramapuram :) Yeah, I've trouble with acronyms; I often write LSTM as LTSM too. Sorry. Ok, but for calculating gradients you'll need to feed desired outputs and do a special kind of forward iteration of the whole network for each weight:
http://www.willamette.edu/~gorr/classes/cs449/rtrl.html (18)
Or maybe @pavanky doesn't want to support RTRL altogether in favor of LSTM, which is a separate structure other than this.
@unbornchikken : no worries, just wanted to be on the same page :) . So RTRL is an optimization algorithm. It can be applied to LSTMs/GRUs/... (insert your RNN flavor here). If you look at that link (specifically steps #7 & #8), it isn't really anything fancy. You need to keep an accumulator for the error up to the current timestep as well as the gradient (as opposed to BPTT, where things are unfolded all the way from T_current --> T0).
You can ignore a lot of the delta function stuff that is mentioned there. It is, as in the paper, a way to unify the input/output pairs.
@jramapuram What I tried to say is that to support RTRL with your rigid Node abstract class, you gotta define some other methods in it for accepting and holding these forward-propagated p values. If you have some other special algorithm or are trying to prototype one, you gotta modify the Node class for that too, which cannot be done at home; you gotta PR the master there. Am I right? But if the Node class's role changed to only holding and providing values, and those traversal algorithms could be defined in separate classes on top of this basic interface, then you'd have the ability to design the next Realtime Backward-Forward Long Term Populated Short Theory of Everything at Once neural network learning algorithm by using ArrayFire ML. This ability is often missing from ML libraries out there because of these early decisions.
@unbornchikken : Agreed. I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class.
If you look at Linear.hpp, it implements its own forward / backward. The easiest way to solve your problem is to make a RecurrentNode class that inherits from Node. The details as to whether to use RTRL or BPTT can be resolved there.
you gotta modify the Node class for that too, which cannot be done at home
If you are implementing a new algorithm then you will have to do this. This is C++, not Python. Abstracting away forward/backward into its own class doesn't seem like it would gain anything at this point. If anything, it adds another level of indirection, i.e. you will still have to recompile the library.
Now, that being said, it makes sense to have an Optimizer class. However, this is separate from the nodes, as optimizers just take a cost and a gradient and do some form of ascent/descent.
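Something along these lines, where Optimizer, SGD and step are illustrative names rather than a proposed API:

#include <arrayfire.h>

class Optimizer {
public:
    virtual ~Optimizer() {}
    // apply one update given a parameter tensor and its gradient
    virtual void step(af::array& weights, const af::array& gradients) = 0;
};

class SGD : public Optimizer {
    double lr;
public:
    explicit SGD(double lr) : lr(lr) {}
    void step(af::array& weights, const af::array& gradients) override {
        weights -= lr * gradients; // vanilla gradient descent
    }
};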
@jramapuram You're talking about object-oriented design and I'm talking about composite design there. In your case RTRLNode : RecurrentNode will have backward and forward methods that throw an unsupported exception, a forward method that accepts a whole new vector for storing derivatives along with the input, and another forward method propagating p values, declared in the RTRLNode class. In my case Node doesn't have forward and backward methods; it only provides connections, weights, and other weight-related values. Forward, backward, etc. are implemented separately, and an Optimizer component groups them to provide training functionality.
In your case RTRLNode : RecurrentNode will have backward and forward methods that throw an unsupported exception, and a forward method that accepts a whole new vector for storing derivatives
RTRLNode will have two private internal members that merely accumulate the error & derivatives locally. All RTRL is doing is estimating derivatives for t+1 using derivatives for t (this is done on a node-by-node basis). Each unit is a linear combination of the previous layer's activations coupled with a gemv call. The only thing that needs to be passed along is the current update (which is already being done).
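A rough sketch of that idea, assuming the RecurrentNode suggested earlier; the member names and update details are placeholders, not a reference RTRL implementation:

class RTRLNode : public RecurrentNode {
    af::array p;       // running sensitivities dOutput/dWeights at step t
    af::array gradAcc; // gradient accumulated from step 0 to the current step
public:
    void forward(const DataPtrVec& input, const DataPtrVec& output) override {
        // compute activations for step t+1 and advance p alongside them
    }
    void backward(const DataPtrVec& input, const vector<bool>& propagate_back,
                  const DataPtrVec& output) override {
        // fold the current error into gradAcc using the stored p
    }
};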
You're talking about object-oriented design and I'm talking about composite design
In either case, how does your solution help you prototype faster?
Maybe this is not the right issue to discuss the internals of RTRL, but we can agree that you will end up having an RTRLNode class that has to throw a not-supported exception in its backward method, right? That's because your fundamental design hardcodes the ability to serve a single algorithm suite. Btw, I haven't talked about anything that helps pavan prototype faster. I've talked about something that I believe makes the fundamental design more extensible. It was only this: "I believe forward and backward computations have to be refactored from the base interface too" :)
Why is lr > 1 in your example for perceptron? Generally it is << 1.0
The LR was greater than 1 because I knew it was going to converge for that simple test.
Are the thoughts to make LinearLayer : LinearNode, etc.? I like this idea.
Each "Node" here is the equivalent of a "Layer" from caffe and "Module" from torch-nn.
normalize() should be factored out of Node in my opinion (into a Normalize class). There are many types of normalization that one might decide to use, for example.
If you can point to some examples, I will look into this.
I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class. That way you can unwind recurrences if need be (or keep a truncated moving average of the weights as in RTRL)
That is how things are at the moment.
But for calculating RTLR you'll need a totally different kind of network traversal logic other than forward or backward.
Each "Node" can be a simple layer or a composite node like Multi Layer Perceptrons, Recurrent Neural networks, Autoencoders and Restricted Boltzman Machines etc. The composite nodes can be used as is or by plugging them into a larger network. The "forward" and "backward" methods help the composite nodes interface with other nodes in the larger network. For training these networks, the methods used will obviously be different.
That said, none of the API is final and it will change until we can address all the problems. This project is still in the prototyping phase, after all.
P.S. If you want to have prolonged discussions, can you move to the gitter room instead? :-)
const double lr = 10;
@pavanky : This is what I was referring to in perceptron.cpp
That is how things are at the moment.
Yea, noticed that in the later comments. I need to learn to read :)
If you can point to some examples, I will look into this.
Do you mean examples of other normalization strategies? An example would be whitening using SVD.
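A minimal sketch of SVD-based (PCA) whitening as a preprocessing step, assuming X is a centered n x d data matrix; purely illustrative:

#include <arrayfire.h>
#include <cmath>

af::array whiten(const af::array& X) {
    af::array U, S, Vt;
    af::svd(U, S, Vt, X); // X = U * diag(S) * Vt
    int d = static_cast<int>(X.dims(1));
    // the first d left singular vectors give decorrelated coordinates;
    // rescaling yields (approximately) unit covariance
    af::array Ud = U(af::span, af::seq(d));
    double scale = std::sqrt(static_cast<double>(X.dims(0)) - 1.0);
    return scale * Ud;
}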
@unbornchikken : we can discuss further if you like, but I think the only other case where removing the forward/backward paradigm would help is when straying away from neural networks. In that case you can simply ignore 'backward'. Otherwise, extending them should prove sufficient. RTRL still has a forward and a backward btw; you don't need to throw an undefined exception.
@jramapuram The LR was only high because the test (binary AND) is a linear operation and the weights will always get updated in the same direction once the algorithm starts. I just wanted to speed things up a little while testing. I understand why learning rates are usually small, but for this case it does not really matter.
Do you mean examples of other normalization strategies? Example would be whitening using SVD
I do not see how such strategies can be applied at each node. From what I understand, this can be calculated near the output and the norm can be propagated back for scaling up/down. Perhaps I should rename the method to simply say "scale".
Scale sounds more appropriate imo. But do we want to do internal node scaling? Generally the scaling op should be done beforehand (on the dataset directly) or using something like batch normalization.
I agree. I will just merge normalize and update step for now.
- Base Classes
- Autograd
- Neural Network
- Solvers / Optimizers
- Examples