arrayfire / arrayfire-ml

ArrayFire's Machine Learning Library.
BSD 3-Clause "New" or "Revised" License

TODO List for 0.1 release #17

Open pavanky opened 9 years ago

pavanky commented 9 years ago

- Base Classes
- Autograd
- Neural Network
- Solvers / Optimizers
- Examples

jramapuram commented 9 years ago

@pavanky : It doesn't necessarily have to be a classifier, though. Neural networks can easily handle regression-type problems as well. The only difference would be the loss function used, i.e. L2 loss (regression) vs. cross-entropy + softmax (classification).
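
A quick sketch of the two losses using plain ArrayFire arrays, just for illustration (the helper names are hypothetical, not part of afml):

#include <arrayfire.h>

// Mean squared error: the usual regression loss.
float l2_loss(const af::array& pred, const af::array& target) {
    af::array diff = pred - target;
    return af::mean<float>(diff * diff);
}

// Cross-entropy over softmax outputs: the usual classification loss.
// `prob` holds softmax probabilities, `target` holds one-hot labels;
// averaged over all entries for brevity.
float cross_entropy_loss(const af::array& prob, const af::array& target) {
    const double eps = 1e-8;  // guard against log(0)
    return -af::mean<float>(target * af::log(prob + eps));
}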

jramapuram commented 9 years ago

I think the following would be helpful from an API standpoint:

struct Model {
  int add(Layer layer_type);
  int compile(Optimizer opt, Loss loss, int max_iter = 200, bool early_stop = true);
  // Batch training up to max_iter, or earlier if early_stop is set. Accepts either
  // explicit cross-validation data or a ratio for splitting the given data.
  float fit(DataSet train_data, DataSet target_data,
            const std::tuple<DataSet, DataSet>* validation = nullptr,
            float validation_split = 0.0f);
  float train(DataSet train_data, DataSet target_data); // single step for online methods (can be called from fit)
  DataSet predict(DataSet test_data);                   // for evaluating new data
};

All of the ints above are return codes.

This will give maximum flexibility in:

  1. Layer creation
  2. Model training
  3. Online + batch setting

The Layer class, as you mentioned, should do the following:

struct Layer {
  int connect(Layer prev);     // connect to the previous layer in a deep net
  DataSet derivative(DataSet x);
  DataSet forwardPass(DataSet x);
  DataSet input();             // returns the input that was last used (or the output of the previous layer)
  Weights weights();           // returns just the weights
  Bias bias();                 // returns the bias (or stack of biases) if any (otherwise nullptr, maybe?)
  std::map<std::string, std::string> conf(); // getter returning the layer's own configuration
};

futurely commented 9 years ago

According to much recent research, the most powerful neural networks are no longer stacks of layers but arbitrarily complex graphs, e.g. various advanced recursive neural networks, Facebook AI Research's Memory Networks, Google DeepMind's Neural Turing Machine, etc. So node is a more general name than layer.

connect can't reflect the direction of connections between the nodes. Except for the nodes that connect to the input data, each node's inputs are the outputs of its predecessor nodes. Caffe uses Google Protocol Buffers to define and serialize the network. Apache Thrift, which was open sourced by Facebook, supports many more languages.

The following API is inspired by Caffe's Blob, Layer and Net.

typedef shared_ptr<array> ArrayPtr;

class Data {
  public:
    explicit Data(vector<int>& size);
    int nDimension() const;
    vector<int> size();
    // Caffe exposes the raw CPU/GPU pointers for use in BLAS functions.
    // array has a high level API, so there's no need to do so.
    ArrayPtr weights() const;
    ArrayPtr gradients() const;
};

typedef shared_ptr<Data> DataPtr;
typedef vector<DataPtr> DataPtrVec;

class Node {
 public:
  explicit Node(NodeParam& nodeParam);
  virtual ~Node();
  // Calls initNode, which subclasses can override
  void init();
  // Input and output are more general than the top and bottom of Caffe
  virtual void forward(const DataPtrVec& input,
      const DataPtrVec& output);
  // propagate_back is more general than propagate_down of Caffe
  virtual void backward(const DataPtrVec& input,
      const vector<bool>& propagate_back,
      const DataPtrVec& output);
  // The model is a DAG (Directed Acyclic Graph);
  // it's more intuitive for the predecessor to add the successor
  void addSuccessor(Node& node);
  void addSuccessors(vector<Node>& nodes);
 protected:
  virtual void initNode();
};

// Dtype is float or double
template <typename Dtype>
class Graph {
 public:
  explicit Graph(GraphParam& graphParam);
  virtual ~Graph();
  virtual void forward(const DataPtrVec& inputs, DataPtrVec* outputs,
      Dtype* loss = NULL);
  /**
   * (Caffe) The network backward should take no input and output, since it solely
   * computes the gradient w.r.t. the parameters, and the data has already been
   * provided during the forward pass.
   */
  virtual void backward();
  Dtype forwardBackward(const DataPtrVec& inputs) {
    Dtype loss;
    DataPtrVec outputs;
    forward(inputs, &outputs, &loss);
    backward();
    return loss;
  }
};

futurely commented 9 years ago

Microsoft Research's "Computational Networks: A Generalization of Deep Learning Models" presented its open source deep learning framework CNTK (see the attached images).

jramapuram commented 9 years ago

I like the generalizations. A few notes:

  void addSuccessor(Node& node);

seems redundant with:

  void addSuccessors(vector<Node>& nodes);

since a node can have sub-nodes as well, right? Why not just have a pointer to the next and previous nodes: a basic doubly linked list.

One item that is very important and missing from your schema is some notion of accuracy. 99% of the time you will need some form of data splitting and verification against a cross-validation data set. It would be a good idea to incorporate this right off the bat.

pavanky commented 9 years ago

@jramapuram

If I am reading correctly, the following is for cases when a single node is connected to multiple successors (like the CN with shared params diagram).

void addSuccessors(vector<Node>& nodes);

I am not sure a linked list is the solution here.

pavanky commented 9 years ago

@futurely Thanks for the great feedback! The proposed API looks solid. However, the one issue I see is that going the Node / Graph route means the amount of parallelism will decrease in networks that use more traditional layers. I wonder if we can also specialize a Layer class that sits on top of Node to achieve this.

pavanky commented 9 years ago

@futurely Sorry for jumping the gun. I re-read the entire discussion. It looks like the proposed Node is just a generalized Layer.

pavanky commented 9 years ago

@futurely @jramapuram Would it be possible to continue the discussion over here: https://gitter.im/arrayfire/arrayfire_ml ?

pavanky commented 9 years ago

Suggestions from @alcinos:

alcinos commented 9 years ago

Here is a proposition including the modifications: https://gist.github.com/alcinos/3bedb2f7c4518fa93220

futurely commented 9 years ago

The Graph or Network class is not needed at all. Here's a simple illustration:

structure node
   [list of nodes] neighbors
   [data]
end

cost(X, Y) := if (X.neighbors contains Y) return X.neighbors[Y];
           else "Not possible"

list nodes;
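
In C++ terms, that illustration might look roughly like the following (a sketch only; the names are illustrative):

#include <memory>
#include <vector>

// Each node carries its own adjacency list, so there is no central Graph object.
struct GraphNode {
    std::vector<std::shared_ptr<GraphNode>> neighbors;  // successor nodes
    // node data (weights, activations, ...) would live here
};

// Connectivity is answered by the node itself.
bool connected(const GraphNode& x, const std::shared_ptr<GraphNode>& y) {
    for (const auto& n : x.neighbors)
        if (n == y) return true;
    return false;
}
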
alcinos commented 9 years ago

@futurely The point is that the net structure should be independent from the nodes. The same node can be involved in different topologies depending on the context. For example, in stacked auto-encoders the training is done layer by layer, which requires a different topology to train each level.

pavanky commented 9 years ago

@alcinos Can you explain how having an adjacency list helps in the situations you mentioned? I still think having a centralized location for the representation is the better option; however, I do not see how it solves the problem of greedy layer-by-layer training.

alcinos commented 9 years ago

@pavanky Well, let's say we have a 3-layer stacked auto-encoder. I will denote an output layer by "O", an input layer by "I", an encoding layer by "E" and a decoding layer by "D". The first step is to greedily train the first layer. This training is performed on the network I -> E1 -> D1 -> O. After some number of iterations (or once the reconstruction error goes below a given threshold), we can train the second layer. This time, the network is I -> E1 -> E2 -> D2 -> O (we want the second layer to reconstruct the encoding of the first one), and so on for the other layers. The last step is a fine-tuning of the weights (by gradient descent), performed on the net I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O.

Eventually, depending on the application, it is likely that the interesting part of the trained net is the output of E3 (high-level features of the input). Once trained, we'll thus only use the first part of the net: I -> E1 -> E2 -> E3.

The point is that in all those training steps the architecture of the net is different, hence it makes more sense to store this architecture independently of the nodes. Moreover, several architectures can be used more or less concurrently for the same nodes: for example, we can use the net I -> E1 -> E2 -> E3 as a feature generator for some problem (control, classification, ...), while constantly improving the reconstruction error (of the full net I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O) given new samples that come from experience (the training set of the net is not always fully available from the beginning).
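
A rough sketch of what storing the architecture outside the nodes could look like, assuming a minimal placeholder node type (all names here are illustrative):

#include <memory>
#include <vector>

// Topology lives outside the nodes: the same (shared) nodes appear in
// several nets at once. A "Net" here is just an ordered chain of node pointers.
struct SimpleNode { /* weights, forward(), backward(), ... */ };
using NodePtr = std::shared_ptr<SimpleNode>;
using Net     = std::vector<NodePtr>;

int main() {
    NodePtr E1 = std::make_shared<SimpleNode>(), E2 = std::make_shared<SimpleNode>(),
            E3 = std::make_shared<SimpleNode>(), D1 = std::make_shared<SimpleNode>(),
            D2 = std::make_shared<SimpleNode>(), D3 = std::make_shared<SimpleNode>();

    Net pretrain1 = {E1, D1};                  // I -> E1 -> D1 -> O
    Net pretrain2 = {E1, E2, D2};              // I -> E1 -> E2 -> D2 -> O
    Net finetune  = {E1, E2, E3, D3, D2, D1};  // full stack
    Net features  = {E1, E2, E3};              // encoder reused downstream

    // All four nets share the same node objects, so training any of them
    // updates the weights seen by the others.
    return 0;
}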

pavanky commented 9 years ago

I understand what auto-encoders are doing. My question was more about implementation. What you are suggesting requires either updating the adjacency list after each step or creating a new network after each step. Am I correct in this observation?

alcinos commented 9 years ago

Absolutely!

pavanky commented 9 years ago

In that case I think we can have a specialized class for auto-encoders (and RBMs) that will do the greedy training and return the encoding part of the network, to be included in another network on request.

That said, I think the other representation of the network (each node having a list of links) can also be made to work in this situation (maybe even a bit more cleanly) if the edges are labeled properly.

alcinos commented 9 years ago

In the setting where each node has a list of links, how would you deal with the same node belonging to several nets (simultaneously)?

jramapuram commented 9 years ago

An inverted multi-leaf tree structure would make sense for this (i.e. a node becoming an input to multiple nodes) and would then degenerate to a list. Decoupling a model from the node/layer structure does make sense.

pavanky commented 9 years ago

@alcinos Each node can be linked to any number of other nodes. So you begin by adding all the needed links. You then label which networks each link belongs to. When training a particular network, you just traverse the relevant links and remove them when you are done.
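
A minimal sketch of those labeled links (types and names are illustrative, not an actual implementation):

#include <map>
#include <memory>
#include <string>
#include <vector>

// Every outgoing link is tagged with the name of the network it belongs to.
struct LabeledNode {
    // network name -> successors of this node inside that network
    std::map<std::string, std::vector<std::shared_ptr<LabeledNode>>> links;

    void addLink(const std::string& net, std::shared_ptr<LabeledNode> next) {
        links[net].push_back(std::move(next));
    }
    void removeNetwork(const std::string& net) {
        links.erase(net);  // drop all links belonging to that network
    }
};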

alcinos commented 9 years ago

@pavanky mmh, it seems that this can rapidly become a mess, especially because each entity that wants to use a given node in a net has to resolve any label conflicts that arise on its own (which is not good in terms of encapsulation), and we also have to trust it to properly clean up the links when the net is no longer needed (which is not good either).

pavanky commented 9 years ago

Indeed. I just put the option out there. I think I have most of the information I need to push something out in the next couple of days.

alcinos commented 9 years ago

Cool, I'm looking forward to taking a look at it, and to contributing if I have the opportunity.

futurely commented 9 years ago

Examples of an RBM implemented with the high-level math API from cudamat are as follows: https://github.com/cudamat/cudamat/blob/master/examples/rbm_cudamat.py https://github.com/cudamat/cudamat/blob/master/examples/rbm_numpy.py

alcinos commented 9 years ago

@futurely The code you are showing doesn't feature any encapsulation. This is a problem.

futurely commented 9 years ago

Caffe creator's plan on "Improving Caffe: Some Refactoring".

futurely commented 9 years ago

Minerva's DAG- and operator-based implementation.

pavanky commented 9 years ago

@futurely @alcinos @jramapuram I pushed some of the basic classes out. I do not have any convolution nodes yet, and there are no network classes yet. I am going to push some sample network classes out later today to demonstrate how the structure can be stored.

The network class will also extend Node so that it can be used as a super node in a larger network.

I understand that this is not much right now, but feedback is welcome.

jramapuram commented 9 years ago

Looks great so far @pavanky! A few queries / comments:

I will take a deeper look a little later, but good job!

unbornchikken commented 9 years ago

@pavanky @jramapuram I believe forward and backward computations have to be refactored out of the base interface too, because I can see no way to support forward-phase gradient computations with the current Node class. I mean RTLR, for example.

jramapuram commented 9 years ago

@unbornchikken : Agreed. I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class. That way you can unwind recurrences if need be (or keep a truncated moving average of the weights as in RTRL)

unbornchikken commented 9 years ago

But for calculating RTLR you'll need a totally different kind of network traversal logic than forward or backward. I meant that if the traversal logic is refactored out of the Node interface into separate classes (ForwardPhase, BackwardPhase, RTLRPhase, etc.), it's easy to define separate ones for every possible algorithm AFML wants to support, now and in the future.
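
A minimal sketch of that split, with Node reduced to state/connectivity and each traversal living in its own class (the names are illustrative, not the actual afml interface):

#include <arrayfire.h>
#include <memory>
#include <vector>

// The node only holds state and connectivity; it knows nothing about traversal.
struct StateNode {
    af::array weights;
    std::vector<std::shared_ptr<StateNode>> successors;
};

// Traversal algorithms are separate components built on top of the nodes.
class ForwardPhase {
 public:
    virtual ~ForwardPhase() {}
    virtual af::array run(StateNode& node, const af::array& input) = 0;
};

class BackwardPhase {
 public:
    virtual ~BackwardPhase() {}
    virtual af::array run(StateNode& node, const af::array& gradOut) = 0;
};

// An RTRL-style phase (or any future algorithm) is just another class:
// class RTRLPhase : public ForwardPhase { ... };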

jramapuram commented 9 years ago

I think we might be talking about two separate things. Are you talking about Real-Time Recurrent Learning: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.9724&rep=rep1&type=pdf ?

unbornchikken commented 9 years ago

@jramapuram Exactly.

jramapuram commented 9 years ago

Exactly in that it is the same, or different? I didn't find any mention online of RTLR. RTRL involves just computing the gradients at time n+1 using the gradients and activations at time n, in addition to accumulating the gradients from 0 --> T_current. It is fully online. My solution of having virtual forward/backward functions should easily solve this problem.

unbornchikken commented 9 years ago

@jramapuram :) Yeah, I have trouble with acronyms; I often write LSTM as LTSM too. Sorry. OK, but for calculating the gradients you'll need to feed in the desired outputs and do a special kind of forward iteration of the whole network for each weight:

http://www.willamette.edu/~gorr/classes/cs449/rtrl.html (18)

Or maybe @pavanky doesn't want to support RTRL at all in favor of LSTM, which is a separate structure from this.

jramapuram commented 9 years ago

@unbornchikken : no worries, just wanted to make sure we are on the same page :). So RTRL is an optimization algorithm. It can be applied to LSTMs/GRUs/... (insert your RNN flavor here). If you look at that link (specifically steps #7 and #8), it isn't really anything fancy. You need to keep an accumulator for the error up to the current timestep as well as for the gradient (as opposed to BPTT, where things are unfolded all the way from T_current --> T0).

You can ignore a lot of the delta-function material mentioned there. It is, as in the paper, a way to unify the input/output pairs.

unbornchikken commented 9 years ago

@jramapuram What I tried to say is that to support RTRL with your rigid Node abstract class, you gotta define some additional methods in it for accepting and holding these forward-propagated p values. If you have some other special algorithm, or are trying to prototype one, you gotta modify the Node class for that too, which cannot be done at home; you gotta PR the master there. Am I right? But if the Node class's role changed to only holding and providing values, and those traversal algorithms could be defined in separate classes on top of this basic interface, then you'd have the ability to design the next Realtime Backward-Forward Long Term Populated Short Theory of Everything at Once neural network learning algorithm using ArrayFire ML. This ability is often missing from the ML libraries out there because of these early decisions.

jramapuram commented 9 years ago

@unbornchikken : Agreed. I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class.

If you look at Linear.hpp, it implements its own forward / backward. The easiest way to solve your problem is to make a RecurrentNode class that inherits from Node. The details of whether to use RTRL or BPTT can be resolved there.

> you gotta modify the Node class for that too, which cannot be done at home

If you are implementing a new algorithm then you will have to do this. This is C++, not Python. Abstracting forward/backward away into its own class doesn't seem like it would gain anything at this point. If anything, it adds another level of indirection, i.e. you will still have to recompile the library.

Now, that being said, it makes sense to have an Optimizer class. However, this is separate from the Nodes, since optimizers just take a cost and a gradient and do some form of ascent/descent.
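
For instance, a hedged sketch of such an Optimizer interface (hypothetical names; it only ever sees parameters and their gradients):

#include <arrayfire.h>

class Optimizer {
 public:
    virtual ~Optimizer() {}
    virtual void update(af::array& weights, const af::array& gradient) = 0;
};

class SGD : public Optimizer {
 public:
    explicit SGD(double lr) : lr(lr) {}
    void update(af::array& weights, const af::array& gradient) override {
        weights -= lr * gradient;  // plain gradient descent step
    }
 private:
    double lr;
};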

unbornchikken commented 9 years ago

@jramapuram You're talking about object-oriented design and I'm talking about composite design here. In your case, RTRLNode : RecurrentNode will have backward and forward methods that throw an unsupported exception, a forward method that accepts a whole new vector for storing derivatives along with the input, and another forward method propagating p values, declared in the RTRLNode class. In my case, Node doesn't have forward and backward methods; it only provides connections, weights, and other weight-related values. Forward, backward, etc. are implemented separately, and an Optimizer component groups them to provide the training functionality.

jramapuram commented 9 years ago

> In your case, RTRLNode : RecurrentNode will have backward and forward methods that throw an unsupported exception, and a forward method that accepts a whole new vector for storing derivatives

RTRLNode will have two private internal members that merely accumulate the error and derivatives locally. All RTRL is doing is estimating the derivatives for t+1 using the derivatives for t (this is done on a node-by-node basis). Each unit is a linear combination of the previous layer's activations coupled with a gemv call. The only thing that needs to be passed along is the current update (which is already being done).
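
In sketch form, the accumulators might look like this (hypothetical names; the actual RTRL update is elided):

#include <arrayfire.h>

// The node keeps running error and derivative estimates locally, so the
// Node interface itself does not have to change.
class RTRLNode /* : public RecurrentNode in the proposal above */ {
 public:
    void accumulate(const af::array& error, const af::array& grads) {
        errorAcc = errorAcc.isempty() ? error : errorAcc + error;
        gradAcc  = gradAcc.isempty()  ? grads : gradAcc + grads;
    }
 private:
    af::array errorAcc;  // error accumulated up to the current timestep
    af::array gradAcc;   // derivative estimate carried from t to t+1
};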

> You're talking about object-oriented design and I'm talking about composite design

In either case, how does your solution help you prototype faster?

unbornchikken commented 9 years ago

Maybe this is not the right issue to discuss the internals of RTRL, but we can agree that you will end up having an RTRLNode class that has to throw a not-supported exception in its backward method, right? That's because your fundamental design hardcoded the ability to serve a single algorithm suite. Btw, I haven't talked about anything that helps Pavan prototype faster. I've talked about something that I believe makes the fundamental design more extensible. It was only this: "I believe forward and backward computations have to be refactored from the base interface too" :)

pavanky commented 9 years ago

> Why is lr > 1 in your example for the perceptron? Generally it is << 1.0.

The LR was greater than 1 because I knew it was going to converge for that simple test.

> Are the thoughts to make LinearLayer : LinearNode, etc.? I like this idea.

Each "Node" here is the equivalent of a "Layer" from Caffe and a "Module" from torch-nn.

> normalize() should be factored out of Node in my opinion (into a Normalize class). There are many types of normalization that one might decide to use, for example.

If you can point to some examples, I will look into this.

> I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class. That way you can unwind recurrences if need be (or keep a truncated moving average of the weights as in RTRL)

That is how things are at the moment.

> But for calculating RTLR you'll need a totally different kind of network traversal logic than forward or backward.

Each "Node" can be a simple layer or a composite node like multi-layer perceptrons, recurrent neural networks, auto-encoders, Restricted Boltzmann Machines, etc. The composite nodes can be used as is or plugged into a larger network. The "forward" and "backward" methods help the composite nodes interface with other nodes in the larger network. For training these networks, the methods used will obviously be different.

That said, none of the API is final; it will change until we can address all the problems. This project is still in the prototyping phase, after all.

P.S. If you want to have prolonged discussions, can you move to the Gitter room instead? :-)

jramapuram commented 9 years ago

> const double lr = 10;

@pavanky : This is what I was referring to in perceptron.cpp.

> That is how things are at the moment.

Yeah, I noticed that in the later comments. I need to learn to read :)

> If you can point to some examples, I will look into this.

Do you mean examples of other normalization strategies? An example would be whitening using SVD.

@unbornchikken : we can discuss further if you like, but I think the only other area where removing the forward/backward paradigm would help is when straying away from neural networks entirely. In that case you can simply ignore 'backward'. Otherwise, extending them should prove sufficient. RTRL still has a forward and a backward, by the way; you don't need to throw an undefined exception.

pavanky commented 9 years ago

@jramapuram The LR was only high because the test (binary AND) is a linear operation and the weights will always get updated in the same direction once the algorithm starts. I just wanted to speed things up a little while testing. I understand why learning rates are usually small, but in this case it does not really matter.

pavanky commented 9 years ago

> Do you mean examples of other normalization strategies? An example would be whitening using SVD.

I do not see how such strategies can be applied at each node. From what I understand, this can be calculated near the output and the norm can be propagated back for scaling up / down. Perhaps I should rename the method to simply say "scale".

jramapuram commented 9 years ago

Scale sounds more appropriate, imo. But do we want to do internal node scaling at all? Generally the scaling op should be done up front (on the dataset directly) or via something like batch normalization.
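
For example, scaling the dataset up front could look roughly like this (a sketch, not an afml API):

#include <arrayfire.h>

// Standardize each feature (column) to zero mean and unit variance before
// training, instead of scaling inside every node.
af::array standardize(const af::array& data) {
    const unsigned n = static_cast<unsigned>(data.dims(0));
    af::array mu  = af::mean(data, 0);    // per-column mean
    af::array sig = af::stdev(data, 0);   // per-column standard deviation
    return (data - af::tile(mu, n)) / af::tile(sig + 1e-8, n);
}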

pavanky commented 9 years ago

I agree. I will just merge the normalize and update steps for now.