Somnibyte / MLKit

A simple machine learning framework written in Swift 🤖

restructuring of ANN for flexibility and performance gains #2

Open beeedy opened 7 years ago

beeedy commented 7 years ago

First off, what you are working on here is immensely impressive. I just wanted to point out some things I have learned implementing NNs myself and pass along any possible insights.

If I understand the current structure, you have an overarching NN class that contains references to a layer class, which itself contains references to your final neurons. A possible simplification you may want to look into is to completely eliminate the neuron class and instead represent each layer in the network as a single 2D vector/tensor of dimension m x n, where m is the number of neurons in the layer and n is the number of neurons in the previous layer. With this approach you can compute forward propagation at each layer by taking that layer's m x n vector/tensor and dotting it with the previous layer's output vector/tensor, producing an output vector/tensor that can either feed into the next layer or serve as the output of the network as a whole.
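To make this concrete, here is a rough Swift sketch of a layer stored as a single 2D weight array, with forward propagation computed as a matrix-vector product followed by a sigmoid. The names and structure are purely illustrative and are not MLKit's existing API:

import Foundation

// A layer represented as an m x n weight matrix: weights[i][j] is the weight
// from neuron j of the previous layer to neuron i of this layer.
struct DenseLayer {
    var weights: [[Double]]

    // Forward propagation: each output is the dot product of one weight row
    // with the previous layer's output, passed through a sigmoid activation.
    func forward(_ previousOutput: [Double]) -> [Double] {
        return weights.map { row in
            var sum = 0.0
            for j in 0..<row.count {
                sum += row[j] * previousOutput[j]
            }
            return 1.0 / (1.0 + exp(-sum))
        }
    }
}

// Example: a 4-neuron hidden layer over 3 inputs feeding a 1-neuron output layer.
let hidden = DenseLayer(weights: [[ 0.1,  -0.2,  0.3],
                                  [ 0.4,   0.1, -0.1],
                                  [-0.3,   0.2,  0.2],
                                  [ 0.05, -0.4,  0.1]])
let output = DenseLayer(weights: [[0.2, -0.1, 0.3, 0.4]])
let prediction = output.forward(hidden.forward([0.0, 1.0, 1.0]))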

This approach not only simplifies your forward propagation, it also simplifies your back propagation, because you can use the same dot products to go back through your layers and calculate how much to adjust the weights. If you have a layer's input vector, its output error as a vector, and a vector containing the derivative of the activation function at each output, the amount each weight should be adjusted comes from the dot product of the input vector and the element-wise product of the output error vector and the activation derivative vector. Hopefully I explained that well enough, apologies if not :(
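For the weight update itself, here is a similarly rough Swift sketch (again illustrative, not MLKit's API), assuming a single layer with the three quantities described above:

// delta[i] = outputError[i] * activationDerivative[i]; the adjustment to
// weights[i][j] is then learningRate * delta[i] * input[j], i.e. the outer
// product of the delta vector and the layer's input vector.
func weightAdjustment(input: [Double],
                      outputError: [Double],
                      activationDerivative: [Double],
                      learningRate: Double = 1.0) -> [[Double]] {
    var adjustment = [[Double]]()
    for i in 0..<outputError.count {
        let delta = outputError[i] * activationDerivative[i]
        adjustment.append(input.map { learningRate * delta * $0 })
    }
    return adjustment
}

Each entry of the returned m x n matrix is then added to the corresponding weight (or subtracted, depending on how you define the error).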

You have alluded to a desire to implement some performance increases using Metal down the road, and you may find that these dot products are also ideal for parallelization on a GPU.

Anyhow, feel free to ignore this, but I just wanted to pass it along. Excited to see where this project ends up!

Somnibyte commented 7 years ago

Hey @beeedy, thank you so much for your advice. I truly believe that the method you suggested for simplifying the NN using matrix and vector operations is more feasible than the solution MLKit currently provides, which is why I will be taking steps to overhaul the NN and revise it so that, later down the road, the Metal implementation becomes much easier. Thank you again. I just created a new branch if you are interested in contributing or checking out updates for the new NN class. I'll be working on it very soon.

beeedy commented 7 years ago

@iamtrask has a good example of a 3-layer network using this approach in Python, which I will paste below. Here is the article the example was pulled from: http://iamtrask.github.io/2015/07/12/basic-python-network/

I will take a stab at helping out if/when I am able to find the time to do so!

import numpy as np

def nonlin(x, deriv=False):
    # Sigmoid activation; when deriv=True, x is assumed to already be a
    # sigmoid output, so x*(1-x) is the derivative at that point.
    if deriv:
        return x*(1-x)

    return 1/(1+np.exp(-x))

X = np.array([[0,0,1],
            [0,1,1],
            [1,0,1],
            [1,1,1]])

y = np.array([[0],
            [1],
            [1],
            [0]])

np.random.seed(1)

# randomly initialize our weights with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

for j in range(60000):

    # Feed forward through layers 0, 1, and 2
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))
    l2 = nonlin(np.dot(l1,syn1))

    # how much did we miss the target value?
    l2_error = y - l2

    if (j % 10000) == 0:
        print("Error:" + str(np.mean(np.abs(l2_error))))

    # in what direction is the target value?
    # were we really sure? if so, don't change too much.
    l2_delta = l2_error*nonlin(l2,deriv=True)

    # how much did each l1 value contribute to the l2 error (according to the weights)?
    l1_error = l2_delta.dot(syn1.T)

    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    l1_delta = l1_error * nonlin(l1,deriv=True)

    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

Variable   Definition
X          Input dataset matrix where each row is a training example
y          Output dataset matrix where each row is a training example
l0         First layer of the network, specified by the input data
l1         Second layer of the network, otherwise known as the hidden layer
l2         Final layer of the network, which is our hypothesis and should approximate the correct answer as we train
syn0       First layer of weights, Synapse 0, connecting l0 to l1
syn1       Second layer of weights, Synapse 1, connecting l1 to l2
l2_error   The amount by which the neural network "missed"
l2_delta   The error of the network scaled by the confidence; almost identical to the error except that very confident errors are muted
l1_error   l2_delta weighted by the weights in syn1, which gives the error in the middle/hidden layer
l1_delta   The l1 error of the network scaled by the confidence; again, almost identical to l1_error except that confident errors are muted

Somnibyte commented 7 years ago

@beeedy I have a working prototype of the new neural network architecture based on this tutorial. You can check out the branch here. There is a playground available in that repository; feel free to run experiments there.

beeedy commented 7 years ago

@Somnibyte I looked over your changes during my lunch and they are looking good! When you say you have a NN architecture working based off that tutorial, does that mean you have handwriting recognition working? If so, that would be an impressive demonstration of the library's capabilities and would be worth adding as an example!

I might make a PR over the weekend if I can find some time to implement some small changes. ReLU and leaky ReLU would be good activation functions to add; you can read more about them here: https://en.wikipedia.org/wiki/Rectifier_(neural_networks) http://www.kdnuggets.com/2016/03/must-know-tips-deep-learning-part-2.html
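For reference, here is a rough Swift sketch of those two activations and their derivatives (function names are illustrative, not MLKit's existing API):

// Standard ReLU: passes positive values through and clamps negatives to zero.
func relu(_ x: Double) -> Double {
    return max(0.0, x)
}

func reluDerivative(_ x: Double) -> Double {
    return x > 0 ? 1.0 : 0.0
}

// Leaky ReLU lets a small gradient (alpha) through for negative inputs,
// which helps avoid "dead" neurons.
func leakyReLU(_ x: Double, alpha: Double = 0.01) -> Double {
    return x > 0 ? x : alpha * x
}

func leakyReLUDerivative(_ x: Double, alpha: Double = 0.01) -> Double {
    return x > 0 ? 1.0 : alpha
}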

One thing I noticed is that you have weights and biases somewhat separated. There is nothing wrong with this at all, but I just wanted to bring your attention to one way this is commonly implemented. What you will sometimes see is people adding an extra neuron to each layer that is fixed at a constant value (usually 1.0). This neuron never takes any inputs (or all of its input weights are stuck at 0.0) and always outputs a value of 1.0 to the next layer. This 'special' neuron effectively wraps up all the functionality of a bias without having to explicitly train weights and biases separately. No real reason to go back and change this unless you feel so inclined; I just wanted to bring it to your attention, as you will most likely run into it implemented this way at some point :)
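As a purely illustrative Swift sketch of that idea, assuming the same layer-as-matrix representation discussed earlier:

import Foundation

// Append a constant 1.0 "neuron" to the previous layer's output and give each
// neuron in this layer one extra weight. That extra weight plays the role of
// the bias, so it is trained like any other weight instead of separately.
func forwardWithBiasUnit(weights: [[Double]],      // m x (n + 1); last column acts as the bias
                         previousOutput: [Double]) -> [Double] {
    let augmented = previousOutput + [1.0]         // the fixed bias "neuron"
    return weights.map { row in
        var sum = 0.0
        for j in 0..<row.count {
            sum += row[j] * augmented[j]
        }
        return 1.0 / (1.0 + exp(-sum))             // sigmoid, as in the earlier sketch
    }
}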

I will take some time this evening hopefully to play around more in-depth with the changes! All in all, very good work!

Somnibyte commented 7 years ago

@beeedy Thank you for your feedback! I'm planning to work on an MNIST handwritten-digit example: MLKit would provide a separate example project where users can draw digits and the app tries to predict which digit was drawn. Currently, this branch does not include the handwriting example. Thank you for the links on ReLU, I'll definitely give those a good read later today. Also, good note on the bias implementation. What you described is (sort of) what I did in the last version of MLKit, except the programmer had to input the value of 1 themselves. This version is based on how the tutorial handles the bias: the user can manually set the bias values to 1, but to make it easier I'll try to package the weights and biases together in an upcoming update.