CodingTrain / Toy-Neural-Network-JS

Neural Network JavaScript library for Coding Train tutorials
MIT License

More activation function options? #70

Closed. Adil-Iqbal closed this issue 6 years ago.

Adil-Iqbal commented 6 years ago

I noticed that the current iteration of the toy-nn library has two activation functions: the sigmoid (aka logistic) function and the TanH function.

There are actually a ton of activation functions and their derivatives listed on Wikipedia. Reference: https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions

If you're interested, I wouldn't mind transcribing a few more.

Maik1999 commented 6 years ago

Actually, it only works with derivatives that can be computed from the already-activated value, i.e. f'(f(x)). Maybe this should be changed then, too.

Adil-Iqbal commented 6 years ago

Agreed.

If we want the neural network to know which matrices to apply the activation function to, we would need to add boolean properties to the ActivationFunction class.

class ActivationFunction {
  constructor(func, dfunc, usePreVals = false) {
    this.func = func;
    this.dfunc = dfunc;
    this.usePreVals = usePreVals;
  }
}

let sigmoid = new ActivationFunction(
  x => 1 / (1 + Math.exp(-x)),
  y => y * (1 - y)
);

let relu = new ActivationFunction(
  x => x < 0 ? 0 : x,
  x => x < 0 ? 0 : 1,
  true // derivative is written against the pre-activation values
);

Such properties could inform the NeuralNetwork class what to do.

let gradients;
if (this.activation_function.usePreVals) {
  gradients = Matrix.map(prevOutputs, this.activation_function.dfunc);
} else {
  gradients = Matrix.map(outputs, this.activation_function.dfunc);
}

Implementing the feature wouldn't be too difficult.

That said, if Mr. Shiffman is eventually going to switch over to deeplearn.js, the Graph class has a lot of the most broadly used activation functions included as methods. Reference: https://deeplearnjs.org/docs/api/classes/graph.html Example: https://www.robinwieruch.de/neural-networks-deeplearnjs-javascript/

Not sure if I should proceed. Any thoughts?

Versatilus commented 6 years ago

I have the same sorts of feelings. I'm not sure I can advocate for adding a lot of configurability to a library that's made for quick prototyping and ease of learning. I don't think it's over-the-top yet, but it's getting there pretty quickly.

I think ReLU and leaky ReLU should be implemented, but I wouldn't put much effort into it beyond that.
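
For reference, a minimal sketch of leaky ReLU in the ActivationFunction style sketched above (the 0.01 slope is a common default, not something the library fixes):

let leaky_relu = new ActivationFunction(
  x => x < 0 ? 0.01 * x : x,   // small slope instead of a hard zero below 0
  x => x < 0 ? 0.01 : 1,
  true                         // derivative written against pre-activation values
);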

On Mon, Feb 12, 2018 at 2:02 PM, Adil Iqbal notifications@github.com wrote:

Confirmed. Sigmoid and TanH are used on already functioned values.

If we absolutely had to get around this problem, we would have to use static functions more frequently to keep access to pre-functioned matrices. We would also add a boolean property to the NeuralNetwork class so it knows whether to apply the activation function to pre-functioned or post-functioned matrices. That property would in turn be informed by a boolean property on the ActivationFunction class encoding which activation functions require special treatment and which don't. This would open the door for things like leaky ReLU and Gaussian activation functions. Doable, if absolutely necessary; very doable, in my opinion.

That said, if Mr. Shiffman is eventually going to switch over to deeplearn.js, the Graph class has most of the most broadly used activation functions included as methods. Reference: https://deeplearnjs.org/docs/api/classes/graph.html Example: https://www.robinwieruch.de/neural-networks-deeplearnjs-javascript/

Not sure if I should proceed.


Adil-Iqbal commented 6 years ago

I went ahead and put in a little bit of effort. I need to make sure they work so if you guys wouldn't mind giving me some feedback before I submit a pull request, I'd appreciate it. ^_^

Since I had to duplicate pre-functioned values for both hidden and output layers, I decided to add a duplicate function to the matrix library, which I've written tests for as well.

Changes Made: https://github.com/Adil-Iqbal/Toy-Neural-Network-JS/commit/4ca648c73659b57a6486ce8b1cd842b29c1b5e97#diff-e8acc63b1e238f3255c900eed37254b8

Forked Repository: https://github.com/Adil-Iqbal/Toy-Neural-Network-JS/tree/master/lib

Versatilus commented 6 years ago

I like these changes. This allows a lot more flexibility.

The changes are technically not necessary to implement ReLU because the backwards function returns constants around the same pivot point used in the forward function.
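
In other words (a sketch of the idea, not the library's actual code): because relu(x) and x are positive in exactly the same places, the derivative can also be read off the activated value, so no usePreVals flag is strictly required for plain ReLU.

// relu(x) > 0 exactly when x > 0, so the post-activation value carries
// enough information to pick the right constant for the derivative.
let relu_post = new ActivationFunction(
  x => Math.max(0, x),
  y => y > 0 ? 1 : 0
);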


Adil-Iqbal commented 6 years ago

That should be as easy as deleting line 41.

I'm in the process of testing the code to make sure it works as intended.

Adil-Iqbal commented 6 years ago

Just did a round of testing. All of the activation functions were tested on the XOR example that was included in the library. Below is a list of observations...

  1. All but two activation functions were able to solve XOR. I discuss the two that could not solve the problem in the last item of this list.
  2. The best performer was gaussian, with leaky_relu a close second.
  3. The worst performer was relu. It could solve the problem sometimes, though there were times the screen would go completely black. I was reading an article about activation functions that mentioned something called "the dying ReLU problem"; I suppose this is a manifestation of that.
  4. It turns out the use_X_values property of the ActivationFunction class is really important. All of the activation functions were tried with and without it, and the functions that require it refuse to learn without it.
  5. The two functions that were totally unable to solve XOR were arctan and softplus. I checked their formulas on Wikipedia and they are correct as far as I can tell. They were equally bad with and without the use_X_values property. The arctan function would flicker the canvas between pitch black and pure white. The softplus function would choose black or white, and all rectangles on the canvas would remain that color until I refreshed the page. I'm not sure what is wrong with those two.

Next Step: I'm going to test them all on the MNIST example. I would appreciate any advice and criticism you guys can give me. I'd like to get arctan and softplus to work, otherwise I may not include them in the pull request.

Adil-Iqbal commented 6 years ago

Okay, completed the MNIST testing during lunch. The results were disappointing. I have pasted the results of the testing below...

Activation Function: Sigmoid.
Test Set #1: 68.29%
Test Set #2: 84.41%
Test Set #3: 88.98%
First 90%+ Accuracy Test Set: #7.
Time to 10th Test Set: 1m 32s 997ms

Activation Function: TanH.
Test Set #1: 13.44%
Test Set #2: 14.64%
Test Set #3: 16.24%
First 90%+ Accuracy Test Set: Accuracy at 5m was 13.59%.
Time to 10th Test Set: 1m 33s 156ms

Activation Function: ArcTan.
Test Set #1: 13.86%
Test Set #2: 14.27%
Test Set #3: 14.38%
First 90%+ Accuracy Test Set: Accuracy at 5m was 15.58%.
Time to 10th Test Set: 1m 34s 684ms

Activation Function: Softsign.
Test Set #1: 9.83%
Test Set #2: 9.95%
Test Set #3: 10.21%
First 90%+ Accuracy Test Set: Accuracy at 5m was 9.8%.
Time to 10th Test Set: 1m 31s 931ms
Note: Test sets #4 - #38 were all 9.8% accuracy.

Activation Function: ReLU.
Test Set #1: 9.81%
Test Set #2: 9.8%
Test Set #3: 9.8%
First 90%+ Accuracy Test Set: Accuracy at 5m was 9.8%.
Time to 10th Test Set: 1m 47s 351ms
Note: Test sets #2 - #32 were all 9.8% accuracy.

Activation Function: Leaky ReLU.
Test Set #1: 9.8%
Test Set #2: 9.8%
Test Set #3: 9.8%
First 90%+ Accuracy Test Set: Accuracy at 1m 1s was 9.8%.
Time to 10th Test Set: 1m 40s 897ms
Note: Test sets #1 - #10 were all 9.8% accuracy. The test was terminated at 1m 40s.

Activation Function: SoftPlus.
Test Set #1: 9.8%
Test Set #2: 9.8%
Test Set #3: 9.8%
First 90%+ Accuracy Test Set: Accuracy at 1m 49s was 9.8%.
Time to 10th Test Set: 1m 49s 399ms
Note: Test sets #1 - #10 were all 9.8% accuracy. The test was terminated at 1m 49s.

Activation Function: Gaussian.
Test Set #1: 16.75%
Test Set #2: 21.23%
Test Set #3: 23.58%
First 90%+ Accuracy Test Set: Accuracy at 5m was 32.01%.
Time to 10th Test Set: 1m 34s 598ms

Here are my observations:

  1. It seems that relu, leaky_relu, softsign, and softplus are not able to predict the digits any better than random chance.
  2. tanh, arctan, and gaussian were able to predict the digits at slightly better than random. I feel comfortable saying that because the sample size for the test set is 10,000 digits.
  3. sigmoid was by far the best performer.

Not sure what this means. A lot of the functions that failed this test solved XOR with flying colors. I could use some help.

Versatilus commented 6 years ago

There's a pretty significant problem with the toy neural network library. A lot of the sums going into the activation function are fairly large. Too large. I thought this was going to be a problem during the live stream, but the law of averages saves sigmoid. Trying an activation function like ReLU makes everything explode horribly.

I have compensated for this in the past by multiplying each node in each layer by 1/<layer input count> before activation. This stops numerical Armageddon, but it also reduces network accuracy a little bit. I'm guessing it has to be compensated for in the backwards pass, but I haven't figured that out yet.
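
A sketch of that compensation, assuming a feedforward step shaped like the library's (the weights_ih / bias_h names and the Matrix helpers here are assumptions about where it would go, not the actual implementation):

// Scale each pre-activation sum by 1 / <layer input count> before activating,
// so the sum of many weighted inputs stays in a range the activation handles.
let hidden = Matrix.multiply(this.weights_ih, inputs); // assumed helper names
hidden.add(this.bias_h);
hidden.map(x => x / this.input_nodes);                 // the 1/<layer input count> factor
hidden.map(this.activation_function.func);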

I've been getting weird results this morning, which is part of why I haven't posted anything about it.

Versatilus commented 6 years ago

Specifically in the case of MNIST, each hidden node's pre-activation value is the sum of 784 weighted inputs, each roughly between -1 and +1, and that sum is passed through the activation function. Those sums will occasionally be larger than approximately +/-16, which is where the sigmoid function breaks down. With ReLU, the errors end up being so large that trying to adjust for them produces nonsensical results. A different error function might also help there.
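
A quick standalone sketch of that saturation:

const sigmoid = x => 1 / (1 + Math.exp(-x));

console.log(sigmoid(16));                      // ~0.9999999, effectively saturated
console.log(sigmoid(16) * (1 - sigmoid(16)));  // derivative ~1e-7, so the gradient vanishes
console.log(sigmoid(40));                      // rounds to exactly 1 in double precision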

Adil-Iqbal commented 6 years ago

Thanks @Versatilus!

I've figured it out! Nothing is broken at all.

With MNIST we are trying to predict digits based on probabilities.

The sigmoid function is designed to squish the x value between 0 and 1. These are probabilities that we can work with.

Functions like relu and softplus range off into infinity. No probabilities are being created. They are simply not meant to be used for this purpose.

The reason gaussian did as well as it did is that it operates in the same range as sigmoid (it just inverts after x > 0): https://www.desmos.com/calculator/mfsaj5iopn
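
Assuming the gaussian here is the usual exp(-x^2), a quick sketch of the range comparison:

const sigmoid  = x => 1 / (1 + Math.exp(-x));  // outputs in (0, 1), increasing
const gaussian = x => Math.exp(-(x * x));      // outputs in (0, 1], peaks at x = 0
// Both stay inside [0, 1] and behave like usable "probabilities" for the MNIST
// output layer, unlike relu/softplus, whose outputs are unbounded above.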

I've decided to keep arctan and softplus in the pull request. I don't think we can judge an activation function based on one or two use cases. If the community decides that they should be removed at a later time, or not included at all, they will voice that opinion.

Versatilus commented 6 years ago

Someone reminded me a few minutes ago that ReLU should work if the weights are initialized properly for it. You're right. The right tool should be used for the right job.


Adil-Iqbal commented 6 years ago

Someone reminded me a few minutes ago that ReLU should work if the weights are initialized properly for it.

For sure. Though it's hard to initialize those weights properly without knowing what the user's data set looks like. Ultimately, I think it's up to the user to adjust their inputs accordingly before passing them into the NN. Though I might be mistaken; we can update as needed.
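
For anyone who wants to experiment with that, a minimal sketch of He-style initialization (the usual recommendation for ReLU: weights drawn with standard deviation sqrt(2 / fan_in)). The heWeight helper and the weights_ih name are illustrative assumptions, not part of the library's current API:

// He initialization keeps ReLU pre-activations in a reasonable range.
function heWeight(fanIn) {
  const std = Math.sqrt(2 / fanIn);
  // Box-Muller transform for an approximately normal sample
  const u1 = Math.random() || 1e-12;
  const u2 = Math.random();
  return std * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// e.g. fill an input-to-hidden weight matrix for the 784 MNIST inputs:
// this.weights_ih.map(() => heWeight(784));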

Adil-Iqbal commented 6 years ago

@Versatilus Hey man. I'd like to apologize.

I came home from work and read your posts again. I think you might have been on to something so I ran another test on relu. I thought the best way for us to prevent the exploding gradient you were talking about was to set the learning rate really low (I set it to 0.001). Here was the result...

Activation Function: ReLU
Test Set #1: 14.06%
Test Set #2: 18.87%
Test Set #3: 22.14%
First 90%+ Accuracy Test Set: Accuracy at 5m was 50.03%.
Time to 10th Test Set: 1m 52s 658ms

I still think that setting the learning rate should ultimately be up to the user. So if that person were to use relu, it's on them to set the learning rate in their sketch. They could even do it dynamically, for example:

switch (true) {
    // check the highest threshold first so every case is reachable
    case (accuracy >= 0.75):
        nn.setLearningRate(0.0001);
        break;
    case (accuracy >= 0.5):
        nn.setLearningRate(0.0005);
        break;
    case (accuracy >= 0.08):
        nn.setLearningRate(0.001);
        break;
    default:
        nn.setLearningRate();
}

Or something like that (I may be getting the syntax slightly wrong). Either way, the code I submitted is okay, and you are correct.

Versatilus commented 6 years ago

I'm not sweating it. :-)

In the end, we are both right. It's a combination of all of those things. This is a complicated subject with a lot of little details and I'm just happy that Dan is covering it in a way that is approachable.

This library is still very much a work in progress. I know from reading some of the other pull requests that Dan wants to get to a lot of these issues and tackle them in the live streams. I think we should just keep making comments and making pull requests like we have been so that he knows what users want to focus on when he streams.

I don't know what his plans are, but I think it would be good if one of the next things he does on the subject in the stream is either a more sophisticated way of choosing weights or adjusting the learning rate, or both. I also hear chatter that basic convolutions might be coming soon.

Adil-Iqbal commented 6 years ago

I got arctan and softplus to solve the XOR problem! It turns out I was forgetting my order of operations. I needed some parentheses.

Changes Made: https://github.com/Adil-Iqbal/Toy-Neural-Network-JS/commit/0227dfd71ce533b8e11df334d7737456bb841ebb#diff-e8acc63b1e238f3255c900eed37254b8
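
For anyone following along, this is the kind of mistake involved (illustrative only, not the exact diff): without parentheses around the denominator, the softplus derivative silently computes the wrong thing.

// Softplus: f(x) = ln(1 + e^x), with derivative f'(x) = 1 / (1 + e^-x), i.e. the sigmoid of x
let bad_dfunc  = x => 1 / 1 + Math.exp(-x);    // parsed as (1 / 1) + e^-x  -> wrong
let good_dfunc = x => 1 / (1 + Math.exp(-x));  // parentheses give the intended sigmoid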

I'll do another round of MNIST testing during my lunch break.

Adil-Iqbal commented 6 years ago

@xxMrPHDxx has found more order-of-operations quirks. I'm going to repeat all of my testing. This is great!

Adil-Iqbal commented 6 years ago

We should continue this conversation in #75. Thanks for the help y'all.