The concept of non-linearity in a neural network is introduced by an activation function, which serves an integral role in the training and performance of the network. Over the years of research, many activation functions have been proposed; however, only a few are widely used across most applications, including ReLU (Rectified Linear Unit), TanH (hyperbolic tangent), Sigmoid, Leaky ReLU and Swish. In this work, a novel neural activation function called Mish is proposed. The experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze Excite Net-18 for CIFAR-100 classification, the network with Mish had an increase in Top-1 test accuracy of 0.494% and 1.671% compared to the same network with Swish and ReLU respectively. Its similarity to Swish, its performance boost, and its simplicity of implementation make it easy for researchers and developers to use Mish in their neural network models.
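For reference, here is a minimal sketch of the function from the paper, f(x) = x * tanh(ln(1 + e^x)), in plain Python (naive, with no guard against exp overflow, which becomes relevant later in this thread):

```python
import math

def mish(x):
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    return x * math.tanh(math.log1p(math.exp(x)))

print(mish(1.0))   # ~0.8651
print(mish(-1.0))  # ~-0.3034
```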
@AlexeyAB Was wondering if you're considering adding Mish? In that regard, based on the above screenshot, there is a mistake in the derivative formula which I have updated in my paper. The link to the updated paper and additional results are in my repository here - https://github.com/digantamisra98/Mish Thanks!
@digantamisra98
I added the MISH activation.
Use activation=mish
in [convolutional] layers.
Please, check that implementation is correct: https://github.com/AlexeyAB/darknet/commit/bf8ea4183dc265ac17f7c9d939dc815269f0a213
Thanks! So the error was in delta?
Just checked, the implementation is correct. Thanks. Yes, the error was a typo in the delta term.
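For context, a sketch of the forward value and the gradient that scales the delta term, derived from f(x) = x * tanh(softplus(x)) with softplus'(x) = sigmoid(x); this is just the math, not the exact darknet code:

```python
import math

def mish_forward(x):
    sp = math.log1p(math.exp(x))           # softplus(x), no overflow guard here
    return x * math.tanh(sp)

def mish_gradient(x):
    # f'(x) = tanh(sp) + x * (1 - tanh(sp)^2) * sigmoid(x)
    sp = math.log1p(math.exp(x))
    tsp = math.tanh(sp)
    sigmoid = 1.0 / (1.0 + math.exp(-x))   # derivative of softplus(x)
    return tsp + x * (1.0 - tsp * tsp) * sigmoid

# in backprop the incoming delta is scaled by this gradient: delta *= mish_gradient(x)
```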
now training.
usually get nan, do i need to adjust the learning rate schedule?
burn_in=2000
learning_rate=0.1
policy=poly
power=4
max_batches=1600000
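For reference, the poly policy decays the learning rate from its initial value toward zero over max_batches, with burn_in ramping it up first. A rough sketch of that schedule, under the assumption that darknet applies power to both the burn-in ramp and the decay:

```python
def poly_lr(batch_num, learning_rate=0.1, max_batches=1600000, power=4, burn_in=2000):
    # burn-in ramp for the first iterations, then polynomial decay toward zero
    if batch_num < burn_in:
        return learning_rate * (batch_num / burn_in) ** power
    return learning_rate * (1.0 - batch_num / max_batches) ** power

print(poly_lr(1000))  # 0.00625  (still ramping up)
print(poly_lr(4000))  # ~0.099   (essentially the full initial 0.1)
```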
@WongKinYiu What model are you trying to train?
densenet-based model. i'll try a darknet-based model first.
@WongKinYiu Here's the DenseNet code I used to test Mish - https://github.com/digantamisra98/Mish/blob/master/Notebooks/cifar-10-DenseNet121_Mish.ipynb Usually I'd advise a lower learning rate, probably 1e-3 (0.01 - 0.001). Can you maybe share the log or the code to reproduce the NaN?
darknet-based model also gets nan.
[net]
batch=128
subdivisions=1
height=224
width=224
channels=3
momentum=0.9
decay=0.0005
max_crop=320
learning_rate=0.1
policy=poly
power=4
max_batches=1600000
[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
padding=1
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=mish
[avgpool]
[convolutional]
filters=1000
size=1
stride=1
pad=1
activation=linear
[softmax]
groups=1
@WongKinYiu Do you train on ILSVRC2012? How many iterations did you train before NaN occurred? Do you use GPU=1 CUDNN=1?
Try to use
[maxpool]
size=2
stride=1
instead of
[maxpool]
size=2
stride=2
padding=1
yes, i train on ILSVRC2012. after about 3k~5k iterations i get nan. i use gpu=1 and cudnn=1.
@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?
both 0.1 and 0.05 get nan. i'll try other settings after i finish my breakfast. thanks for your advice.
@WongKinYiu I'll go through Mish's implementation again in a while and confirm if everything is alright and also give it a try myself to validate the same. Thanks for raising the issue.
https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-551143150 gets nan after 300 iterations.
@digantamisra98 thanks, i'll also have time to check the implementation after 11/17.
@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?
0.1, 0.05, 0.01, 0.001 all get nan.
get nan after 10 iterations with 0.001.
@AlexeyAB @nhaxin204 @WongKinYiu I went through the implementation again and I believe it's correct. Though I'm going to implement it in practice this week (sorry, I was a bit occupied last week). I will also ask the fast.ai forum folks to check the implementation to make sure I'm not missing anything.
@AlexeyAB This is Tom's response regarding the NaN issue:
That implementation is not at all numerically stable. All the exps quickly lead to overflow and hence NaN. Should be possible to adapt either the Eigen based implementation from tensorflow contrib or my mostly pure C++ implementation (mostly as it's using the PyTorch dispatch/templating but is otherwise standard C++). The TF one is probably slightly more stable given handling of both underflow and overflow but will require more adaptation to remove the Eigen dependency.
Here is the TensorFlow Addons commit for Mish - https://github.com/tensorflow/addons/commit/093cdfa85d334cbe19a37624c33198f3140109ed
Tom's CUDA implementation - https://github.com/thomasbrandon/mish-cuda
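To illustrate the instability Tom describes, a small sketch in plain Python; the threshold value of 20 is an assumption here, mirroring the kind of cutoff that thresholded softplus implementations use:

```python
import math

def softplus_naive(x):
    # ln(1 + e^x); e^x overflows for large x (~88 in float32, ~709 in float64)
    return math.log1p(math.exp(x))

def softplus_stable(x, threshold=20.0):
    # above the threshold, ln(1 + e^x) is numerically equal to x, so skip exp() entirely
    return x if x > threshold else math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus_stable(x))

print(softplus_stable(800.0))   # 800.0
# softplus_naive(800.0)         # OverflowError here; inf -> NaN during float32 training
print(mish(800.0))              # 800.0 (tanh saturates at 1)
```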
@digantamisra98
So: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31
if (in < threshold) new_in = log( expf(in) );
else new_in= in;
gradient = in * ((1 - tanh(new_in)*tanh(new_in)) * (1 - exp(-new_in))) + tanh(new_in);
delta = delta * gradient;
@digantamisra98 @WongKinYiu @LukeAI @nhaxin204 I fixed MISH to this implementation: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31
thanks, the behavior is normal now.
@WongKinYiu Can you post the log?
@deimsdeutsch
Also, as you can see, there are 3 different Mish implementations; even the forward Mish functions are different, so we can't convert a model between TF (2 thresholds) <-> PyTorch (1 threshold) <-> MXNet (0 thresholds):
your implementation: https://github.com/digantamisra98/Mish/blob/master/Mish/Torch/functional.py#L16
output = input * tanh(log( exp(input) + 1 ))
Pytorch: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L17-L20
if (input < THRESHOLD) output = input * tanh(log( exp(input) ))
else output = input * tanh(input)
TensorFlow:
if (input > THRESHOLD) output = input * tanh( input ); // too large
else if (input < -THRESHOLD) output = input * tanh( exp(input) ); // too small
else output = input * tanh(log( exp(input) + 1 ));
How do you think we should solve this issue?
@AlexeyAB
The answer to that question is discussed in the Google Brain's paper of Swish - https://arxiv.org/pdf/1710.05941v1.pdf
Simple Fully Connected Conv Net.
"To observe how increasing the number of layers in a network while maintaining other parameters constant affect the test accuracy, fully connected networks of varying depths on MNIST, with each layer having 500 neurons were trained. Residual Connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization along with a dropout of 25%. The network is optimized using SGD on a batch size of 128, and for fair comparison, the same learning rates for each activation function was maintained."
No Residual Connections were used.
Currently benchmarking on ImageNet.
I'll take a look again and get back to you on that.
@digantamisra98
- Currently benchmarking on ImageNet.
What model do you use for benchmarking on ImageNet? Is it ResNet-101, EfficientNet or MixNet?
@AlexeyAB as of right now, I'm running ResNet-56, MobileNet v2, NasNet-A, SEResNet-50 and ShuffleNet v1. Currently ShuffleNet is in progress.
@AlexeyAB This is the response that Tom provided in regards to your question of varying thresholds:
The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don't think that the differences are big enough that there's any strong reason to use the same implementation, so I think you could just as well use the TF logic for Mish in PyTorch. They just both come from borrowing the relevant softplus implementation. I'm not sure the differences make a real impact and wouldn't prevent converting models, at least not between TF and PyTorch. As noted, this would also potentially apply to any model using softplus. If there is indeed no threshold in MXNet then that may cause issues. But this also depends on other details. There may be other handling of non-finite values that would mitigate issues. It also depends on the datatypes used. In general this is mostly an issue for 16-bit floats. Though I think I did see some issues with 32-bit floats, that was with the quite unstable calculation involving multiple exponents rather than the symbolically derived gradient calculation.
Oh, and I've responded to that post. I'd also note that you pointed to the Autograd implementation, which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance, so it should generally be preferred. The one issue is support in older PyTorch versions. It should be fine in PyTorch 1.2 and 1.3 (though I've mostly tested in 1.3). I think it should probably also work in 1.1, and maybe even 1.0, in which case it should always be fine, as I can't imagine you'd want to support pre-1.0 anymore. But the JIT version should probably be preferred unless older support is key. I'd also note that I don't think my CUDA version will work pre-1.2, so the JIT version should offer equivalent performance and version support. I just need to run a few extra tests on the JIT version and then will likely update the repo to indicate the JIT version should be preferred.
@digantamisra98
The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don't think that the differences are big enough that there's any strong reason to use the same implementation, so I think you could just as well use the TF logic for Mish in PyTorch.
I think there are obviously 2 different MISH functions, so weights trained in PyTorch can't be used in TF and vice versa. Not only due to 1 vs 2 thresholds, but also due to different formulas - actually different activation functions:
output = input * tanh(log( exp(input) ))
output = input * tanh(log( exp(input) + 1 ));
Pytorch: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L17-L20
if (input < THRESHOLD) output = input * tanh(log( exp(input) ))
else output = input * tanh(input)
TensorFlow:
if (input > THRESHOLD) output = input * tanh( input ); // too large
else if (input < -THRESHOLD) output = input * tanh( exp(input) ); // too small
else output = input * tanh(log( exp(input) + 1 ));
Also about thresholds:
The threshold in PyTorch doesn't change the activation function much, so it is fine:
output = input * tanh( input ) ~= output = input * tanh(log( exp(input) ))
But the second threshold in TF changes the activation function noticeably, at least in some range (maybe if input < -THRESHOLD it doesn't matter):
tanh( exp(x) ) != tanh(ln( exp(x) + 1 ))
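A quick numeric check of that parenthetical, i.e. how much the two expressions differ below the negative threshold (20 is used as an assumed threshold value, plain Python):

```python
import math

x = -20.0  # below an assumed threshold of 20
a = math.tanh(math.exp(x))               # TF branch for very small inputs
b = math.tanh(math.log1p(math.exp(x)))   # un-thresholded formula
print(a, b, abs(a - b))                  # difference is on the order of 1e-18
```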
I'd also note that you pointed to the Autograd implementation which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance so should generally be preferred.
Which link are you talking about?
@AlexeyAB agreed on the different functional implementations. I guess I'll do a PR to change it. Thanks for clarifying, I completely missed that. Regarding the comparison between JIT and Autograd, I've asked him for further clarification.
@AlexeyAB hello, i trained my model using the 11/13 repo, and tested it on the ilsvrc 2012 val set.
type | top-1 | top-5 |
---|---|---|
leaky | 70.9 | 90.2 |
swish | 71.7 | 90.8 |
mish | 70.9 | 90.2 |
i see there were some fixes to mish yesterday. do i need to retrain the mish model using the latest repo?
@AlexeyAB the PyTorch implementation by Tom has log1p instead of log, which computes log(x+1) and not just log(x). @WongKinYiu can you point me to the repository with the code for training ImageNet? What model did you use?
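To spell out the log vs log1p difference being flagged here, a minimal check in plain Python:

```python
import math

x = 0.5
print(math.log(math.exp(x)))    # 0.5       -> log(exp(x)) is just x, giving x * tanh(x)
print(math.log1p(math.exp(x)))  # 0.9741... -> softplus(x) = ln(1 + e^x), what Mish uses
```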
@digantamisra98 Yes, you are right. I implemented MISH with 2 thresholds as in TF.
@WongKinYiu Try to train with the latest code. I fixed MISH today: https://github.com/AlexeyAB/darknet/commit/b9ca5ec781291f01174d6b496a9c3ebc59303c1f
@WongKinYiu are you working on training ImageNet currently using the updated Mish implementation?
@digantamisra98
no, i'm training res2netlite72. i'll retrain the mish model and report results. it will take 1~2 weeks.
@AlexeyAB Mish performs well after being fixed: https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-557495489.
Model | Activation | Top-1 | Top-5 |
---|---|---|---|
PeleeNet | LReLU | 70.7 | 90.0 |
PeleeNet | Swish | 71.5 (+0.8) | 90.7 (+0.7) |
PeleeNet | Mish | 71.4 (+0.7) | 90.4 (+0.4) |
CSPPeleeNet | LReLU | 70.9 | 90.2 |
CSPPeleeNet | Swish | 71.7 (+0.8) | 90.8 (+0.6) |
CSPPeleeNet | Mish | 71.2 (+0.3) | 90.3 (+0.1) |
CSPResNeXt-50 | LReLU | 77.9 | 94.0 |
CSPResNeXt-50 | Mish | 78.9 (+1.0) | 94.5 (+0.5) |
CSPResNeXt-50 | Swish | 64.5 (-13.4) | 86.0 (-8.0) |
@WongKinYiu thanks for sharing the result. These are single runs right?
@WongKinYiu Thanks! It seems MISH sometimes isn't better than SWISH on ImageNet, especially on large models.
@digantamisra98 Are there other MISH tests for ImageNet? Or for recurrent networks (RNN, LSTM, convolutional-LSTM ...) and Transformer/BERT models? As I see ImageNet and Transformer are in the roadmap: https://github.com/digantamisra98/Mish#future-work-coming-soon
@digantamisra98 Yes, I cannot afford multiple runs currently. But in my previous experiments, darknet always gives me similar results if I use the same machine and the same settings for training.
@AlexeyAB In my experiments, Mish is more stable than Swish. For ResNeXt-based models, swish can drop more than 10% accuracy on ImageNet.
@AlexeyAB yes, there are a lot of further benchmarks coming in the next updated version of the paper by January. I'm still working on it. Though I'm interested to see the statistical stability and the CI scores of Swish, because so far in my results Mish is much more stable than Swish, as @WongKinYiu just pointed out. So I won't rely completely on single-run tests.
What's important to see is the consistency, which is simply the standard deviation of the results. I'm running those benchmarks on more standard models like ResNets, SENet, etc. Additionally, I am doing intensive mathematical tests to show it's better than Swish, not just based on empirical benchmark scores.
@WongKinYiu Can you show the result for CSPResNeXt-50 + Swish?
@AlexeyAB
updated https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-565692356 I trained twice; both runs get 6x% top-1 acc.
@digantamisra98 In your opinion, what is the reason NaN appears during training? Are you planning to somehow modify the MISH activation to avoid NaN? Or is using a threshold the best solution?
@AlexeyAB I was experiencing NaNs at the very early stage of experimentation. When I adopted the PyTorch Softplus implementation, which has a threshold for the Softplus function, I didn't experience NaN errors anymore. I'm guessing there's some numerical stability issue with Softplus. I'm working with a few colleagues to optimize Mish to address that problem.
@AlexeyAB additionally, I strongly believe there is something we haven't yet figured out about information propagation with increasing network depth. This is a strong point, since Mish consistently outperforms Swish as depth increases. I'll plot the residuals of these models and see what the underlying driver affecting performance is.
@WongKinYiu I need some help with ImageNet. Is there some way I can discuss it with you? Thanks!
Mish: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
https://arxiv.org/abs/1908.08681
https://github.com/digantamisra98/Mish