The concept of non-linearity in a neural network is introduced by an activation function, which serves an integral role in the training and performance of the network. Over the years of research, many activation functions have been proposed; however, only a few are widely used across most applications, including ReLU (Rectified Linear Unit), TanH (hyperbolic tangent), Sigmoid, Leaky ReLU and Swish. In this work, a novel neural activation function called Mish is proposed. The experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze Excite Net-18 for CIFAR-100 classification, the network with Mish had an increase in Top-1 test accuracy of 0.494% and 1.671% compared to the same network with Swish and ReLU respectively. Its similarity to Swish, its performance boost, and its simplicity of implementation make it easy for researchers and developers to use Mish in their neural network models.
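For reference, here is a minimal sketch of the function from the paper, f(x) = x * tanh(ln(1 + e^x)), in plain Python (naive, with no guard against exp overflow, which becomes relevant later in this thread):

```python
import math

def mish(x):
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    return x * math.tanh(math.log1p(math.exp(x)))

print(mish(1.0))   # ~0.8651
print(mish(-1.0))  # ~-0.3034
```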
@AlexeyAB Was wondering if you're considering adding Mish? In that regard, based on the above screenshot, there is a mistake in the derivative formula which I have updated in my paper. The link to the updated paper and additional results are in my repository here - https://github.com/digantamisra98/Mish Thanks!
@digantamisra98
I added the MISH activation.
Use activation=mish
in [convolutional] layers.
Please, check that implementation is correct: https://github.com/AlexeyAB/darknet/commit/bf8ea4183dc265ac17f7c9d939dc815269f0a213
Thanks! So the error was in delta?
Just checked, the implementation is correct. Thanks. Yes, the error was a typo in the delta term.
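For context, a sketch of the forward value and the gradient that scales the delta term, derived from f(x) = x * tanh(softplus(x)) with softplus'(x) = sigmoid(x); this is just the math, not the exact darknet code:

```python
import math

def mish_forward(x):
    sp = math.log1p(math.exp(x))           # softplus(x), no overflow guard here
    return x * math.tanh(sp)

def mish_gradient(x):
    # f'(x) = tanh(sp) + x * (1 - tanh(sp)^2) * sigmoid(x)
    sp = math.log1p(math.exp(x))
    tsp = math.tanh(sp)
    sigmoid = 1.0 / (1.0 + math.exp(-x))   # derivative of softplus(x)
    return tsp + x * (1.0 - tsp * tsp) * sigmoid

# in backprop the incoming delta is scaled by this gradient: delta *= mish_gradient(x)
```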
now training.
usually get nan, do i need to adjust the learning rate schedule?
burn_in=2000
learning_rate=0.1
policy=poly
power=4
max_batches=1600000
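For reference, the poly policy decays the learning rate from its initial value toward zero over max_batches, with burn_in ramping it up first. A rough sketch of that schedule, under the assumption that darknet applies power to both the burn-in ramp and the decay:

```python
def poly_lr(batch_num, learning_rate=0.1, max_batches=1600000, power=4, burn_in=2000):
    # burn-in ramp for the first iterations, then polynomial decay toward zero
    if batch_num < burn_in:
        return learning_rate * (batch_num / burn_in) ** power
    return learning_rate * (1.0 - batch_num / max_batches) ** power

print(poly_lr(1000))  # 0.00625  (still ramping up)
print(poly_lr(4000))  # ~0.099   (essentially the full initial 0.1)
```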
@WongKinYiu What model are you trying to train?
densenet-based model. i'll try a darknet-based model first.
@WongKinYiu Here's the DenseNet code I used to test Mish - https://github.com/digantamisra98/Mish/blob/master/Notebooks/cifar-10-DenseNet121_Mish.ipynb Usually I'd advise a lower learning rate, probably 1e-3 (0.01 - 0.001). Can you maybe share the log or the code to reproduce the NaN?
darknet-based model also gets nan.
[net]
batch=128
subdivisions=1
height=224
width=224
channels=3
momentum=0.9
decay=0.0005
max_crop=320
learning_rate=0.1
policy=poly
power=4
max_batches=1600000
[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish
[maxpool]
size=2
stride=2
padding=1
[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=mish
[avgpool]
[convolutional]
filters=1000
size=1
stride=1
pad=1
activation=linear
[softmax]
groups=1
@WongKinYiu Do you train on ILSVRC2012? How many iterations did you train before NaN occurred? Do you use GPU=1 CUDNN=1?
Try to use
[maxpool]
size=2
stride=1
instead of
[maxpool]
size=2
stride=2
padding=1
yes, i train on ILSVRC2012. after about 3k~5k iterations i get nan. i use gpu=1 and cudnn=1.
@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?
both 0.1 and 0.05 get nan. i'll try other settings after i finish my breakfast. thanks for your advice.
@WongKinYiu I'll go through Mish's implementation again in a while and confirm if everything is alright and also give it a try myself to validate the same. Thanks for raising the issue.
https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-551143150 gets nan after 300 iterations.
@digantamisra98 thanks, i'll also have time to check the implementation after 11/17.
@WongKinYiu Did you try to use initial learning_rate=0.01 or 0.001 ?
0.1, 0.05, 0.01, 0.001 all get nan.
get nan after 10 iterations with 0.001.
@AlexeyAB @nhaxin204 @WongKinYiu I went through the implementation again and I believe it's correct. Though I'm going to implement it in practice this week (sorry, I was a bit occupied last week). I will also ask the fast.ai forum folks to check the implementation to make sure I'm not missing anything.
@AlexeyAB This is Tom's response regarding the NaN issue:
That implementation is not at all numerically stable. All the exps quickly lead to overflow and hence NaN. Should be possible to adapt either the Eigen based implementation from tensorflow contrib or my mostly pure C++ implementation (mostly as it's using the PyTorch dispatch/templating but is otherwise standard C++). The TF one is probably slightly more stable given handling of both underflow and overflow but will require more adaptation to remove the Eigen dependency.
Here is the TensorFlow Addons commit for Mish - https://github.com/tensorflow/addons/commit/093cdfa85d334cbe19a37624c33198f3140109ed
Tom's CUDA implementation - https://github.com/thomasbrandon/mish-cuda
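To illustrate the instability Tom describes, a small sketch in plain Python; the threshold value of 20 is an assumption here, mirroring the kind of cutoff that thresholded softplus implementations use:

```python
import math

def softplus_naive(x):
    # ln(1 + e^x); e^x overflows for large x (~88 in float32, ~709 in float64)
    return math.log1p(math.exp(x))

def softplus_stable(x, threshold=20.0):
    # above the threshold, ln(1 + e^x) is numerically equal to x, so skip exp() entirely
    return x if x > threshold else math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus_stable(x))

print(softplus_stable(800.0))   # 800.0
# softplus_naive(800.0)         # OverflowError here; inf -> NaN during float32 training
print(mish(800.0))              # 800.0 (tanh saturates at 1)
```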
@digantamisra98
So: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31
if (in < threshold) new_in = log( expf(in) );
else new_in= in;
gradient = in * ((1 - tanh(new_in)*tanh(new_in)) * (1 - exp(-new_in))) + tanh(new_in);
delta = delta * gradient;
@digantamisra98 @WongKinYiu @LukeAI @nhaxin204 I fixed MISH to this implementation: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L26-L31
thanks, the behavior is normal now.
@WongKinYiu Can you post the log?
@deimsdeutsch
Also, as you can see, there are 3 different Mish implementations; even the forward Mish functions are different, so we can't convert a model between TF (2 thresholds) <-> PyTorch (1 threshold) <-> MXNet (0 thresholds):
your implementation: https://github.com/digantamisra98/Mish/blob/master/Mish/Torch/functional.py#L16
output = input * tanh(log( exp(input) + 1 ))
Pytorch: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L17-L20
if (input < THRESHOLD) output = input * tanh(log( exp(input) ))
else output = input * tanh(input)
TensorFlow:
if (input > THRESHOLD) output = input * tanh( input ); // too large
else if (input < -THRESHOLD) output = input * tanh( exp(input) ); // too small
else output = input * tanh(log( exp(input) + 1 ));
How do you think we should solve this issue?
@AlexeyAB
The answer to that question is discussed in the Google Brain's paper of Swish - https://arxiv.org/pdf/1710.05941v1.pdf
Simple Fully Connected Conv Net.
"To observe how increasing the number of layers in a network while maintaining other parameters constant affect the test accuracy, fully connected networks of varying depths on MNIST, with each layer having 500 neurons were trained. Residual Connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization along with a dropout of 25%. The network is optimized using SGD on a batch size of 128, and for fair comparison, the same learning rates for each activation function was maintained."
No Residual Connections were used.
Currently benchmarking on ImageNet.
I'll take a look again and get back to you on that.
@digantamisra98
- Currently benchmarking on ImageNet.
What model do you use for benchmarking on ImageNet? Is it ResNet-101, EfficientNet or MixNet?
@AlexeyAB as of right now, I'm running ResNet-56, MobileNet v2, NasNet-A, SEResNet-50 and ShuffleNet v1. Currently ShuffleNet is in progress.
@AlexeyAB This is the response that Tom provided in regards to your question of varying thresholds:
The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don't think that the differences are big enough that there's any strong reason to use the same implementation, so I think you could just as well use the TF logic for Mish in PyTorch. They just both come from borrowing the relevant softplus implementation. I'm not sure the differences make a real impact and wouldn't prevent converting models, at least not between TF and PyTorch. As noted, this would also potentially apply to any model using softplus. If there is indeed no threshold in MXNet then that may cause issues. But this also depends on other details. There may be other handling of non-finite values that would mitigate issues. It also depends on the datatypes used. In general this is mostly an issue for 16-bit floats. Though I think I did see some issues with 32-bit floats, that was with the quite unstable calculation involving multiple exponents rather than the symbolically derived gradient calculation.
Oh, and I've responded to that post. I'd also note that you pointed to the Autograd implementation, which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance, so it should generally be preferred. The one issue is support in older PyTorch versions. It should be fine in PyTorch 1.2 and 1.3 (though I've mostly tested in 1.3). I think it should probably also work in 1.1, and maybe even 1.0, in which case it should always be fine, as I can't imagine you'd want to support pre-1.0 anymore. But the JIT version should probably be preferred unless older support is key. I'd also note that I don't think my CUDA version will work pre-1.2, so the JIT version should offer equivalent performance and version support. I just need to run a few extra tests on the JIT version and then will likely update the repo to indicate the JIT version should be preferred.
@digantamisra98
The differences between PyTorch and TF reflect slight differences in their implementations of softplus. The single threshold in my CUDA version reflects the PyTorch logic. I don't think that the differences are big enough that there's any strong reason to use the same implementation, so I think you could just as well use the TF logic for Mish in PyTorch.
I think there are obviously 2 different MISH functions, so weights trained in PyTorch can't be used in TF and vice versa. Not only due to 1 vs 2 thresholds, but also due to different formulas - actually different activation functions:
output = input * tanh(log( exp(input) ))
output = input * tanh(log( exp(input) + 1 ));
Pytorch: https://github.com/thomasbrandon/mish-cuda/blob/master/csrc/mish.h#L17-L20
if (input < THRESHOLD) output = input * tanh(log( exp(input) ))
else output = input * tanh(input)
TensorFlow:
if (input > THRESHOLD) output = input * tanh( input ); // too large
else if (input < -THRESHOLD) output = input * tanh( exp(input) ); // too small
else output = input * tanh(log( exp(input) + 1 ));
Also about thresholds:
The threshold in PyTorch doesn't change the activation function much, so it is fine:
output = input * tanh( input ) ~= output = input * tanh(log( exp(input) ))
But the second threshold in TF changes the activation function noticeably, at least in some range (maybe if input < -THRESHOLD it doesn't matter):
tanh( exp(x) ) != tanh(ln( exp(x) + 1 ))
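A quick numeric check of that parenthetical, i.e. how much the two expressions differ below the negative threshold (20 is used as an assumed threshold value, plain Python):

```python
import math

x = -20.0  # below an assumed threshold of 20
a = math.tanh(math.exp(x))               # TF branch for very small inputs
b = math.tanh(math.log1p(math.exp(x)))   # un-thresholded formula
print(a, b, abs(a - b))                  # difference is on the order of 1e-18
```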
I'd also note that you pointed to the Autograd implementation which should reduce memory usage but will result in lower performance. The JIT version combines both the lower memory usage and better performance so should generally be preferred.
Which link are you talking about?
@AlexeyAB agreed on the different functional implementations. I guess I'll do a PR to change it. Thanks for clarifying, I completely missed that. Regarding the comparison between JIT and Autograd, I've asked him for further clarification.
@AlexeyAB hello, i trained my model using the 11/13 repo, and tested it on the ilsvrc 2012 val set.
type | top-1 | top-5 |
---|---|---|
leaky | 70.9 | 90.2 |
swish | 71.7 | 90.8 |
mish | 70.9 | 90.2 |
i see there were some fixes to mish yesterday. do i need to retrain the mish model using the latest repo?
@AlexeyAB the PyTorch implementation by Tom has log1p instead of log, which computes log(x+1) and not just log(x). @WongKinYiu can you point me to the repository with the code for training ImageNet? What model did you use?
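To spell out the log vs log1p difference being flagged here, a minimal check in plain Python:

```python
import math

x = 0.5
print(math.log(math.exp(x)))    # 0.5       -> log(exp(x)) is just x, giving x * tanh(x)
print(math.log1p(math.exp(x)))  # 0.9741... -> softplus(x) = ln(1 + e^x), what Mish uses
```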
@digantamisra98 Yes, you are right. I implemented MISH with 2 thresholds as in TF.
@WongKinYiu Try to train with the latest code. I fixed MISH today: https://github.com/AlexeyAB/darknet/commit/b9ca5ec781291f01174d6b496a9c3ebc59303c1f
@WongKinYiu are you working on training ImageNet currently using the updated Mish implementation?
@digantamisra98
no, i'm training res2netlite72. i'll retrain the mish model and report results. it will take 1~2 weeks.
@AlexeyAB Mish performs well after being fixed: https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-557495489.
Model | Activation | Top-1 | Top-5 |
---|---|---|---|
PeleeNet | LReLU | 70.7 | 90.0 |
PeleeNet | Swish | 71.5 (+0.8) | 90.7 (+0.7) |
PeleeNet | Mish | 71.4 (+0.7) | 90.4 (+0.4) |
CSPPeleeNet | LReLU | 70.9 | 90.2 |
CSPPeleeNet | Swish | 71.7 (+0.8) | 90.8 (+0.6) |
CSPPeleeNet | Mish | 71.2 (+0.3) | 90.3 (+0.1) |
CSPResNeXt-50 | LReLU | 77.9 | 94.0 |
CSPResNeXt-50 | Mish | 78.9 (+1.0) | 94.5 (+0.5) |
CSPResNeXt-50 | Swish | 64.5 (-13.4) | 86.0 (-8.0) |
@WongKinYiu thanks for sharing the result. These are single runs right?
@WongKinYiu Thanks! It seems MISH sometimes isn't better than SWISH on ImageNet, especially on large models.
@digantamisra98 Are there other MISH tests for ImageNet? Or for recurrent networks (RNN, LSTM, convolutional-LSTM ...) and Transformer/BERT models? As I see ImageNet and Transformer are in the roadmap: https://github.com/digantamisra98/Mish#future-work-coming-soon
@digantamisra98 Yes, I cannot afford multiple runs currently. But in my previous experiments, darknet always gives me similar results if I use the same machine and the same settings for training.
@AlexeyAB In my experiments, Mish is more stable than Swish. For ResNeXt-based models, swish can drop more than 10% accuracy on ImageNet.
@AlexeyAB yes, there are a lot of further benchmarks coming in the next updated version of the paper by January. I'm still working on it. Though I'm interested to see the statistical stability and the CI scores of Swish, because so far in my results Mish is much more stable than Swish, as @WongKinYiu just pointed out. So I won't rely completely on single-run tests.
What's important to see is the consistency, which is simply the standard deviation of the results. I'm running those benchmarks on more standard models like ResNets, SENet, etc. Additionally, I am doing intensive mathematical tests to show it's better than Swish, not just based on empirical benchmark scores.
@WongKinYiu Can you show the result for CSPResNeXt-50 + Swish?
@AlexeyAB
updated https://github.com/AlexeyAB/darknet/issues/3994#issuecomment-565692356 I trained twice; both runs get 6x% top-1 acc.
@digantamisra98 In your opinion, what is the reason NaN appears during training? Are you planning to somehow modify the MISH activation to avoid NaN? Or is using a threshold the best solution?
@AlexeyAB I was experiencing NaNs at the very early stage of experimentation. When I adopted the PyTorch Softplus implementation, which has a threshold for the Softplus function, I didn't experience NaN errors anymore. I'm guessing there's some numerical stability issue with Softplus. I'm working with a few colleagues to optimize Mish to address that problem.
@AlexeyAB additionally, I strongly believe there is something we haven't yet figured out about information propagation with increasing network depth. This is a strong point, since Mish consistently outperforms Swish as depth increases. I'll plot the residuals of these models and see what the underlying driver affecting performance is.
@WongKinYiu I need some help with ImageNet. Is there some way I can discuss it with you? Thanks!
Mish: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
https://arxiv.org/abs/1908.08681
https://github.com/digantamisra98/Mish