Thanks for raising the issue. I haven't read the paper yet, but during my early testing of Mish I had also tested TanhExp (well before this paper came out, around March 2019) along with 4 other variants. TanhExp never made the cut in my broad suite of experiments; it passed muster in some, but it was highly inconsistent and unstable, and also led to NaN losses in many cases.
Also, I am not sure whether their benchmarks use Mish's original function definition (x * tanh(ln(1 + e^x))) or the optimized version (not to be confused with the "fast" variant), which is the default: x * tanh(softplus(x)). The reason this matters is that the latter (optimized) version is much more stable and usually gives better results than the baseline Mish definition. This is because of the optimized Softplus layer with its threshold at 20 to avoid gradient overflow, which the baseline definition doesn't offer.
Furthermore, regarding the derivative calculation time: based on just counting ops, TanhExp should be faster; however, Mish-CUDA is much faster than TanhExp in both forward and backward passes, as it has been heavily optimized. I will do some manual testing in the upcoming days and profile their function to validate their results. Hopefully the authors can put up a repository for ease of use. Thanks for the notebook!
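To illustrate the stability difference concretely, here is a minimal PyTorch sketch (not the exact code from this repository, just the two definitions side by side):

```python
import torch
import torch.nn.functional as F

def mish_baseline(x):
    # baseline definition: x * tanh(ln(1 + e^x)); exp() overflows for large x
    return x * torch.tanh(torch.log(1 + torch.exp(x)))

def mish_softplus(x):
    # default definition: F.softplus reverts to the identity above its
    # threshold (20 by default), which keeps the gradient finite for large x
    return x * torch.tanh(F.softplus(x))

for fn in (mish_baseline, mish_softplus):
    x = torch.tensor([100.0], requires_grad=True)
    fn(x).sum().backward()
    print(fn.__name__, x.grad)  # baseline: nan (gradient overflow), softplus: 1.0
```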
I am quite surprised that TanhExp could become inconsistent and unstable. Are you sure you tested TanhExp as formulated in the paper, y = x * tanh(exp(x))? Anyway, on a cursory look it seems similar to Swish/Mish near zero and in the negative range, and almost like ReLU in the positive range (above 1), without being piecewise. It is possible for it to be unstable in actual use, and maybe a deeper analysis could show us why. I would also appreciate any pointers or benchmarks where TanhExp showed instability.
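For reference, here is the definition I mean as a quick PyTorch sketch, just to pin the formula down:

```python
import torch

def tanhexp(x):
    # TanhExp as formulated in the paper: f(x) = x * tanh(exp(x))
    return x * torch.tanh(torch.exp(x))

x = torch.linspace(-3, 3, 7)
print(tanhexp(x))
# tanh(exp(x)) saturates to 1 very quickly, so above ~1 the function is
# nearly the identity (ReLU-like), while negative inputs stay bounded
# below, similar to Swish/Mish
```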
I am surprised that the softplus version actually behaves differently. I checked, and you are right. But in that case I would like to suggest that you make it clear that the version with softplus isn't merely an alternate formulation based on an identity (which should yield the same results in principle) but a different, more practical and functional formulation (which may yield different results). This is especially true since, while PyTorch's softplus has a default threshold that constrains it at 20, it is unclear whether the same behavior holds for TensorFlow's softplus (it doesn't seem so to me). Seen this way, it also seems clearer to me why Mish is more stable: the quantity fed to tanh (softplus) grows only linearly, whereas TanhExp feeds tanh an exp(x) that is unbounded and blows up in the positive range (which is also what makes it behave more like ReLU there).
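To make the PyTorch side of that concrete (the TensorFlow comment in the code is just my reading of the public API, not a claim about its internals):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([10.0, 25.0, 100.0])

# torch.nn.functional.softplus(input, beta=1, threshold=20): above the
# threshold it reverts to the linear function, so exp() is never evaluated
# on large inputs and the gradient there is exactly 1.
print(F.softplus(x))  # ~[10., 25., 100.]: linear above the threshold

# tf.math.softplus(features) exposes no comparable threshold argument,
# so whether it applies an equivalent internal cutoff is the open question here.
```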
In the case of Mish-CUDA, do you think that even a potential CUDA-enabled TanhExp would still be slower than Mish-CUDA?
I would love to see TanhExp properly included as one of the potential comparisons, with in-depth analyses like the other activations in your paper, but I imagine that takes a lot of time and effort to conduct, and more effort still to write up in such a structured form. One of the strong points of Mish is that you took the time to conduct such extensive testing and publish it, which makes comparing other functions with Mish very easy. Even the authors of TanhExp have not conducted testing that extensive against Mish in the limited regime where they argue it is better than Mish. (To their credit, they have done rigorous testing and validation in their own right, better than most other papers out there; it just pales in comparison to what you have done.)
These are the initial variants I had come up with when designing Mish. One of them is TanhExp, though I'm unsure which one, and I unfortunately didn't keep a record of its results. I might be wrong, but as far as I recall it was a bit inconsistent and more sensitive to hyperparameter changes compared to Mish. I can do a bit of numerical analysis on TanhExp in the coming weeks if time permits.
Yeah, right, it is different. Softplus adds more stability compared to the baseline implementation. Though both are the same based on the formula, the upper constraint of 20 is applied purely for stability, to avoid gradient overflow. This is present in both PyTorch and TensorFlow (see TensorFlow Addons for Mish).
Not sure, since I don't have much experience with CUDA unfortunately.
I am not sure if I'd be able to do that anytime soon, since the tests take a lot of time and resources. Just for context, all the results in this repository/paper took me over 10 months. The reason for such extensive testing was to make sure, within my capabilities, that Mish performs fairly well or better on most common tasks in deep learning. There are many more tasks I could have compared on, like GANs, object detection, and segmentation, but I currently don't have the resources to do so. Additionally, I won't pass any judgement on TanhExp's credibility, but I would like to see a repository so that it becomes easier to validate their results. But thanks for your appreciation of my work.
Through the Mish landing page I found this paper: TanhExp: A Smooth Activation Function with High Convergence Speed for Lightweight Neural Networks (Paper)
Long story short, it is an adjustment of Mish. In many of the evaluations they have done, they show consistent improvements over Mish across many benchmarks.
Whether this improvement holds on other datasets (e.g. ImageNet) or other networks remains to be seen. The authors have definitely not done testing as extensive as the author of Mish has.
I do have a problem with the following section: "4.6 Comparison of the Computation Speed".
Here they argue that their TanhExp function is faster than Mish in its original form (forward pass), its 1st derivative (backward pass), and its 2nd derivative.
However, I would like to point out that, at least for the original form, they used the original formulation of Mish and not the (potentially) faster Mish discussed here before.
I have conducted my own tests and found that, for CPU calculations, fast Mish is faster than TanhExp. Unfortunately, I have not been able to test backward passes, but to the best of my knowledge the 1st and 2nd derivatives do not differ between the original formulation and the faster one (this is because they are based on an identity, and the speed gain comes from CPU-friendly formulas).
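To spell that identity out (my notation: $\mathrm{sp}(x) = \ln(1 + e^x)$, i.e. softplus, and $\sigma$ the sigmoid), the first derivative of Mish is

$$\operatorname{Mish}'(x) = \tanh(\mathrm{sp}(x)) + x\,\sigma(x)\,\operatorname{sech}^2(\mathrm{sp}(x)),$$

which is the same expression however $\mathrm{sp}(x)$ is evaluated, so only the forward-pass arithmetic changes between the two formulations.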
On GPU, TanhExp was faster than both versions of Mish.
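For reference, the kind of CPU comparison I ran looks roughly like the sketch below (not the exact code from the attached notebook; the function names are just illustrative):

```python
import timeit
import torch
import torch.nn.functional as F

def mish_original(x):
    return x * torch.tanh(torch.log(1 + torch.exp(x)))

def mish_softplus(x):
    return x * torch.tanh(F.softplus(x))

def tanhexp(x):
    return x * torch.tanh(torch.exp(x))

x = torch.randn(1_000_000)  # 1M-element CPU tensor

for fn in (mish_original, mish_softplus, tanhexp):
    t = timeit.timeit(lambda: fn(x), number=100)
    print(f"{fn.__name__}: {t:.3f}s for 100 forward passes")
```

(For GPU timings the same loop needs torch.cuda.synchronize() around the timed region, since CUDA kernels launch asynchronously.)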
Since they do not have a separate GitHub page, I decided to at least let you know so that you could raise an objection if need be. To be fair, I tried thinking of a faster formulation of TanhExp, but it is beyond my knowledge.
To be frank, I am interested in the future of TanhExp, since it shows consistent gains in accuracy, training speed, and stability over Mish within the limited scope they tested.
However, activation functions come and go quickly (remember Swish?), and for simplicity's sake ReLU isn't going anywhere. Also, previous investigations into this issue showed that minor differences in activation function speed can be eclipsed by other bottlenecks.
Anyway, here is my notebook: TanhExp.zip