Mish and alternatives, including my own

PallHaraldsson commented 4 years ago

First, congratulation on your Mish paper accepted.

I've been thinking about activation functions, and how they can be improved, what properties are most important for a long time, and also inspired by your paper (I noticed the days old update), and e.g. recent SharkFin (my own unpublished idea has some similarities).

I was just thinking if I could ask you some questions, I'm not sure if this is the right place.

It seems like almost any function can do (all except polynominal proven to work for shallow wide networks, and that restriction eliminated by deep-narrow networks):

Universal Approximation with Deep Narrow Networks https://arxiv.org/pdf/1905.08539.pdf

we show that the class of neural networks of arbitrary depth, width n+m+2, and activation function ρ, is dense [..] This covers every activation function possible to use in practice, and also includes polynomial activation functions [..]

We refer to these as enhanced neurons. [..]

4.2. Square Model Lemma 4.3 One layer of two enhanced neurons, with square activation function, may exactly represent the multiplication function (x,y)→xy on R^2 [..]

Remark 4.7 Lemma 4.5 is key to the proof of Proposition 4.6. It was fortunate that the reciprocal function may be approximated by a network of width two - note that even if Proposition 4.6 were already known, it would have required a network of width three. It remains unclear whether an arbitrary-depth network of width two, with square activation function, is dense in C(K). [..]

Remark 4.8 Note that allowing a single extra neuron in each layer would remove the need for the trick with the reciprocal, as it would allow [..] Doing so would dramatically reduce the depth of the network. We are thus paying a heavy price in depth in order to reduce the width by a single neuron.

[I've yet to read much further, but this seems very important.]

So my reading is all but identity function can work as activation function (when more than one hidden layer), and a network no wider than 4 (or 5 better, optimal?) can approximate all four elementary arithmetic operation. Could also approximate e.g. sine and exponential with such narrow network (through Fourier-theorem), I think.

Have you looked at Capsule networks, and deep variant? https://arxiv.org/pdf/1904.09546.pdf

May main worry is that by thinking about better activation functions (of if there can be one best), I'm wasting my time, with them and/or (traditional) backpropagation going away, with the thousand-brain theory and more. Capsule networks seem similar, with a voting mechanism. It at least has ReLU in the first layer (I didn't look at more in detail).

Have you looked at BERT and variants? I assume they could use your functions, or do you know if there are exceptions, making GELU better for them? I'm thinking it's maybe just ignorance (or authors extending, want to change one thing at a time):

https://arxiv.org/pdf/1909.11942.pdf

The backbone of the ALBERT architecture is similar to BERT in that it uses a transformer encoder (Vaswani et al., 2017) with GELU nonlinearities

The Reversible Residual Network: Backpropagation Without Storing Activations https://arxiv.org/pdf/1707.04585.pdf

digantamisra98 commented 4 years ago

Well technically mostly any non-linear function can serve the purpose of being used as an activation function in a neural network, however, for some reasons for which I have only hypotheses and not conclusive evidence, they don't work and cause collapse. This was something I experienced during testing of candidate functions while investigating into similar functions as that of Mish. However, there is a lot yet to understand and uncover. Thank you for the appreciation and for the papers, I haven't gone through some of them yet. So I'll be able to comment better once I do have complete knowledge of those papers.

PallHaraldsson commented 4 years ago

I saw somewhere ANY (continuous) function can work as an activation function assuming, if and only if, it's not a polynominal (i.e. including linear). This is probably theoretical, and in practice some are better than others for training, and some might not even converge.

When I see you report Mesh (best) e.g. 88.15% on CIFAR-10, using SqueezeNet, and other reporting 99% accuracy on that dataset with GELU[2], I think it's about the model, not really saying anything about which function is the best:

https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V1_ICML.pdf

I think this is the code for the paper, and the exact line to change (and lines above): https://github.com/openai/image-gpt/blob/0cb1f2b9a619dc11a4c8b535b8031895aae3ad70/src/model.py#L28

I haven't yet really gotten my hands dirty, running my own, or your or others activation functions, and I was going to ask you how you come up with your ideas, and how you test. It seems you find some model online, and just change the activation function (to be fair to the others you test against). With the code above really short, I should really just try it out, or do you have any advice/links to other better, when you're starting out?

Here's the background (this is brand new, https://openai.com/blog/image-gpt/

PallHaraldsson commented 4 years ago

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence https://arxiv.org/pdf/2001.07685.pdf Despite its simplicity, we show that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks, including 94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 – just 4 labels per class [..] FixMatch, a simpler semi-supervised learning algorithm that achieves state-of-the-art results across many datasets. We also show how Fix-Match can begin to bridge the gap between low-label semi-supervised learning and few-shot learning—or even clustering: we obtain surprisingly-high accuracy with just one label per class.

The Isometric Neural Networks here are very intriguing, I would have though the traditional convolutional (image pyramid) good, not isometric:

https://arxiv.org/pdf/1909.03205.pdf

we introduce isometric architectures. By design, these architectures maintain constant internal resolution throughout all layers (except for the last global pooling). [..] The primary advantage is that using low resolution allows to significantly reduce activation memory foot-print [..] In this paper, we have developed a new way of disentangling neural network internal resolution from the input resolution, and have shown that input resolution plays a fairly minor role in the overall model accuracy. Instead, it is the internal resolution of the hidden layers that are responsible for the impact of resolution multiplier.

Older (beaten by papers above): Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results https://arxiv.org/pdf/1703.01780.pdf

PallHaraldsson commented 4 years ago

Very intriguing I just found (not just for NLP, but also presumably for image models etc. as with the GPT-2 paper above applying to images):

REFORMER: THE EFFICIENT TRANSFORMER https://arxiv.org/pdf/2001.04451.pdf

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L) [..] The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences. [on memory:] Reversible layers, first introduced in Gomez et al. (2017), enable storing only a single copy of activations in the whole model, so the N factor disappears.

digantamisra98 commented 4 years ago

Wait, I didn't get this argument: When I see you report Mesh (best) e.g. 88.15% on CIFAR-10, using SqueezeNet, and other reporting 99% accuracy on that dataset with GELU[2], I think it's about the model, not really saying anything about which function is the best:

They're running with different hyper-parameters. All my experiments for comparing activations were run with same parameters which do conclude, confirm and validate which activation is best for that problem using that model.

I think MFT is more conclusive to understand activations than just empirical evidences of them performing better in certain models for certain datasets.

PallHaraldsson commented 4 years ago

I was trying to say the same thing as you are, the higher numbers are due to something else. I thought others use different model (same dataset), could be that and/or hyperparameters.

I would be curious about seeing if some of these other high number should go even higher by just switching to Mish.

digantamisra98 commented 4 years ago

Technically it should. But again I haven't done a hyper-parameter search to find the best parameters to maximize gain for Mish. This is mostly incorporating the use of weight initialization and learning rate. Some initialization strategies really favor ReLU's performance.

PallHaraldsson commented 4 years ago

Technically it should.

Substituting one function for another (e.g. ReLU or sigmoid) assumes the Universal Approximation Theory still holds. I think it does and we shouldn't be too worried. However, even with ReLU or any other function, knowing you could approximate in theory, doesn't mean your network would converge. I'm not sure that's an actual problem for e.g. Mish over some other more established function.

The proof that ReLU works (i.e. is universal), applies to 1 hidden layer, and just to some functions. For some others 2+ hidden needed:

ReLU Deep Neural Networks and Linear Finite Elements https://arxiv.org/pdf/1807.03973.pdf

We theoretically establish that at least 2 hidden layers are needed in a ReLU DNN to represent any linear finite element functions in Ω⊆R^d when d≥2. Consequently, for d= 2,3 which are often encountered in scientific and engineering computing, the minimal number of two hidden layers are necessary and sufficient for any CPWL function to be represented by a ReLU DNN.Then we include a detailed account on how a general CPWL in R^d can be represented by a ReLU DNN with at most log2(d+ 1) hidden layers [..]

we find that the number of neurons that are needed for a DNN to represent a CPWL on m-subdomains can be as large as O(d2^m*m!)! In order to obtain DNN representation with fewer numbers of neurons, in this paper, we consider a special class of CPWL functions, namely the linear finite element (LFE) functions

This paper I just found seems also interesting:

EFFICIENT BI-DIRECTIONAL VERIFICATION OF RELU NETWORKS VIA QUADRATIC PROGRAMMING https://openreview.net/pdf?id=Bkx4AJSFvB

The proof that more hidden layers (than 1 or I guess where 2+ are needed) are ok, is is by having all those additional layers approximate linear activation functions only, i.e. disable those extra layers, so it's important to be able to approximate the identity function perfectly. And it seems curvy functions are worse for that, e.g. your Mish.

I doubt a network with other functions, e.g. Mish needs more layers, but it seems like those extra layers would need to be wider, even exponentially wider? If not absent.

Also interesting:

Provable approximation properties for deep neural networks https://www.sciencedirect.com/science/article/pii/S1063520316300033

4.1. Constructing a wavelet frame from rectifier units

In this section we show how Rectified Linear Units (ReLU) can be used to obtain a wavelet frame of L2(R^n). The construction of wavelets from rectifiers is fairly simple, and we refer to results from Section 3.2 to show [..]

The rectifier activation function is defined on R as

rect(x) = max(0, x)

We define a trapezoid-shaped function by

t(x) = rect(x +3) - rect(x + 1) - rect(x - 1) + rect(x -3)

We then define the scaling function

PallHaraldsson commented 4 years ago

Intriguing (and much better than XNOR-Net, and other binary), curious how would work with your function:

https://arxiv.org/pdf/1605.04711.pdf

We introduce ternary weight networks (TWNs) - neural networks with weights constrained to +1, 0 and -1. [..] Besides, a threshold-based ternary function is optimized to get an approximated solution which can be fast and easily computed. TWNs have stronger expressive abilities than recently proposed binary precision counterparts and are more effective than the latter. Meanwhile, TWNs achieve up to 16× or 32× model compression rate and need fewer multiplications compared with the full precision counterparts. Benchmarks on MNIST, CIFAR-10, and large scale ImageNet datasets show that the performance of TWNs is only slightly worse than the full precision counterparts but outperforms the analogous binary precision counterparts a lot. [..] Table. 2 summarizes the overall benchmark results with the previous settings. On the small scale datasets (MNIST and CIFAR-10), TWNs achieve state-of-the-art performance as FPWNs, while beat BPWNs a lot.

PallHaraldsson commented 4 years ago

It's a bit puzzling that ternary (or binary) networks are this good. It's not obvious that they are still universal approximators, e.g. if you wanted +2 or +3 as a weight you would need 2 or 3 neurons etc. instead of one, but e.g. weight 0.5 or 1.5 seems problematic (maybe deeper networks help with this?). For a recurrent (Turing-complete) network they would be, but not for feed-forward?

However, binary is faster then full precision (and I guess ternary too, half as fast as this?): https://arxiv.org/pdf/1602.02830.pdf

we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. [..] So far, to the best of our knowledge, no work has succeeded in binarizing weights and neurons, at the inference phase and the entire training phase of a deep network. [..] This was previously done for the weights by Courbariaux et al. (2015), but our BNNs extend this to the activations.

digantamisra98 commented 4 years ago

Consider reading ReActNet from ECCV 20. Also this discussion is very broad and since there isn't any issue regarding the repository, I'd like to close this thread and rather would request you to move it to the dedicated Fast.AI forum for Mish. Link - https://forums.fast.ai/t/meet-mish-new-activation-function-possible-successor-to-relu/53299

digantamisra98 / Mish

Mish and alternatives, including my own #35