FluxML / NNlib.jl

Neural Network primitives with multiple backends

I want to implement some activation functions, e.g. NLReLU and ALReLU, maybe even combine (and CReLU?) #528

Open PallHaraldsson opened 10 months ago

PallHaraldsson commented 10 months ago

Motivation and description

Is there such a thing as too many activation functions? I haven't decided which of these to propose, or whether all of them:

Natural Logarithm rescaled ReLU (NLReLU) https://arxiv.org/pdf/1908.03682.pdf https://arxiv.org/pdf/1808.07325v1.pdf

That one seems interesting. It seems to use the natural logarithm, but I was wondering whether log2 would be faster. If you want compatibility with the paper, then the former; or both?

NLReLU could alleviate the “dying ReLU” and vanishing gradient problems to some extent, thereby improving convergence of neural networks. Experiments show that NLReLU networks can still converge when the learning rate is increased, whereas ReLU networks cannot converge (see Section IV-A).

In their tests it's more accurate than swish, ReLU and most of the others they compare against, in every case (and usually better than SELU or, if not, comparable). It's likely faster than swish and many of the good ones, though slower than ReLU (and its leaky variants), but maybe faster than (most) others?
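
For concreteness, a naive sketch of NLReLU as I read the paper (f(x) = ln(β·max(0, x) + 1)); the names and the default β = 1 are my own choices, not anything from NNlib:

nlrelu(x, β = oftype(float(x), 1)) = log1p(β * max(zero(x), x))     # ln(β·max(0, x) + 1), via log1p for accuracy near 0
nlrelu2(x, β = oftype(float(x), 1)) = log2(1 + β * max(zero(x), x)) # the log2 variant I was wondering about (same shape, rescaled)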

ALReLU: A different approach on Leaky ReLU activation function to improve Neural Networks Performance https://arxiv.org/pdf/2012.07564.pdf

It should be really fast, since it's very simple.

The main difference is that ALReLU has smaller value and derivative. In a theoretically perspective, the ALReLU has also the properties of QReLU, such as the advantage of superposition and entanglement principles. However, this claim is only in theory and it is not proven in this paper
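
For reference, a minimal sketch of ALReLU as I understand the paper (the negative branch is α·|x| instead of LeakyReLU's α·x, with α = 0.01); the name and style are mine:

alrelu(x, α = oftype(float(x), 0.01)) = ifelse(x > 0, float(x), α * abs(x))  # α·|x| for x ≤ 0, x otherwise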

QReLU and m-QReLU: Two novel quantum activation functions to aid medical diagnostics https://arxiv.org/pdf/2010.08031.pdf (see also QIS-Net)

The m-QReLU also satisfies the entanglement principle being derived via the tensor outer product of the solutions from the QReLU. Thus, a quantum-based blend of both superposition and entanglement principles mathematically leads the QReLU and the m-QReLU to obviate the ‘dying ReLU’ problem intrinsically. As shown in (1) and (2), although the two proposed AFs are quantistic in nature, both QReLU and m-QReLU can be run on classical hardware, such as central processing unit (CPU), graphics processing unit (GPU) and tensor processing unit (TPU), the latter being the type of runtime used in this study via Google Colab (http://colab.research.google.com/) to perform the required evaluation on the datasets described in 2.1. [..] Specifically, when using the novel quantum AFs (QReLU and m-QReLU) as compared to the traditional ReLU and Leaky ReLU AFs, the gold standard AFs in DNNs, the following percentage increases in ACC, precision, recall/sensitivity and F1-score were noted:

• An increase of 55.32% in ACC and sensitivity/recall via m-QReLU as compared to ReLU and by 37.74% with respect to Leaky ReLU, thus avoiding the ‘dying ReLU’ problem when the CNN was evaluated on the Kaggle Spiral Drawings benchmark dataset (Table 5)

• [more such claims]

While about 5x slower than ReLU ("Computational time: It includes both training and evaluation."), and slower than everything they compare against such as VLReLU and CReLU, which is strange since they seem very simple, they always beat ReLU and the leaky variants they test against. Though CReLU beats them in at least one case, being 0.8/0.47 = 70% more accurate than ReLU and 9.6% better than m-QReLU.

CNNs do not seem a priority (for me), so I'm not sure CReLU is worth worrying about (I also didn't fully understand it): Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units https://arxiv.org/pdf/1603.05201v2.pdf
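
As far as I can tell, CReLU is not an elementwise function: it concatenates relu(x) and relu(-x) along the channel dimension, doubling the number of channels. A rough sketch (the dims default is my assumption, not a fixed convention):

using NNlib: relu
crelu(x::AbstractArray; dims = ndims(x) - 1) = cat(relu.(x), relu.(-x); dims = dims)  # e.g. dims = 3 for WHCN conv arrays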

I'm tempted to think the latest, e.g. ALReLU, is the best, but I like the logarithmic idea, and maybe it's just overlooked by people. I'm even thinking of combining the two: ALReLU with a logarithm on the left, or on both left and right. I suppose you would not merge it if it's non-standard...

I'm looking for the best activation function myself, and I don't want to confuse people here (nor myself) with too many choices. Any idea which of those (or others) are missing here? I also want feature parity with other non-Julia libraries, though only for the very commonly used ones, but could go with ones that are strictly better if any of the above fit the bill.

m-arcsinh seems outdated, and its better replacement is here: hyper-sinh: An Accurate and Reliable Function from Shallow to Deep Learning in TensorFlow and Keras

https://arxiv.org/pdf/2011.07661.pdf

This paper presents the ’hyper-sinh’, a variation of the m-arcsinh activation function suitable for Deep Learning (DL)-based algorithms for supervised learning, such as Convolutional Neural Networks (CNN). [..] This function is evaluated with respect to gold standard activation functions, demonstrating its overall competitive accuracy and reliability for both image and text classification.

2.3 hyper-sinh: A reliable activation function for both shallow and deep learning For a function to be generalised as an activation function for both shallow and deep neural networks, such as FC-NN and CNN respectively, it has to be able to 1) avoid common gradient-related issues, such as the vanishing and exploding gradient problems and 2) improve discrimination of input data into target classes via a transfer mechanism of appropriate non-linearity and extended range. Considering the two-fold value of m-arcsinh (Parisi, 2020) as a kernel and activation function concurrently for optimal separating hyperplane- and shallow neural network-based classifiers, it was leveraged as the baseline function to be extended for it to scale to deep neural networks. Thus, although the arcsinh was swapped with its original sinh version, and the square root function was replaced with the basic cubic function, their weights were kept as per the m-arcsinh (Parisi, 2020) equivalent implementation, i.e., whilst 1/3 now multiplies sinh, 1/4 is now multiplying the cubic function.

Thus, the novel function hyper-sinh was devised to be suitable for both shallow and deep neural networks concurrently by leveraging a weighted interaction effect between the hyperbolic nature of the hyperbolic sine function (’sinh’) for positive values and the non-linear characteristic of the cubic function for negative values and 0 (zero), more suitable for deep neural networks, whilst retaining their appropriateness for shallow learning too, thus satisfying both the above-mentioned requirements:

hyper-sinh(x) = sinh(x) × 1/3, if x > 0

hyper-sinh(x) = x^3 × 1/4, if x ≤ 0

The derivative of hyper-sinh for positive values can be expressed as: cosh(x) × 1/3

The derivative of hyper-sinh for negative values and 0 (zero) can be expressed as: x^2 × 3/4

5. Conclusion: hyper-sinh was proven an accurate and robust activation function for shallow and deep neural networks for image and text classification, thus being a new gold standard that scales well for FC-NN and CNN.
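
A direct transcription of the hyper-sinh definition above into Julia (the names are mine, and this is not tuned the way NNlib's built-in activations are):

hypersinh(x) = ifelse(x > 0, sinh(x) / 3, x^3 / 4)          # sinh(x)·1/3 for x > 0, x^3·1/4 otherwise
hypersinh_deriv(x) = ifelse(x > 0, cosh(x) / 3, 3x^2 / 4)   # the derivative quoted above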

Feel free to try my modified version:

function similar_to_hypersinh(x)  # Note: NOT a drop-in replacement for hyper-sinh
  a = x^3/4  # same as hyper-sinh's non-positive branch, but applied up to the crossover point below
  # The other branch drops exp(-x) from sinh(x)/3 = (exp(x) - exp(-x))/6 and uses x/3 in its place,
  # an approximation that's mostly good above the crossover, so the switch is not at the otherwise-correct 0.0:
  x > 1.0499088908562 ? (exp(x) - x*inv(3))*inv(6) : a
end

https://arxiv.org/pdf/2108.07908.pdf

Possible Implementation

Some (or all) of the above are very simple; I can implement them, at least naively. But since I want to tune some of them, I don't want to spend too much time on them if they won't be accepted at all.

mcabbott commented 10 months ago

You can of course use any Julia function you like. The main reason to add it to NNlib would be that enough other people want to use it. How widely have these ones been adopted -- perhaps being offered by other frameworks is one source of evidence?

Re performance, the activation functions have been tweaked possibly more than necessary. The big things were avoiding max in favour of ifelse (to be less careful about some Inf, NaN, -0.0 cases) and lower-precision tanh, and avoiding Float64. Perhaps best discussed in a PR for one particular function.
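
For example, the general pattern is roughly this (a sketch of the style, not the exact NNlib definitions):

myrelu(x) = ifelse(x < 0, zero(x), x)  # ifelse instead of max(0, x)
myleakyrelu(x, a = oftype(float(x), 0.01)) = ifelse(x > 0, float(x), a * float(x))  # slope converted to x's float type, so Float32 stays Float32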

They all have gradient rules, which have not been tweaked as carefully. We don't have a nice way to deal with gradient rules for functions with a parameter, e.g. Dense(2=>3, leakyrelu) hits the rule, but Dense(2=>3, x -> leakyrelu(x, 0.1)) uses Zygote's dual number thing.
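
To illustrate the two paths (assuming Flux for Dense; just a sketch of the two call forms, not a recommendation):

using Flux  # Dense, plus leakyrelu re-exported from NNlib
m1 = Dense(2 => 3, leakyrelu)               # broadcast of a known function: hits NNlib's gradient rule
m2 = Dense(2 => 3, x -> leakyrelu(x, 0.1))  # anonymous closure for the parameterised case: falls back to Zygote's generic (dual-number) broadcast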