PallHaraldsson opened 10 months ago
You can of course use any Julia function you like. The main reason to add it to NNlib would be that enough other people want to use it. How widely have these ones been adopted -- perhaps being offered by other frameworks is one source of evidence?
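For anyone reading along, a minimal sketch of that point, using Flux's `Dense` and a made-up function name:

```julia
using Flux

# Any scalar function works as an activation; it doesn't need to live in NNlib.
# `myact` is a made-up example (roughly NLReLU-shaped), not an NNlib function.
myact(x) = log1p(max(zero(x), x))

model = Dense(2 => 3, myact)
model(randn(Float32, 2))  # works; NNlib only adds tuned definitions + gradient rules
```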
Re performance, the activation functions have been tweaked possibly more than necessary. The big things were avoiding `max` in favour of `ifelse` (to be less careful about some `Inf`, `NaN`, `-0.0` cases), a lower-precision `tanh`, and avoiding Float64. Perhaps best discussed in a PR for one particular function.
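To illustrate the `ifelse` point, a minimal sketch (not NNlib's exact code):

```julia
# `max` must handle NaN and signed-zero corner cases; `ifelse` skips that
# care, which tends to vectorise better under broadcasting.
# The `oftype(x / 1, 0.01)` idiom keeps Float32 inputs from promoting to Float64.
leaky_max(x, a=oftype(x / 1, 0.01))    = max(a * x, x)
leaky_ifelse(x, a=oftype(x / 1, 0.01)) = ifelse(x > 0, x, a * x)
```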
They all have gradient rules, which have not been tweaked as carefully. We don't have a nice way to deal with gradient rules for functions with a parameter: e.g. `Dense(2=>3, leakyrelu)` hits the rule, but `Dense(2=>3, x -> leakyrelu(x, 0.1))` uses Zygote's dual-number thing.
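For context, a hand-written sketch of what such a rule looks like via ChainRulesCore (NNlib's real rules are generated differently, but the shape is the same):

```julia
using ChainRulesCore

# Toy activation, for illustration only.
myleakyrelu(x, a=oftype(x / 1, 0.01)) = ifelse(x > 0, x, a * x)

# A rule covering only the one-argument method:
function ChainRulesCore.rrule(::typeof(myleakyrelu), x::Real)
    a = oftype(x / 1, 0.01)
    myleakyrelu_pullback(dy) = (NoTangent(), dy * ifelse(x > 0, one(x), a))
    return myleakyrelu(x), myleakyrelu_pullback
end

# `Dense(2 => 3, myleakyrelu)` hits this rule; the closure
# `x -> myleakyrelu(x, 0.1)` has no rule of its own, so Zygote falls
# back to its dual-number broadcast machinery.
```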
Motivation and description
Is there such a thing as too many activation functions? I haven't decided which of these to propose, or whether all of them:
Natural Logarithm rescaled ReLU (NLReLU) https://arxiv.org/pdf/1908.03682.pdf https://arxiv.org/pdf/1808.07325v1.pdf
That one seems interesting. It uses the natural logarithm, but I was thinking: wouldn't log2 be faster? If you want compatibility with the paper, then the former; or both? (A sketch follows below.)
In their tests it is more accurate than swish, ReLU and most functions they compare against, consistently (and usually better than SELU or, if not, comparable). It's likely faster than swish and many of the good ones, though slower than ReLU (and its leaky variants), but maybe faster than most others?
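If useful, a minimal sketch of NLReLU as I read the paper (the exact form, f(x) = log(β·max(0, x) + 1), is my reading and worth double-checking):

```julia
# NLReLU (arXiv:1908.03682), as I read it: log(β·max(0, x) + 1), β = 1 by default.
# log1p is more accurate near zero; `one(x)`/`zero(x)` keep Float32 in Float32.
nlrelu(x, β=one(x)) = log1p(β * max(zero(x), x))

# The log2 variant mentioned above differs only by the constant factor 1/log(2):
nlrelu2(x, β=one(x)) = log2(β * max(zero(x), x) + one(x))
```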
ALReLU: A different approach on Leaky ReLU activation function to improve Neural Networks Performance https://arxiv.org/pdf/2012.07564.pdf
Should be really fast, since it's very simple.
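A hedged sketch, assuming the paper defines it as |αx| on the negative side with α = 0.01 (please check against the paper):

```julia
# ALReLU (arXiv:2012.07564), as I read it: LeakyReLU with the negative
# branch replaced by its absolute value, |αx|, α = 0.01 by default.
alrelu(x, α=oftype(x / 1, 0.01)) = ifelse(x > 0, x, abs(α * x))
```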
QReLU and m-QReLU: Two novel quantum activation functions to aid medical diagnostics https://arxiv.org/pdf/2010.08031.pdf (see also QIS-Net)
While about 5x slower than ReLU (the paper's "computational time" includes both training and evaluation), and slower than all the functions they compare to, such as VLReLU and CReLU, which is strange since they seem very simple, they always beat ReLU and the leaky variants they test against. Though CReLU beats QReLU in at least one case, being 0.8/0.47 = 70% more accurate than ReLU, and 9.6% better than m-QReLU.
CNNs do not seem a priority (for me), so I'm not sure CReLU is worth worrying about (I also didn't understand it): Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units https://arxiv.org/pdf/1603.05201v2.pdf
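For what it's worth, CReLU is not a scalar function: it concatenates the positive and negative phases along the feature dimension, doubling the output width, so it wouldn't slot into NNlib's one-in-one-out activation list anyway. A sketch:

```julia
using NNlib: relu

# CReLU (arXiv:1603.05201): keep both phases of the pre-activation,
# concatenating along the feature/channel dimension. The output has
# twice as many features, so it changes the shapes of following layers.
# For WHCN conv arrays you would use dims=3 instead.
crelu(x::AbstractArray; dims=1) = cat(relu.(x), relu.(-x); dims)

crelu([1.0, -2.0])  # == [1.0, 0.0, 0.0, 2.0]
```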
I'm tempted to think the latest, e.g. ALReLU, is the best, but I like the logarithmic idea, and maybe it's just overlooked by people. I'm even thinking of combining the two: ALReLU with the logarithm on the left (negative) side, or on both sides. I suppose you would not merge it if it's non-standard...
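Purely to make that concrete, one possible reading of the combination (my own invention, not from any paper):

```julia
# Hypothetical blend, just to illustrate "ALReLU with logarithm on the left":
# ALReLU's |αx| negative branch, compressed through NLReLU's log1p.
# Not from any paper; the exact form is a guess at what is meant above.
nl_alrelu(x, α=oftype(x / 1, 0.01), β=one(x)) =
    ifelse(x > 0, x, log1p(β * abs(α * x)))
```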
I'm looking for the best activation function myself, and I don't want to confuse people here (nor myself) with a choice of too many. Any idea which of those (or others) are missing here? I also want feature parity with other non-Julia libraries, though only for the very commonly used ones, but could go with functions that are strictly better, if any of the above fit the bill.
m-arcsinh seems outdated, and its better replacement is here: hyper-sinh: An Accurate and Reliable Function from Shallow to Deep Learning in TensorFlow and Keras, https://arxiv.org/pdf/2011.07661.pdf

> This paper presents the 'hyper-sinh', a variation of the m-arcsinh activation function suitable for Deep Learning (DL)-based algorithms for supervised learning, such as Convolutional Neural Networks (CNN). [..] This function is evaluated with respect to gold standard activation functions, demonstrating its overall competitive accuracy and reliability for both image and text classification.
Feel free to try my modified version:
https://arxiv.org/pdf/2108.07908.pdf
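A hedged sketch of hyper-sinh as I read the paper (the piecewise form below is my reading and should be verified against the paper):

```julia
# hyper-sinh (arXiv:2011.07661), as I read it:
# sinh(x)/3 for positive inputs, x^3/4 otherwise.
hypersinh(x) = ifelse(x > 0, sinh(x) / 3, x^3 / 4)
```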
Possible Implementation
Some (or all) of the above are very simple; I can implement them, at least naively. But since I'd want to tune some of them, I don't want to spend too much time on them if they won't be accepted at all.