Closed: tranhungnghiep closed this issue 5 years ago
@tranhungnghiep Hi. All the benchmarks in this repository effectively pertain to SiLU, since the beta value for Swish is set to 1, which reduces the formula to input * sigmoid(input), i.e. exactly the SiLU formula.
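As a quick illustration (a minimal sketch, not code from this repository), the two formulas coincide numerically when beta = 1:

```python
# Minimal sketch: Swish with beta = 1 is the same function as SiLU,
# namely x * sigmoid(x). Function names here are illustrative only.
import tensorflow as tf

def swish(x, beta=1.0):
    # General Swish: x * sigmoid(beta * x)
    return x * tf.sigmoid(beta * x)

def silu(x):
    # SiLU: x * sigmoid(x)
    return x * tf.sigmoid(x)

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])
print(swish(x, beta=1.0).numpy())  # identical to the line below
print(silu(x).numpy())
```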
@digantamisra98 I see. Then it makes even more sense to state clearly that the baseline is SiLU. Note that Swish differs in that the scale factor beta is learnable. You might find it interesting to compare against two baselines: SiLU and Swish with a learnable beta. I'm very curious about what contributes to a good activation function and why.
@tranhungnghiep Agreed, but a small correction to your statement: according to the Swish paper, the baseline Swish is the case beta = 1. The paper states this explicitly when showing how to implement it in TensorFlow, namely x * tf.sigmoid(x).
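For the second baseline suggested above, a learnable-beta Swish could be written as a small Keras layer. This is only a sketch; the layer name and initializer choice are illustrative, not taken from the paper or this repository:

```python
# Sketch of Swish with a trainable scalar beta. With beta initialized to 1,
# the layer starts out identical to SiLU / baseline Swish.
import tensorflow as tf

class LearnableSwish(tf.keras.layers.Layer):
    def build(self, input_shape):
        # Single trainable scalar beta shared across the whole input.
        self.beta = self.add_weight(
            name="beta", shape=(), initializer="ones", trainable=True)

    def call(self, inputs):
        return inputs * tf.sigmoid(self.beta * inputs)

# Usage: drop it in wherever the fixed activation would otherwise go.
layer = LearnableSwish()
y = layer(tf.constant([-1.0, 0.0, 1.0]))
```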
Although Swish by the Google researchers is more popular, it is the result of NAS and not very well justified. In fact, Swish is a modification of the earlier activation function SiLU (sigmoid-weighted linear unit), which shows better results than Swish on several benchmarks and is conceptually simpler. Moreover, SiLU is more theoretically justified; for example, the idea of self-regularization was proposed for it. Please see the following paper.
Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118, 2017.