digantamisra98 / Mish

Official Repository for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]
https://www.bmvc2020-conference.com/assets/papers/0928.pdf

SiLU is a more relevant baseline than Swish #8

Closed · tranhungnghiep closed this issue 5 years ago

tranhungnghiep commented 5 years ago

Although Swish, proposed by Google researchers, is more popular, it is the result of neural architecture search (NAS) and is not very well justified. Swish is actually a modification of the earlier activation function SiLU (Sigmoid-weighted Linear Unit), which shows better results than Swish on several benchmarks and is conceptually simpler. Moreover, SiLU is better justified theoretically; for example, the idea of self-regularization was proposed for it. Please see the following paper.

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118, 2017.

digantamisra98 commented 5 years ago

@tranhungnghiep Hi. All the benchmarks in this repository already pertain to SiLU, since the beta value for Swish is fixed at 1, which reduces the formula to input * sigmoid(input), i.e. exactly SiLU.
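
For concreteness, here is a minimal sketch (in PyTorch; the function names are illustrative, not code from this repository) showing that Swish with beta fixed at 1 is identical to SiLU:

```python
import torch

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x), with beta a fixed scalar here (illustrative)
    return x * torch.sigmoid(beta * x)

def silu(x):
    # SiLU: x * sigmoid(x)
    return x * torch.sigmoid(x)

x = torch.randn(5)
# With beta = 1 the two formulas coincide
assert torch.allclose(swish(x, beta=1.0), silu(x))
```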

tranhungnghiep commented 5 years ago

@digantamisra98 I see. Then it makes even more sense to state clearly that the baseline is SiLU. Note that Swish differs in that the scale factor beta is learnable. You may find it interesting to compare against two baselines: SiLU and Swish with learnable beta (see the sketch below). I'm very curious about what contributes to a good activation function and why.
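
A minimal sketch of that second baseline, Swish with a learnable beta, could look like the following (an illustrative PyTorch module under the assumption of a standard training loop, not code from this repository):

```python
import torch
import torch.nn as nn

class SwishLearnable(nn.Module):
    """Swish activation with a trainable scale factor beta (illustrative)."""
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        # beta is registered as a parameter, so the optimizer updates it
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, x):
        # x * sigmoid(beta * x); with beta = 1 this reduces to SiLU
        return x * torch.sigmoid(self.beta * x)
```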

digantamisra98 commented 5 years ago

@tranhungnghiep Agreed, but a small correction to your statement: according to the Swish paper, the baseline Swish is the case beta = 1. The paper states this explicitly when it shows how to implement it in TensorFlow as x * tf.sigmoid(x).