Gaussian Error Linear Units (GELUs). GELU, by University of California, Berkeley, and Toyota Technological Institute at Chicago. 2016 arXiv, over 600 citations.
Activation Unit, Image Classification, POS Tagging, Phone Recognition.
GELU (μ=0, σ=1) vs ReLU vs ELU.
Specifically, the neuron input $x$ can be multiplied by $m\sim \text{Bernoulli}(\Phi(x))$, where $\Phi(x) = P(X\leq x)$ with $X\sim N(0, 1)$ is the cumulative distribution function of the standard normal distribution.
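To see why this stochastic gate leads to a deterministic activation, note that the expected output is $E[m\cdot x] = x\,\Phi(x)$. The following is a minimal NumPy/SciPy sketch (the function name is mine, not from the paper) that checks this by simulation:

```python
import numpy as np
from scipy.stats import norm  # norm.cdf is the standard normal CDF, i.e. Phi(x)

rng = np.random.default_rng(0)

def stochastic_gelu_gate(x, n_samples=200_000):
    """Multiply x by m ~ Bernoulli(Phi(x)) and average over many draws."""
    p = norm.cdf(x)                   # keep probability Phi(x)
    m = rng.random(n_samples) < p     # Bernoulli(Phi(x)) samples
    return np.mean(m * x)

x = 0.5
print(stochastic_gelu_gate(x))   # Monte Carlo estimate of E[m * x]
print(x * norm.cdf(x))           # deterministic expectation x * Phi(x), i.e. GELU(x)
```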
Since the cumulative distribution function of a Gaussian is often computed with the error function, the Gaussian Error Linear Unit (GELU) is defined as:

$$\text{GELU}(x) = x\,\Phi(x) = x\cdot\frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

The above equation can be approximated as:

$$0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right]\right)$$

or as:

$$x\,\sigma(1.702x)$$

if greater feedforward speed is worth the cost of exactness.
Different Gaussians $N(\mu, \sigma)$ could be used for the CDF, but in this paper the standard $N(0,1)$ is used.
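As a concrete reference, here is a small NumPy sketch (my own, not the paper's code) of the exact GELU alongside the tanh and sigmoid approximations above:

```python
import math
import numpy as np

def gelu_exact(x):
    # GELU(x) = x * Phi(x) = x * 0.5 * (1 + erf(x / sqrt(2)))
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def gelu_sigmoid(x):
    # sigmoid approximation: x * sigmoid(1.702 * x), cheaper but less exact
    return x / (1.0 + np.exp(-1.702 * x))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_tanh(x) - gelu_exact(x))))     # stays very close to the exact form
print(np.max(np.abs(gelu_sigmoid(x) - gelu_exact(x))))  # coarser, trades accuracy for speed
```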
MNIST Classification Results. The left plot shows loss curves without Dropout; the right, curves with a Dropout rate of 0.5.
GELU tends to have the lowest median training log loss both with and without Dropout.
MNIST Autoencoding Results.
GELU accommodates different learning rates and significantly outperforms the other nonlinearities.
For TIMIT frame classification (phone recognition), after five runs per setting, the median test error chosen at the lowest validation error is 29.3% for the GELU, 29.5% for the ReLU, and 29.6% for the ELU.
CIFAR-10 Results.
CIFAR-100 Results.
On CIFAR-10, the GELU ultimately obtains a median error rate of 7.89%, the ReLU obtains 8.16%, and the ELU obtains 8.41%.
On CIFAR-100, the GELU achieves a median error of 20.74%, the ReLU obtains 21.77%, and the ELU obtains 22.98%.
Sik-Ho Tsang. Review — Gaussian Error Linear Units (GELUs).