convolutional models have no temporal dependencies between timesteps, unlike recurrent models, so computation can be parallelized over the sequence
recurrent models have unbounded (infinite) context, but the paper's experiments show that infinite context is not necessary; a finite context window performs well
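one way to see that the context really is finite: the receptive field of a stack of causal convolutions grows linearly with depth. a tiny sketch (the layer count and kernel width below are illustrative, not values from the paper):

```python
# Hypothetical sketch: the context visible to a stacked convolutional LM
# is finite and grows linearly with depth, unlike a recurrent model's
# unbounded context.

def receptive_field(num_layers: int, kernel_width: int) -> int:
    """Number of past tokens visible to the top layer of a stack of
    non-dilated causal convolutions."""
    return num_layers * (kernel_width - 1) + 1

# e.g. 10 layers of width-5 causal convolutions see 41 tokens of context
print(receptive_field(10, 5))  # -> 41
```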
GLU (Gated Linear Unit): h(X) = (X∗W + b) ⊗ σ(X∗V + c), i.e. a linear branch element-wise gated by a sigmoid branch
model overview: word embeddings → stacked causal convolutions gated by GLUs → adaptive softmax output
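a minimal sketch of the GLU computation, with a plain linear layer standing in for the paper's convolution over time (all shapes below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(X, W, b, V, c):
    """Gated Linear Unit: h(X) = (XW + b) * sigmoid(XV + c).
    The sigmoid branch gates the linear branch element-wise; in the
    paper the two projections are convolutions over the time axis."""
    return (X @ W + b) * sigmoid(X @ V + c)

# toy shapes (illustrative): batch of 2, input dim 4, output dim 3
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b, c = np.zeros(3), np.zeros(3)
out = glu(X, W, b, V, c)
print(out.shape)  # (2, 3)
```

since the gate lies in (0, 1), the output magnitude never exceeds that of the linear branch.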
the model uses an adaptive softmax, which assigns higher capacity to very frequent words and lower capacity to rare words
this results in faster computation and lower memory requirements
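a minimal two-cluster sketch of the adaptive softmax idea (the real method uses multiple clusters sized for GPU throughput; all sizes and names below are illustrative): frequent "head" words get full-capacity logits, while rare "tail" words share one extra head slot and are scored in a reduced dimension.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adaptive_softmax(x, W_head, W_down, W_tail):
    """Two-cluster adaptive softmax sketch.

    Head words are scored at full dimension; tail words are scored
    after a down-projection, saving compute and memory on the large
    but rarely-hit part of the vocabulary."""
    head = softmax(x @ W_head)              # V_head word logits + 1 tail slot
    p_head_words = head[:-1]
    p_tail_cluster = head[-1]
    tail = softmax((x @ W_down) @ W_tail)   # scored in reduced dimension
    p_tail_words = p_tail_cluster * tail
    return np.concatenate([p_head_words, p_tail_words])

# toy sizes (illustrative): dim 8, 5 frequent words, 12 rare words, proj dim 2
rng = np.random.default_rng(0)
d, V_head, V_tail, d_proj = 8, 5, 12, 2
x = rng.normal(size=d)
W_head = rng.normal(size=(d, V_head + 1))
W_down = rng.normal(size=(d, d_proj))
W_tail = rng.normal(size=(d_proj, V_tail))
p = adaptive_softmax(x, W_head, W_down, W_tail)
print(p.sum())  # ~1.0 (a valid distribution over all 17 words)
```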
3. Gating Mechanisms
the purpose of the gating mechanism is to control what information is propagated through the hierarchy of layers
compared to the GTU (the LSTM-style gating mechanism tanh(X) ⊗ σ(X)), the gradient of the GTU gradually vanishes because every term carries a downscaling factor, tanh'(X) or σ'(X); the GLU has a gradient path with no downscaling factor
this path can be thought of as a multiplicative skip connection, which helps gradients flow through the layers
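writing the gradients out (notation as in the paper's Section 3):

$$\nabla[\tanh(X) \otimes \sigma(X)] = \tanh'(X)\,\nabla X \otimes \sigma(X) + \sigma'(X)\,\nabla X \otimes \tanh(X)$$

both terms are downscaled by $\tanh'(X)$ or $\sigma'(X)$, which are below 1, so the gradient shrinks with depth. for the GLU:

$$\nabla[X \otimes \sigma(X)] = \nabla X \otimes \sigma(X) + X \otimes \sigma'(X)\,\nabla X$$

the first term $\nabla X \otimes \sigma(X)$ carries no downscaling factor; this is the multiplicative skip connection.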
4. Experimental Setup
4.2. Training
gradient clipping is used during training and works well in practice
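a minimal sketch of clipping by global gradient norm (the threshold and shapes below are illustrative, not the paper's hyper-parameters):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down so their combined L2 norm is at most
    max_norm; gradients already below the threshold pass through
    unchanged. Returns the clipped list and the pre-clip norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# toy gradients: global norm is sqrt(9 + 16) = 5, clipped down to 1
grads = [np.ones((3, 3)), np.full(4, 2.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # -> 5.0
```

clipping by the global norm (rather than per-tensor) preserves the direction of the overall update while bounding its magnitude.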
https://arxiv.org/abs/1612.08083
Abstract
1. Introduction
2. Approach
3. Gating Mechanisms
4. Experimental Setup
4.2. Training
4.3. Hyper-parameters
5. Results
5.3. Non-linear Modeling
TODO