convolutional models have no temporal dependencies between timesteps, unlike recurrent models, so computation can be parallelized over the sequence
recurrent models have unbounded (infinite) context, but the paper's experiments show that infinite context is not necessary; a finite context window performs well
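one way to see that the context really is finite: the receptive field of a stack of causal convolutions grows linearly with depth. a tiny sketch (the layer count and kernel width below are illustrative, not values from the paper):

```python
# Hypothetical sketch: the context visible to a stacked convolutional LM
# is finite and grows linearly with depth, unlike a recurrent model's
# unbounded context.

def receptive_field(num_layers: int, kernel_width: int) -> int:
    """Number of past tokens visible to the top layer of a stack of
    non-dilated causal convolutions."""
    return num_layers * (kernel_width - 1) + 1

# e.g. 10 layers of width-5 causal convolutions see 41 tokens of context
print(receptive_field(10, 5))  # -> 41
```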
GLU (Gated Linear Unit): h(X) = (X∗W + b) ⊗ σ(X∗V + c), i.e. a linear branch element-wise gated by a sigmoid branch
model overview: word embeddings → stacked causal convolutions gated by GLUs → adaptive softmax output
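a minimal sketch of the GLU computation, with a plain linear layer standing in for the paper's convolution over time (all shapes below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(X, W, b, V, c):
    """Gated Linear Unit: h(X) = (XW + b) * sigmoid(XV + c).
    The sigmoid branch gates the linear branch element-wise; in the
    paper the two projections are convolutions over the time axis."""
    return (X @ W + b) * sigmoid(X @ V + c)

# toy shapes (illustrative): batch of 2, input dim 4, output dim 3
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b, c = np.zeros(3), np.zeros(3)
out = glu(X, W, b, V, c)
print(out.shape)  # (2, 3)
```

since the gate lies in (0, 1), the output magnitude never exceeds that of the linear branch.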
the model uses an adaptive softmax, which assigns higher capacity to very frequent words and lower capacity to rare words
this results in faster computation and lower memory requirements
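a minimal two-cluster sketch of the adaptive softmax idea (the real method uses multiple clusters sized for GPU throughput; all sizes and names below are illustrative): frequent "head" words get full-capacity logits, while rare "tail" words share one extra head slot and are scored in a reduced dimension.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adaptive_softmax(x, W_head, W_down, W_tail):
    """Two-cluster adaptive softmax sketch.

    Head words are scored at full dimension; tail words are scored
    after a down-projection, saving compute and memory on the large
    but rarely-hit part of the vocabulary."""
    head = softmax(x @ W_head)              # V_head word logits + 1 tail slot
    p_head_words = head[:-1]
    p_tail_cluster = head[-1]
    tail = softmax((x @ W_down) @ W_tail)   # scored in reduced dimension
    p_tail_words = p_tail_cluster * tail
    return np.concatenate([p_head_words, p_tail_words])

# toy sizes (illustrative): dim 8, 5 frequent words, 12 rare words, proj dim 2
rng = np.random.default_rng(0)
d, V_head, V_tail, d_proj = 8, 5, 12, 2
x = rng.normal(size=d)
W_head = rng.normal(size=(d, V_head + 1))
W_down = rng.normal(size=(d, d_proj))
W_tail = rng.normal(size=(d_proj, V_tail))
p = adaptive_softmax(x, W_head, W_down, W_tail)
print(p.sum())  # ~1.0 (a valid distribution over all 17 words)
```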
3. Gating Mechanisms
the purpose of the gating mechanism is to control what information is propagated through the hierarchy of layers
compared to the GTU (the LSTM-style gating mechanism tanh(X) ⊗ σ(X)), the gradient of the GTU gradually vanishes because every term carries a downscaling factor, tanh'(X) or σ'(X); the GLU has a gradient path with no downscaling factor
this path can be thought of as a multiplicative skip connection, which helps gradients flow through the layers
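writing the gradients out (notation as in the paper's Section 3):

$$\nabla[\tanh(X) \otimes \sigma(X)] = \tanh'(X)\,\nabla X \otimes \sigma(X) + \sigma'(X)\,\nabla X \otimes \tanh(X)$$

both terms are downscaled by $\tanh'(X)$ or $\sigma'(X)$, which are below 1, so the gradient shrinks with depth. for the GLU:

$$\nabla[X \otimes \sigma(X)] = \nabla X \otimes \sigma(X) + X \otimes \sigma'(X)\,\nabla X$$

the first term $\nabla X \otimes \sigma(X)$ carries no downscaling factor; this is the multiplicative skip connection.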
4. Experimental Setup
4.2. Training
gradient clipping is used during training and works well in practice
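a minimal sketch of clipping by global gradient norm (the threshold and shapes below are illustrative, not the paper's hyper-parameters):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down so their combined L2 norm is at most
    max_norm; gradients already below the threshold pass through
    unchanged. Returns the clipped list and the pre-clip norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# toy gradients: global norm is sqrt(9 + 16) = 5, clipped down to 1
grads = [np.ones((3, 3)), np.full(4, 2.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # -> 5.0
```

clipping by the global norm (rather than per-tensor) preserves the direction of the overall update while bounding its magnitude.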
https://arxiv.org/abs/1612.08083
Abstract
1. Introduction
2. Approach
3. Gating Mechanisms
4. Experimental Setup
4.2. Training
4.3. Hyper-parameters
5. Results
5.3. Non-linear Modeling
TODO