flrngel / understanding-ai

personal repository


Neural Machine Translation in Linear Time #11

Open flrngel opened 6 years ago

flrngel commented 6 years ago

https://arxiv.org/abs/1610.10099 (the ByteNet paper from DeepMind)

Notations

s: source
t: target

Abstract

Features

(model feature) the decoder is stacked on top of the encoder
(training feature) the decoder uses a dynamic unfolding mechanism
(result feature) runs in time linear in the sequence length and sidesteps excessive memorization

1. Introduction

ByteNet is resolution preserving
- this sidesteps memorization and allows maximal bandwidth between encoder and decoder

2. Neural Translation Model

2.1. Desiderata

("Desiderata" is the plural of the Latin word "desideratum", something desired; in this paper it means the properties the model should have)

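the computation should run in parallel (reducing computation time)
the source representation should be resolution preserving, i.e. grow with the source length rather than have a constant size
the path between input and output tokens should be short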

3. ByteNet

3.1. Encoder-Decoder Stacking

the decoder is stacked directly on top of the encoder to maximize the representational bandwidth between them

3.2. Dynamic Unfolding

the estimated target length |t̂| is a linear function of the source length: |t̂| = a|s| + b (a=1.2, b=0 in this paper)
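A minimal sketch of the length estimate, assuming the linear form a|s| + b with the values noted above. The function name and the choice to round up to a whole number of tokens are my own assumptions, not taken from the paper.

```python
import math


def estimated_target_length(source_len, a=1.2, b=0.0):
    """Estimate the target length from the source length as a*|s| + b.

    Rounding up with ceil is an assumption made for this sketch, so that
    the estimate is always a whole number of tokens.
    """
    return math.ceil(a * source_len + b)


if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, "->", estimated_target_length(n))
```

With a=1.2 the estimate is somewhat longer than the source, which gives the decoder room to unfold over the encoder representation.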

3.4. Masked One-dimensional Convolutions

use masking so that future tokens cannot affect the prediction of the current token
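To make the masking idea concrete, here is a small pure-Python sketch of a masked (causal) one-dimensional convolution. It is only an illustration: the function name, the left zero-padding, and the single-channel list input are assumptions of this sketch, not the paper's implementation.

```python
def masked_conv1d(x, w):
    """Causal 1D convolution over a list of numbers.

    Output position t only sees x[t - (K - 1)] .. x[t], where K = len(w).
    Positions before the start of the sequence are treated as zeros.
    """
    k = len(w)
    # Left-pad so the kernel never reaches past the current position.
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


if __name__ == "__main__":
    y1 = masked_conv1d([1, 2, 3, 4], [1, 1, 1])
    y2 = masked_conv1d([1, 2, 3, 100], [1, 1, 1])
    # Changing the last (future) token leaves earlier outputs unchanged.
    print(y1[:3] == y2[:3])
```

Because each output depends only on current and earlier inputs, all positions can be computed in parallel during training without leaking future target tokens.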

3.5. Dilation

dilation makes the receptive field grow exponentially with network depth
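A quick way to see the exponential growth is to compute the receptive field of a stack of convolutions directly. The kernel size of 3 and the dilation schedule 1, 2, 4, 8, 16 below are illustrative assumptions for this sketch, and the helper name is hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input positions) of stacked 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    depth = 5
    doubling = [2 ** i for i in range(depth)]  # 1, 2, 4, 8, 16
    constant = [1] * depth                     # no dilation
    print("doubling dilation:", receptive_field(3, doubling))
    print("no dilation:      ", receptive_field(3, constant))
```

With doubling dilation rates, five layers cover 63 positions, while the same five layers without dilation cover only 11, so the dilated stack reaches long contexts with far fewer layers.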

4. Model Comparison

Todo

Read up on dilated convolutions