
ICLR-2019-Fixup Initialization: Residual Learning Without Normalization #241

Open BrambleXu opened 5 years ago

BrambleXu commented 5 years ago

One-sentence summary:

Normalization layers have become a standard component of today's mainstream deep learning models. They are widely believed to make training more stable, allow higher learning rates, speed up convergence, and improve generalization. This paper challenges that belief by showing that these benefits do not come from normalization alone. Specifically, the authors use a fixed initialization scheme (Fixup) and find that, with it, residual networks train stably even when scaled to 10,000 layers without any normalization. Moreover, with proper regularization, Fixup lets residual networks without normalization reach state-of-the-art results on image classification and machine translation.
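
To make concrete what the "fixed initialization" means, here is a minimal PyTorch sketch of the Fixup rule for a ResNet-style block with two convolutions per residual branch (m = 2), so the paper's scaling factor L^(-1/(2m-2)) reduces to L^(-1/2). The names `FixupBasicBlock` and `fixup_init` are my own, not from the paper's code.

```python
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    """Residual block with no BatchNorm; Fixup instead adds scalar biases
    and a scalar multiplier around the two convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(x + self.bias1a)
        out = self.relu(out + self.bias1b)
        out = self.conv2(out + self.bias2a)
        out = out * self.scale + self.bias2b
        return self.relu(out + x)

def fixup_init(blocks):
    """Fixup rule: scale the first conv of each residual branch by
    L^(-1/(2m-2)) (= L^(-1/2) for m = 2) and zero-init the last conv,
    so every residual branch starts out as the identity."""
    L = len(blocks)
    for block in blocks:
        nn.init.kaiming_normal_(block.conv1.weight)
        block.conv1.weight.data.mul_(L ** -0.5)
        nn.init.zeros_(block.conv2.weight)

# Usage: blocks = nn.ModuleList(FixupBasicBlock(64) for _ in range(16)); fixup_init(blocks)
```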

Resources:

Paper information:

Notes:


4.3 MACHINE TRANSLATION

To demonstrate the generality of Fixup, we also apply it to replace layer normalization (Ba et al., 2016) in Transformer (Vaswani et al., 2017), a state-of-the-art neural network for machine translation. Specifically, we use the fairseq library (Gehring et al., 2017) and follow the Fixup template in Section 3 to modify the baseline model.

We evaluate on two standard machine translation datasets, IWSLT German-English (de-en) and WMT English-German (en-de) following the setup of Ott et al. (2018).

It was reported (Chen et al., 2018) that "Layer normalization is most critical to stabilize the training process... removing layer normalization results in unstable training runs". However, we find training with Fixup to be very stable and as fast as the baseline model. Results are shown in Table 3.

(Table 3 screenshot: BLEU results on IWSLT de-en and WMT en-de)
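
On the Transformer side, a rough PyTorch sketch of what "replace layer normalization following the Fixup template" could look like for the feed-forward sublayer (the attention sublayer would be handled analogously). This is my own illustration under the Section 3 template with m = 2 layers per branch, not the paper's actual fairseq patch; `FixupFFNSublayer` and `num_residual_branches` are assumed names.

```python
import torch
import torch.nn as nn

class FixupFFNSublayer(nn.Module):
    """Transformer feed-forward sublayer with the LayerNorm removed.
    Stability is pushed into initialization: the last projection of the
    residual branch starts at zero and the first one is scaled down."""
    def __init__(self, d_model, d_ff, num_residual_branches):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.scale = nn.Parameter(torch.ones(1))  # Fixup scalar multiplier
        self.relu = nn.ReLU()
        # Fixup template (m = 2): shrink fc1 by L^(-1/2), zero the branch output.
        nn.init.xavier_uniform_(self.fc1.weight)
        self.fc1.weight.data.mul_(num_residual_branches ** -0.5)
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        # Plain residual connection, no normalization anywhere in the block.
        return x + self.scale * self.fc2(self.relu(self.fc1(x)))
```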

Model diagram:

Results

Papers to read next