issues
LiyuanLucasLiu/Transformer-Clinic
Understanding the Difficulty of Training Transformers
https://arxiv.org/abs/2004.08249
Apache License 2.0
326 stars
20 forks
Position of residual connection in PreLN architecture is wrong
#27
bilzard
closed
1 year ago
1
How to get the beta_{i,j} for each residual branch?
#26
SefaZeng
opened
2 years ago
0
Deepnet
#25
LiyuanLucasLiu
closed
2 years ago
0
Admin for 100L-100L model?
#24
Vincent131499
closed
2 years ago
1
Ensemble models
#23
Vincent131499
closed
2 years ago
0
How to add RAdam to fairseq?
#22
KelleyYin
closed
3 years ago
1
argdict
#21
riosempre
closed
3 years ago
1
Reimplement Admin in new fairseq but get bad valid loss
#20
moonscar
closed
3 years ago
0
Question about the adaptive optimizer
#19
chenwydj
closed
3 years ago
1
Difference of implementation from the original paper
#18
wade3han
closed
3 years ago
1
`RuntimeError: expected scalar type Float but found Half` during the eval step
#17
ruiningh
closed
3 years ago
5
Scripts for Post-LN in Figure 10?
#16
zhuchen03
closed
3 years ago
1
Is wmt14en-fr.sh missing in pre-process dir?
#15
lvzaihefang
closed
3 years ago
1
wmt_en_de admin: Function 'SoftmaxBackward' returned nan values in its 0th output.
#14
sshleifer
closed
3 years ago
8
tmp_weight is not defined
#13
sshleifer
closed
3 years ago
4
IWSLT'14 Results
#12
villmow
closed
3 years ago
1
Update README.md
#10
LiyuanLucasLiu
closed
4 years ago
0
Post-LN with 12-12 trains OK, but 12-3 diverges
#9
ZhenYangIACAS
closed
4 years ago
9
How to make sure that only one forward pass is performed in the profiling phase?
#8
ZhenYangIACAS
closed
4 years ago
1
is "tmp_weight" in transformer_layer.py useless?
#7
zherowolf
closed
4 years ago
3
Details of total batch size
#6
luofuli
closed
4 years ago
1
Do the embedding layer's layernorm parameters need to be reparameterized accordingly?
#5
gotobelieve
closed
4 years ago
1
Can I use a pre-trained model to initialize the model?
#4
luofuli
closed
4 years ago
1
Is the "attention_ratio_change" and "fc_ratio_change" trainable or not?
#3
gotobelieve
closed
4 years ago
2
remove debug parameter
#2
LiyuanLucasLiu
closed
4 years ago
0
What's the meaning of the 'adaptive-scale' argument?
#1
gotobelieve
closed
4 years ago
1