LiyuanLucasLiu Transformer-Clinic issues

LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers

https://arxiv.org/abs/2004.08249

Apache License 2.0

326 stars 20 forks

issues

Position of residual connection in PreLN architecture is wrong

#27 bilzard closed 1 year ago
1
How to get the beta_{i,j} for each residual branch?

#26 SefaZeng opened 2 years ago
0
Deepnet

#25 LiyuanLucasLiu closed 2 years ago
0
Admin for 100L-100L model?

#24 Vincent131499 closed 2 years ago
1
Ensemble models

#23 Vincent131499 closed 2 years ago
0
How to add Radam to fairseq?

#22 KelleyYin closed 3 years ago
1
argdict

#21 riosempre closed 3 years ago
1
Reimplement Admin in new fairseq but get bad valid loss

#20 moonscar closed 3 years ago
0
Question about the adaptive optimizer

#19 chenwydj closed 3 years ago
1
Difference of implementation from the original paper

#18 wade3han closed 3 years ago
1
`RuntimeError: expected scalar type Float but found Half` during the eval step

#17 ruiningh closed 3 years ago
5
Scripts for Post-LN in Figure 10?

#16 zhuchen03 closed 3 years ago
1
Is wmt14en-fr.sh missing in pre-process dir?

#15 lvzaihefang closed 3 years ago
1
wmt_en_de admin: Function 'SoftmaxBackward' returned nan values in its 0th output.

#14 sshleifer closed 3 years ago
8
tmp_weight is not defined

#13 sshleifer closed 3 years ago
4
IWSLT'14 Results

#12 villmow closed 3 years ago
1
Update README.md

#10 LiyuanLucasLiu closed 4 years ago
0
Post-LN with 12-12 trains OK, but 12-3 diverges

#9 ZhenYangIACAS closed 4 years ago
9
How to make sure that only a single forward-pass step is performed in the profiling phase?

#8 ZhenYangIACAS closed 4 years ago
1
Is "tmp_weight" in transformer_layer.py useless?

#7 zherowolf closed 4 years ago
3
Details of total batch size

#6 luofuli closed 4 years ago
1
Do the embedding layer's layernorm parameters need to be reparameterized accordingly?

#5 gotobelieve closed 4 years ago
1
Can I use a pre-trained model to initialize the model?

#4 luofuli closed 4 years ago
1
Are "attention_ratio_change" and "fc_ratio_change" trainable or not?

#3 gotobelieve closed 4 years ago
2
remove debug parameter

#2 LiyuanLucasLiu closed 4 years ago
0
What's the meaning of the 'adaptive-scale' argument?

#1 gotobelieve closed 4 years ago
1