Closed jiezhangGt closed 5 years ago
Hi, thanks for your interest. For IWSLT, we did not find GLU helpful. Which dataset are you using? We theorize it is because it's not necessary to have that additional capacity.
Also, we found in our experiments that with GLU, models can overfit, so we used larger dropout compared to models without GLU. So I would suggest tuning the dropout and l2 regularization if you try with GLU.
We also have pretrained models with and without GLU if you would like to compare: https://github.com/pytorch/fairseq/tree/master/examples/pay_less_attention_paper
Hope that helps.
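For reference, the GLU gating discussed above splits its input in half along the feature dimension and gates one half with the sigmoid of the other. Here is a minimal numpy sketch of that computation; the actual fairseq models use `torch.nn.functional.glu`, and the variable names here are illustrative only:

```python
import numpy as np

def glu(x, axis=-1):
    """Gated linear unit: split x in half along `axis` and gate the
    first half with the sigmoid of the second half.
    A numpy sketch of what torch.nn.functional.glu computes."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

# Example: a batch of one vector with 4 features -> 2 gated outputs.
x = np.array([[1.0, -2.0, 0.0, 3.0]])
out = glu(x)  # shape (1, 2); halves the feature dimension
```

Because the gate can shut off features, GLU adds capacity, which is consistent with the overfitting (and larger-dropout remedy) mentioned above.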
Hi, thank you for the response!
The dataset I used was created by my laboratory, and there is indeed an overfitting phenomenon in my model, even though I have already set dropout to 0.3. Thank you for your suggestions; I will try a larger dropout when training the model.
thanks again!
I ran a set of comparative experiments with and without GLU, and the result was that the model using GLU performed much worse. Is that normal? @myleott