facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Does dynamic convolution with GLU give better or worse results? #578

Closed jiezhangGt closed 5 years ago

jiezhangGt commented 5 years ago

I tried a set of comparative experiments with and without GLU, and in my experiments the results with GLU were much worse. Is that normal? @myleott

huihuifan commented 5 years ago

Hi, thanks for your interest. For IWSLT, we did not find GLU helpful. Which dataset are you using? Our theory is that the additional capacity GLU adds is simply not needed there.

Also, we found in our experiments that models with GLU can overfit, so we used larger dropout than for models without GLU. I would suggest tuning the dropout and L2 regularization (weight decay) if you train with GLU.
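For intuition, here is a minimal PyTorch sketch (not fairseq's actual code; the module name `GLUInputProjection` and its arguments are made up for illustration) of what the GLU option does to the input projection in front of a lightweight/dynamic convolution, and where the dropout being discussed would sit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUInputProjection(nn.Module):
    """Sketch of a GLU-gated input projection (hypothetical, for illustration).

    With GLU, the projection is twice as wide and half of it is used as a
    sigmoid gate; this is the extra capacity mentioned above, and `dropout`
    is the knob you would increase to fight the resulting overfitting.
    """

    def __init__(self, embed_dim: int, use_glu: bool = True, dropout: float = 0.3):
        super().__init__()
        self.use_glu = use_glu
        out_dim = 2 * embed_dim if use_glu else embed_dim
        self.proj = nn.Linear(embed_dim, out_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        if self.use_glu:
            # F.glu splits the last dimension in half and computes a * sigmoid(b).
            x = F.glu(x, dim=-1)
        return self.dropout(x)

# Same output shape with and without the GLU gate, but the GLU variant
# has twice as many projection parameters.
x = torch.randn(10, 4, 512)  # (time, batch, embed_dim)
print(GLUInputProjection(512, use_glu=True)(x).shape)   # torch.Size([10, 4, 512])
print(GLUInputProjection(512, use_glu=False)(x).shape)  # torch.Size([10, 4, 512])
```

Note that the GLU variant doubles the width of the input projection, which is where the additional capacity comes from.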

We also have pretrained models with and without GLU if you would like to compare: https://github.com/pytorch/fairseq/tree/master/examples/pay_less_attention_paper

Hope that helps.

jiezhangGt commented 5 years ago

Hi, thank you for the response!

The dataset I used was built by my laboratory, and my model does indeed overfit, even though I already set the dropout to 0.3. Thank you for your suggestions; I will try a larger dropout when training the model.

Thanks again!
