dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Fix the hyperparameters for Electra-Large on SQuAD #1396

Closed ZiyueHuang closed 3 years ago

ZiyueHuang commented 3 years ago

Description

I finetuned all pretrained models on SQuAD 2.0 via AWS Batch (suggested by Xingjian) after my fix (https://github.com/dmlc/gluon-nlp/pull/1394). However, I found that some models perform worse; for example, Electra-Large drops from 90.67/88.32 to 88.95/86.11. I then checked the hyperparameters and found that, compared to the ELECTRA paper, MAX_GRAD_NORM is 10 times smaller and LR is 5 times smaller in run_squad2_electra_large.sh, while everything else is the same (pretrained weights, dataset, training method, the other hyperparameters...). Is there any particular reason for this setting?

MAX_GRAD_NORM=0.1 (used in all run_squad_MODEL.sh scripts) seems odd: everywhere else (e.g., the official BERT/ELECTRA/ALBERT code), MAX_GRAD_NORM is always 1, both for pretraining and for all finetuning tasks (GLUE, SQuAD, etc.). In the ELECTRA paper, the only hyperparameters the authors search over for finetuning are the learning rate and the layer-wise learning-rate decay.

Following the ELECTRA paper, I adopted LR=5e-05 and MAX_GRAD_NORM=1 instead of LR=1e-05 and MAX_GRAD_NORM=0.1 in run_squad2_electra_large.sh and obtained 90.83/88.33, matching the 90.6/88.0 reported in Table 4 of the ELECTRA paper. Here is the log: fintune_google_electra_large_squad_2.0.log
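
To make the change concrete, here is a minimal sketch of the before/after settings in Python-dict form. Only the values come from this PR; the dict layout and key names (mirroring the LR and MAX_GRAD_NORM variables) are illustrative, and the actual settings live in run_squad2_electra_large.sh.

```python
# Minimal sketch of the hyperparameter change for Electra-Large on SQuAD 2.0.
# Values are from this PR; key names are illustrative stand-ins for the
# LR and MAX_GRAD_NORM variables in run_squad2_electra_large.sh.
old_hparams = {"lr": 1e-5, "max_grad_norm": 0.1}  # current master, gives 88.95/86.11
new_hparams = {"lr": 5e-5, "max_grad_norm": 1.0}  # ELECTRA paper setting, gives 90.83/88.33
```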

The hyperparameters for the other models also need to be checked and adjusted...

cc @sxjscience @szhengac

cc @dmlc/gluon-nlp-team

codecov[bot] commented 3 years ago

Codecov Report

Merging #1396 into master will decrease coverage by 0.01%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1396      +/-   ##
==========================================
- Coverage   85.25%   85.24%   -0.02%     
==========================================
  Files          53       53              
  Lines        6959     6959              
==========================================
- Hits         5933     5932       -1     
- Misses       1026     1027       +1     
Impacted Files | Coverage Δ
src/gluonnlp/data/tokenizers/yttm.py | 81.89% <0.00%> (-0.87%)

Continue to review full report at Codecov.


github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1396/fix_squad_hparam/index.html

sxjscience commented 3 years ago

There is no need to adjust all hparams. In my experience, the models work well over a wide range of hparams, so there is no need to stick to one particular set. All you need is to ensure that the run reaches roughly the same best dev EM/F1 as in the paper. I think we did a basic hparam search and found that the current combination, i.e., 0.1 grad clipping plus the lower lr, works well too.

ZiyueHuang commented 3 years ago

@sxjscience "no need to stick to one set of hparam"

Here the goal is to reproduce. I didn't say that we should stick to one set of hparams; I opened this PR because the SQuAD command on master currently cannot reproduce Electra-Large.

ZiyueHuang commented 3 years ago

I'm saying that max_grad_norm is fixed at 1 in all other papers, so I think there is no need to perform an hparam search over max_grad_norm. Otherwise, when other people use our SQuAD finetuning code, a fair comparison would also require them to search over max_grad_norm for the other baseline models (BERT, ELECTRA, ALBERT, ...).

ZiyueHuang commented 3 years ago

"There is no need to adjust all hparams."

Yeah, we can adjust the hparams only for the models that cannot reproduce the official results. I will check tomorrow...

sxjscience commented 3 years ago

In fact, I believe that the original hparams "reproduce" most existing results. Our SQuAD finetuning is not exactly the same as the existing SQuAD finetuning scripts, because we use "encode_with_offsets" to convert character-level spans to token-level spans, which is much faster.
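
For readers unfamiliar with offset-based conversion, here is a generic sketch of mapping a character-level answer span onto token indices given per-token character offsets. This is an illustration only, not GluonNLP's actual encode_with_offsets implementation; the function name and example data are made up for clarity.

```python
from typing import List, Tuple

def char_span_to_token_span(offsets: List[Tuple[int, int]],
                            char_start: int,
                            char_end: int) -> Tuple[int, int]:
    """Map a character-level span [char_start, char_end) onto token indices,
    given per-token (start, end) character offsets from an offset-aware
    tokenizer. Generic illustration, not the GluonNLP implementation."""
    token_start, token_end = None, None
    for i, (start, end) in enumerate(offsets):
        if token_start is None and end > char_start:
            token_start = i
        if start < char_end:
            token_end = i
    return token_start, token_end

# Example: three tokens covering "New York City"; the answer span is "York City".
offsets = [(0, 3), (4, 8), (9, 13)]              # "New", "York", "City"
print(char_span_to_token_span(offsets, 4, 13))   # -> (1, 2)
```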

ZiyueHuang commented 3 years ago

The following are the results on the current master (excluding electra-large, which I have already fixed, and albert-xxlarge, which has not finished yet).

electra-small 74.62/71.93
electra-base 86.38/83.76

bert-base 76.09/73.21
bert-large 81.51/78.62

roberta-large 89.70/86.79

mobile-bert 79.97/77.39

albert-base 82.32/79.33
albert-large 85.31/82.36
albert-xlarge 87.85/85.02

ZiyueHuang commented 3 years ago

Done.

zheyuye commented 3 years ago

Thanks for pointing that out. The better learning rate for ELECTRA-Large might be 5e-5, as the paper suggests, which should also be reflected in https://github.com/dmlc/gluon-nlp/blob/99b35d8bed5eb375c195755375cbc1b459ee616e/scripts/question_answering/commands/run_squad2_electra_large.sh#L20

sxjscience commented 3 years ago

@ZiyueHuang After checking the git history again, I think the bug was introduced in https://github.com/dmlc/gluon-nlp/pull/1378/files#diff-ba6c6edac414e9b924a8524f9b254fb71b62e69cc25df027c953f560c5a37dbdR9. I mistakenly used the wrong lr for Electra-Large. I checked the other results you reported and they look normal.