dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

fix squad loss #1394

Closed ZiyueHuang closed 3 years ago

ZiyueHuang commented 3 years ago

Description

We should additionally normalize the loss by num_accumulated * len(ctx_l). I noticed this issue because the grad_norm in the fine-tuning log is very large (around 70), while in the pretraining log it is usually around 2.
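A minimal, self-contained Gluon sketch of where this extra factor would go in a gradient-accumulation loop (this is not the actual run_squad.py code; the tiny Dense model, random data, and hyperparameters are placeholders invented for illustration):

```python
import mxnet as mx
from mxnet import autograd, gluon, nd

num_accumulated = 12
ctx_l = [mx.cpu()]                      # list of devices; several GPUs in the real script

net = gluon.nn.Dense(1)
net.initialize()
net.collect_params().setattr('grad_req', 'add')   # accumulate grads across backward() calls
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

data = nd.random.normal(shape=(num_accumulated, 1, 4))    # 12 micro-batches of size 1
label = nd.random.normal(shape=(num_accumulated, 1, 1))

for i in range(num_accumulated):
    with autograd.record():
        # .mean() already normalizes by the per-device micro-batch size ...
        loss = ((net(data[i]) - label[i]) ** 2).mean()
        # ... so also divide by the number of accumulated micro-batches,
        # i.e. num_accumulated * len(ctx_l), before calling backward()
        loss = loss / (num_accumulated * len(ctx_l))
    loss.backward()

trainer.step(1)                         # the summed grads now match one big mean-normalized batch
for p in net.collect_params().values():
    p.zero_grad()                       # reset accumulated gradients for the next step
```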

To see intuitively why the current version is wrong, consider the two commands python3 run_squad.py --batch_size 12 --num_accumulated 1 ... and python3 run_squad.py --batch_size 1 --num_accumulated 12 ...: they should give the same results when using the MXNet Trainer. Currently they do not. By adding print(total_norm) you can observe that the grad_norm in the second command is nearly 10 times larger than in the first. Meanwhile, the remaining operations (clip_grad_global_norm and trainer.update) are identical across the two commands, so their behaviors diverge, and the divergence becomes more significant as max_grad_norm grows. The reason is that the gradients are first normalized by batch_size within each mini-batch (see span_loss = ....mean() and answerable_loss = ....mean()) and then accumulated over num_accumulated * len(ctx_l) mini-batches. Also note that num_workers is the number of machines (not the number of GPUs) when using the MXNet Trainer.
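To make the scaling concrete, here is a small NumPy illustration (a linear model with squared loss, invented purely for this example, not code from the repository): the gradient of the mean loss over one batch of 12 samples is the same no matter how the batch is split, but summing per-micro-batch mean gradients over 12 micro-batches of size 1 inflates the result by exactly that factor of 12.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)                    # model parameters
X = rng.normal(size=(12, 4))              # 12 samples in total
y = rng.normal(size=12)

def grad_mean_loss(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb) ** 2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# batch_size=12, num_accumulated=1: one mean-normalized gradient
g_big = grad_mean_loss(X, y, w)

# batch_size=1, num_accumulated=12: per-micro-batch mean gradients are summed
g_acc = sum(grad_mean_loss(X[i:i + 1], y[i:i + 1], w) for i in range(12))

print(np.linalg.norm(g_acc) / np.linalg.norm(g_big))   # -> 12.0
print(np.allclose(g_acc / 12, g_big))                  # -> True once divided by 12
```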

Then why does the current version also give satisfactory results? Because most of the time the grad_norm is much larger (more than num_accumulated * len(ctx_l) times) than max_grad_norm, so it makes no difference whether we normalize the loss correctly: the gradients are clipped to max_grad_norm anyway. Things would be different when the grad_norm is not so large or when max_grad_norm is large.
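A small numeric sketch of this masking effect (the gradient vector and the factor of 12 are made up for illustration): when the unclipped norm is far above max_grad_norm, the correctly normalized and the over-scaled gradients are clipped to the same vector, so the bug is invisible; with a larger max_grad_norm (or a smaller gradient norm) the two updates diverge.

```python
import numpy as np

def clip_by_global_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    scale = min(1.0, max_norm / np.linalg.norm(grad))
    return grad * scale

g_correct = np.array([3.0, 4.0])       # properly normalized gradient, norm 5
g_inflated = 12 * g_correct            # missing the 1/(num_accumulated*len(ctx_l)) factor

# max_grad_norm = 0.1: both are clipped to the same vector, so the bug is hidden
print(clip_by_global_norm(g_correct, 0.1))    # [0.06 0.08]
print(clip_by_global_norm(g_inflated, 0.1))   # [0.06 0.08]

# max_grad_norm = 10: the correct gradient is untouched, the inflated one is not
print(clip_by_global_norm(g_correct, 10.0))   # [3. 4.]
print(clip_by_global_norm(g_inflated, 10.0))  # [6. 8.]
```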

I tested this on the BERT base model, and the performance is the same before and after this PR. I think this is because the grad_norm is much larger than max_grad_norm=0.1.

cc @sxjscience

Checklist

Essentials

Changes

Comments

cc @dmlc/gluon-nlp-team

github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1394/fix_squad_loss/index.html

codecov[bot] commented 3 years ago

Codecov Report

Merging #1394 into master will increase coverage by 0.15%. The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1394      +/-   ##
==========================================
+ Coverage   85.12%   85.28%   +0.15%     
==========================================
  Files          53       53              
  Lines        6959     6959              
==========================================
+ Hits         5924     5935      +11     
+ Misses       1035     1024      -11     
Impacted Files                          Coverage Δ
src/gluonnlp/data/filtering.py          78.26% <0.00%> (-4.35%) ↓
src/gluonnlp/data/tokenizers/yttm.py    81.89% <0.00%> (-0.87%) ↓
src/gluonnlp/utils/misc.py              59.75% <0.00%> (+0.92%) ↑
src/gluonnlp/data/loading.py            83.39% <0.00%> (+5.28%) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 8ef4b26...e91fb47.