Closed ZiyueHuang closed 4 years ago
LGTM. Need to revise the usage of `nd.reshape` --> `npx.reshape`.
Merging #1319 into master will decrease coverage by 0.09%. The diff coverage is n/a.
```diff
@@            Coverage Diff             @@
##           master    #1319      +/-   ##
==========================================
- Coverage   84.45%   84.36%   -0.10%
==========================================
  Files          42       42
  Lines        6426     6426
==========================================
- Hits         5427     5421       -6
- Misses        999     1005       +6
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/gluonnlp/data/loading.py | 81.13% <0.00%> (-2.27%) | :arrow_down: |
Continue to review full report at Codecov.

> Legend - Click here to learn more
> Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 970318d...f2c8df5. Read the comment docs.
I've made some changes. Could you please review again?
Description
For each worker, we should normalize the loss by the number of elements. This improves numerical stability and avoids having to set `step_size` in `trainer.update()`, and it is consistent with the official TF implementation; see https://github.com/google-research/electra/blob/master/run_pretraining.py#L181 and https://github.com/google-research/electra/blob/master/run_pretraining.py#L206.

To check the correctness of the Horovod support, let k denote the number of workers.
Ground truth:
In our code:
This part of the implementation of Horovod support is consistent with the BERT script; see https://github.com/dmlc/gluon-nlp/blob/v0.10.x/scripts/bert/run_pretraining.py.
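The normalization argument above can be sketched in plain Python (a minimal sketch: `worker_loss` and `allreduce_average` are hypothetical stand-ins for the actual Gluon loss and Horovod allreduce calls, not this PR's code). If every worker normalizes its loss by its own element count and the shards are equal-sized, then averaging the per-worker losses across k workers — which is what an averaging allreduce effectively does to the gradients — recovers the globally normalized loss, so no extra `step_size` rescaling is needed in `trainer.update()`:

```python
# Hypothetical sketch of the per-worker normalization argument.
def worker_loss(shard):
    # Each worker normalizes its loss by its own number of elements.
    return sum(shard) / len(shard)

def allreduce_average(values):
    # Horovod's averaging allreduce is a mean across workers.
    return sum(values) / len(values)

k, n = 4, 8  # k workers, n elements per worker (equal-sized shards)
shards = [[float(i * n + j) for j in range(n)] for i in range(k)]

per_worker = [worker_loss(s) for s in shards]
averaged = allreduce_average(per_worker)

# Global loss normalized by the total number of elements.
flat = [x for s in shards for x in s]
global_mean = sum(flat) / len(flat)

assert abs(averaged - global_mean) < 1e-12  # the two normalizations agree
```

With unequal shard sizes the two quantities would differ, which is why the per-worker element counts matter when checking the Horovod gradients against the ground truth.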
@sxjscience Could you also check the correctness?
Also, we need to re-train the model and adjust the hyper-parameters accordingly.
Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team