dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Fix the correctness of the Horovod support on squad #1353

Closed · zheyuye closed this 4 years ago

zheyuye commented 4 years ago

Description

Fix the SQuAD script to support Horovod, as https://github.com/dmlc/gluon-nlp/pull/1319 did for pretraining ELECTRA.

Got a result of 77.62/80.39 for mobilebert. See https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering for the comparison

Comments

@sxjscience @ZiyueHuang

cc @dmlc/gluon-nlp-team

codecov[bot] commented 4 years ago

Codecov Report

Merging #1353 into master will decrease coverage by 0.11%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1353      +/-   ##
==========================================
- Coverage   81.65%   81.53%   -0.12%     
==========================================
  Files          52       52              
  Lines        6862     6901      +39     
==========================================
+ Hits         5603     5627      +24     
- Misses       1259     1274      +15     
Impacted Files                        Coverage Δ
src/gluonnlp/utils/parameter.py       86.53% <0.00%> (-8.92%)
src/gluonnlp/utils/preprocessing.py   94.28% <0.00%> (-5.72%)
src/gluonnlp/data/loading.py          78.11% <0.00%> (-3.02%)
conftest.py                           85.00% <0.00%> (-0.72%)
setup.py                               0.00% <0.00%> (ø)
src/gluonnlp/models/bart.py           70.62% <0.00%> (ø)
src/gluonnlp/models/bert.py           87.67% <0.00%> (ø)
src/gluonnlp/models/albert.py         87.80% <0.00%> (ø)
src/gluonnlp/models/electra.py        57.89% <0.00%> (ø)
src/gluonnlp/models/mobilebert.py     87.69% <0.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 203d84c...0de5093.

zheyuye commented 4 years ago

Please do not merge this until I have re-run MobileBERT with Horovod.

zheyuye commented 4 years ago

@sxjscience I got the following results after the revision

{
    "exact": 41.20272888065358,
    "f1": 44.673086401387295,
    "total": 11873,
    "HasAns_exact": 82.52361673414305,
    "HasAns_f1": 89.47428388051136,
    "HasAns_total": 5928,
    "NoAns_exact": 0.0,
    "NoAns_f1": 0.0,
    "NoAns_total": 5945,
    "best_exact": 77.25090541564896,
    "best_exact_thresh": -1.8302667140960693,
    "best_f1": 80.05376875130503,
    "best_f1_thresh": -1.6563208103179932,
    "best_ckpt": "google_uncased_mobilebert_squad2.0_20615.params"
}

which is slightly lower than the previous results (80.54/77.81) obtained by fine-tuning without Horovod.

However, Horovod does speed up the fine-tuning phase, cutting the wall-clock time roughly in half (2.55 vs. 5.05 hours).

sxjscience commented 4 years ago

@ZiyueHuang @szhengac Do you think the change here is reasonable? The change is that each worker will calculate the average loss and then we do an allreduce. One issue is that the batch size of each worker may not be the same. However, I think the effect should be negligible.
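
For context, a minimal, self-contained sketch of the pattern under discussion (toy model, synthetic data, made-up hyperparameters; written against the MXNet 1.x Gluon and horovod.mxnet APIs, not the actual run_squad.py code):

import mxnet as mx
from mxnet import autograd, gluon
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()

net = gluon.nn.Dense(2)                      # toy stand-in for the QA model
net.initialize(ctx=ctx)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

# Start every worker from identical weights.
params = net.collect_params()
hvd.broadcast_parameters(params, root_rank=0)

# DistributedTrainer allreduces (averages) the gradients across workers on step().
trainer = hvd.DistributedTrainer(params, 'adam', {'learning_rate': 1e-3})

for _ in range(10):
    # Local batch; in principle its size could differ across workers.
    data = mx.nd.random.normal(shape=(4, 8), ctx=ctx)
    label = mx.nd.random.randint(0, 2, shape=(4,), ctx=ctx).astype('float32')
    with autograd.record():
        # Each worker averages the loss over its OWN local batch size; if local
        # batch sizes differ, this is not exactly the global-batch average,
        # which is the concern raised above.
        loss = loss_fn(net(data), label).mean()
    loss.backward()
    trainer.step(1)   # batch_size=1 since the loss is already averaged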

szhengac commented 4 years ago

Then we should check how large the difference in batch size is. If it is large, consider changing the sampling strategy to make the batch sizes closer.

sxjscience commented 4 years ago

@szhengac @ZheyuYe I just noticed that our current implementation ensures that the batch_size will always be the same.

sxjscience commented 4 years ago

@ZheyuYe I think the performance degradation may be caused by the SplitSampler: https://github.com/dmlc/gluon-nlp/blob/0223686dbc25bae416dfa764dc94ef2f6664fb00/scripts/question_answering/run_squad.py#L447-L448.

Basically, the current code divides the input into num_workers parts, shuffles each part internally, and then draws minibatches from it. Thus, samples from different parts are never mixed within a batch. E.g., assume the local batch_size is 2 and there are 4 workers, so the effective batch_size is 8. It is then impossible for 4 of those 8 samples to come from the same worker's part.
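
To make this concrete, a toy numpy illustration (not the actual SplitSampler code) of shard-then-shuffle sampling:

import numpy as np

num_samples, num_workers, local_batch_size = 16, 4, 2
indices = np.arange(num_samples)

# Partition first, with no global shuffle: each shard is a contiguous block.
shards = np.array_split(indices, num_workers)

rng = np.random.RandomState(0)
for worker in range(num_workers):
    shard = shards[worker].copy()
    rng.shuffle(shard)                    # shuffle only within the shard
    print('worker', worker, 'first local minibatch:', shard[:local_batch_size])

Every local minibatch is drawn from one contiguous quarter of the data, so an effective batch of 8 always contains exactly 2 samples per shard and never 4 from the same shard.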

There are two ways to solve this issue: use a ShardedIterator, or use a different random seed on each worker.

Let me call these sampling methods A (original), B (ShardedIterator), and C (different seeds). We may have

B > C, A. B guarantees sampling without replacement and fully mixes the samples. C samples with replacement. A samples without replacement but cannot fully mix the data.
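
For intuition, a small numpy sketch contrasting B and C (this assumes one reading of "different seeds", namely that each worker permutes the full index set with its own seed and keeps its own slice; it is illustrative only, not GluonNLP code):

import numpy as np

num_samples, num_workers = 16, 4
per_worker = num_samples // num_workers

# B: one global permutation with a shared seed, then split into shards.
# The shards form an exact partition: without replacement and fully mixed.
order = np.random.RandomState(123).permutation(num_samples)
b_shards = np.array_split(order, num_workers)
print('B covers every sample exactly once:',
      np.array_equal(np.sort(np.concatenate(b_shards)), np.arange(num_samples)))

# C: each worker permutes the full index set with its own seed and keeps a slice.
# Across workers some indices repeat and some are missed, so globally this
# behaves like sampling with replacement.
c_shards = [np.random.RandomState(seed).permutation(num_samples)[:per_worker]
            for seed in range(num_workers)]
print('C unique samples this round:',
      len(np.unique(np.concatenate(c_shards))), 'of', num_samples)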

@szhengac @ZiyueHuang Correct me if I'm wrong here.

ZiyueHuang commented 4 years ago

SplitSampler shards the data among the workers, which may improve the performance of distributed training and is common practice.

I took a look at sampler.py and dataloader.py. In the current implementation, a single worker shuffles the dataset and touches the instances one by one, while multiple workers first partition the dataset based on the indexes and then shuffle within each shard. These two procedures are not equivalent, and the second one may introduce a little bias. However, I think this subtle difference may have little influence in practice. One way to solve this is to first shuffle the dataset (using the same seed across all workers) and then partition it based on the indexes.

sxjscience commented 4 years ago

@ZheyuYe Per offline discussion with @ZiyueHuang, we found that the issue may be due to the fact that the current implementation does not shuffle the input samples before sharding: https://github.com/dmlc/gluon-nlp/blob/3fbe9619d9e68bc665f73c8cdf683213c6edd4d6/src/gluonnlp/data/sampler.py#L551-L564

Basically, we need to permute the training samples before dividing them into shards. Otherwise, the initial ordering may be highly biased, e.g., the first half could be all positive samples and the second half all negative samples.

Thus, we should shuffle the training indices with the same pre_sharding_seed and then shuffle the samples within each shard with the seed.
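
As an illustration of that ordering (plain numpy, not the actual sampler.py change; the default seed values are made up):

import numpy as np

def sharded_indices(num_samples, num_workers, rank, pre_sharding_seed=100, seed=0):
    indices = np.arange(num_samples)
    # The same pre_sharding_seed on every worker yields the same global
    # permutation, so the shards remain an exact partition of the dataset.
    np.random.RandomState(pre_sharding_seed).shuffle(indices)
    shard = np.array_split(indices, num_workers)[rank].copy()
    # Then shuffle within the local shard with the ordinary seed.
    np.random.RandomState(seed).shuffle(shard)
    return shard

# Example: 4 workers over 16 samples. The union of the shards still covers the
# dataset exactly once, but each shard is now a random subset rather than a
# contiguous block.
print([sharded_indices(16, 4, rank).tolist() for rank in range(4)])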