zheyuye closed this 4 years ago
Merging #1353 into master will decrease coverage by 0.11%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #1353 +/- ##
==========================================
- Coverage 81.65% 81.53% -0.12%
==========================================
Files 52 52
Lines 6862 6901 +39
==========================================
+ Hits 5603 5627 +24
- Misses 1259 1274 +15
Impacted Files | Coverage Δ | |
---|---|---|
src/gluonnlp/utils/parameter.py | 86.53% <0.00%> (-8.92%) | :arrow_down: |
src/gluonnlp/utils/preprocessing.py | 94.28% <0.00%> (-5.72%) | :arrow_down: |
src/gluonnlp/data/loading.py | 78.11% <0.00%> (-3.02%) | :arrow_down: |
conftest.py | 85.00% <0.00%> (-0.72%) | :arrow_down: |
setup.py | 0.00% <0.00%> (ø) | |
src/gluonnlp/models/bart.py | 70.62% <0.00%> (ø) | |
src/gluonnlp/models/bert.py | 87.67% <0.00%> (ø) | |
src/gluonnlp/models/albert.py | 87.80% <0.00%> (ø) | |
src/gluonnlp/models/electra.py | 57.89% <0.00%> (ø) | |
... and 7 more |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 203d84c...0de5093.
Please do not merge this until I re-run mobilebert with horovod again.
@sxjscience I got the following results after the revision
{
"exact": 41.20272888065358,
"f1": 44.673086401387295,
"total": 11873,
"HasAns_exact": 82.52361673414305,
"HasAns_f1": 89.47428388051136,
"HasAns_total": 5928,
"NoAns_exact": 0.0,
"NoAns_f1": 0.0,
"NoAns_total": 5945,
"best_exact": 77.25090541564896,
"best_exact_thresh": -1.8302667140960693,
"best_f1": 80.05376875130503,
"best_f1_thresh": -1.6563208103179932,
"best_ckpt": "google_uncased_mobilebert_squad2.0_20615.params"
}
which is slightly lower than the previous results (80.54/77.81) obtained by fine-tuning without horovod. However, horovod does speed up the fine-tuning phase, cutting the time roughly in half (2.55 vs 5.05 hours).
@ZiyueHuang @szhengac Do you think that the change here is reasonable? The change is that each worker will calculate its average loss and then we do an allreduce. One issue is that the batch size of each worker may not be the same. However, I think the effect should be negligible.
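A minimal sketch of the concern (plain Python simulating the allreduce, not the PR's actual MXNet/horovod code): averaging the per-worker mean losses is only exact when every worker has the same batch size; with unequal batches, a sample-weighted reduction recovers the true global mean.

```python
def allreduce_mean(values):
    """Simulate horovod.allreduce(..., average=True) across workers."""
    return sum(values) / len(values)

def global_loss_unweighted(worker_losses):
    # Each worker averages its own batch, then we allreduce the means
    # (the scheme described in this PR).
    means = [sum(l) / len(l) for l in worker_losses]
    return allreduce_mean(means)

def global_loss_weighted(worker_losses):
    # Exact alternative: reduce loss sums and sample counts, then divide.
    total = sum(sum(l) for l in worker_losses)
    count = sum(len(l) for l in worker_losses)
    return total / count

# Equal batch sizes: both reductions agree.
equal = [[1.0, 3.0], [2.0, 4.0]]
assert global_loss_unweighted(equal) == global_loss_weighted(equal) == 2.5

# Unequal batch sizes: the unweighted version is slightly biased.
unequal = [[1.0, 3.0, 5.0], [2.0]]
print(global_loss_unweighted(unequal))  # 2.5
print(global_loss_weighted(unequal))    # 2.75
```

If the batch sizes only differ on the last partial batch of an epoch, the bias is indeed negligible, which matches the conclusion below that the implementation keeps batch_size equal anyway.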
Then we should check how large the batch-size difference is across workers. If it is large, consider changing the sampling strategy to make them closer.
@szhengac @ZheyuYe I just noticed that our current implementation ensures that the batch_size will always be the same.
@ZheyuYe I think the performance degradation may be caused by the SplitSampler: https://github.com/dmlc/gluon-nlp/blob/0223686dbc25bae416dfa764dc94ef2f6664fb00/scripts/question_answering/run_squad.py#L447-L448.
Basically, the current code divides the input into num_workers parts, shuffles each part internally, and then draws minibatches from it. Thus, samples from different shards can never be mixed within a batch. E.g., assume that the local batch_size is 2 and there are 4 workers, so the effective batch_size should be 8. It will then be impossible for 4 samples in an effective batch to come from the same shard.
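The behaviour described above can be sketched as follows (a hypothetical simplification for illustration, not the GluonNLP SplitSampler implementation): the data is partitioned first, then shuffled only within each shard, so each worker's minibatch can only ever draw from its own shard.

```python
import random

def split_sampler_batch(indices, num_workers, local_bs, seed=0):
    """Partition-then-shuffle sampling: returns one effective batch,
    i.e. the local minibatch contributed by every worker."""
    shard_size = len(indices) // num_workers
    shards = [indices[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    for w, shard in enumerate(shards):
        random.Random(seed + w).shuffle(shard)  # shuffle within shard only
    return [shard[:local_bs] for shard in shards]

data = list(range(8))  # samples 0..7, 4 workers, local batch size 2
batch = split_sampler_batch(data, num_workers=4, local_bs=2)
# Worker w can only contribute samples from its own shard [2w, 2w+1],
# so an effective batch never mixes more than 2 samples from one shard.
for w, part in enumerate(batch):
    assert all(2 * w <= s < 2 * (w + 1) for s in part)
```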
There are two ways to solve this issue: use a ShardedIterator, or let each worker sample with a different seed.
Let me call these sampling methods A (original), B (ShardedIterator), and C (different seeds). We may have B > C, A: B guarantees sampling without replacement and fully mixes the samples; C samples with replacement; A samples without replacement but cannot fully mix the data.
@szhengac @ZiyueHuang Correct me if I'm wrong here.
SplitSampler handles data sharding among the workers, which may improve the performance of distributed training and is a common practice.
I took a look at sampler.py and dataloader.py. In the current implementation, for a single worker, it will shuffle the dataset and touch the instances one by one. For multiple workers, it will first partition the dataset based on the indices, and then shuffle within each shard. These two procedures are not equivalent, and the second one may introduce a little bias. However, I think this subtle difference may have little influence in practice. One way to solve this is to first shuffle the dataset (using the same seed across all workers) and then partition it based on the indices.
@ZheyuYe Per offline discussion with @ZiyueHuang, we find that it may be due to the fact that the current implementation does not shuffle the input samples before sharding: https://github.com/dmlc/gluon-nlp/blob/3fbe9619d9e68bc665f73c8cdf683213c6edd4d6/src/gluonnlp/data/sampler.py#L551-L564
Basically, we will need to first permute the training samples before dividing them into shards. Otherwise, it may happen that the initial distribution is highly biased, e.g., the first half are all positive samples and the second half are all negative samples.
Thus, we should shuffle the training indices with the same pre_sharding_seed and then shuffle the samples within each shard with the seed.
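A hedged sketch of the proposed fix (the names pre_sharding_seed and seed follow the discussion above; details may differ from the final GluonNLP code): every worker applies the same global permutation first, then takes its shard, then shuffles within the shard.

```python
import random

def make_shard(indices, num_workers, worker_rank, pre_sharding_seed, seed):
    indices = list(indices)
    # 1) Same pre_sharding_seed on every worker -> identical global permutation,
    #    so a biased initial ordering (e.g. all positives first) gets mixed.
    random.Random(pre_sharding_seed).shuffle(indices)
    # 2) Deterministic sharding of the permuted indices.
    shard = indices[worker_rank::num_workers]
    # 3) Per-epoch shuffle within the shard, using the per-worker seed.
    random.Random(seed + worker_rank).shuffle(shard)
    return shard

shards = [make_shard(range(10), num_workers=4, worker_rank=w,
                     pre_sharding_seed=7, seed=1)
          for w in range(4)]
# Shards are disjoint and cover all samples, regardless of the initial order.
flat = sorted(s for shard in shards for s in shard)
assert flat == list(range(10))
```

Because step 1 uses the same seed everywhere, the workers agree on the permutation without any communication, which is why this stays a pure sampler-side change.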
Description
Fix the SQuAD script to support horovod, as https://github.com/dmlc/gluon-nlp/pull/1319 did for ELECTRA pretraining.
Got a result of 77.62/80.39 for mobilebert. See https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering for the comparison.
Comments
@sxjscience @ZiyueHuang
cc @dmlc/gluon-nlp-team