dmlc / gluon-nlp

NLP made easy

https://nlp.gluon.ai/

Apache License 2.0

2.55k stars 538 forks source link

[FEATURE]Horovod support for training transformer (PART 2) #1301

Closed hutao965 closed 4 years ago

hutao965 commented 4 years ago

Checklist

Essentials

[x] PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
[x] Changes are complete (i.e. I finished coding on this PR)
[ ] All changes have test coverage
[ ] Code is well-documented

Changes

add horovodrun example in machine_translation/README.md
transformer base/large result (test and valid bleu)
ShardedIterator (to split sampler) and testcase
add test case to test situation that fail downloading in test_util_misc.py

codecov[bot] commented 4 years ago

Codecov Report

Merging #1301 into master will increase coverage by 0.31%. The diff coverage is 97.14%.

@@            Coverage Diff             @@
##           master    #1301      +/-   ##
==========================================
+ Coverage   84.14%   84.45%   +0.31%     
==========================================
  Files          42       42              
  Lines        6397     6422      +25     
==========================================
+ Hits         5383     5424      +41     
+ Misses       1014      998      -16

Impacted Files	Coverage Δ
src/gluonnlp/data/sampler.py	`96.55% <97.14%> (+0.32%)`	:arrow_up:
src/gluonnlp/utils/misc.py	`52.53% <0.00%> (+0.63%)`	:arrow_up:
src/gluonnlp/data/loading.py	`83.39% <0.00%> (+5.28%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 32e87d4...db8afdb. Read the comment docs.

sxjscience commented 4 years ago

For the diff check failure, it might be caused by the fact that the exception is not triggered: https://github.com/dmlc/gluon-nlp/blob/32e87d4d4aa20a6eb658ee90d765ccffbd160571/src/gluonnlp/utils/misc.py#L562-L568

Thus, we may try to add a test-case which downloads an non-existing file and use pytest.assertRaises.

sxjscience commented 4 years ago

I find that the codecov is not very stable, especially the diff hit. @szha @leezu Would you know how could we improve it?

szha commented 4 years ago

assuming no race issue in result upload, this can be a result of non-determinism in tests, and the code path in the same test can vary.

if that's the case, we can mock the randomness to deterministically test each path. or if we still require the randomness, we can increase the trial number