Closed. odp closed this 4 years ago.
Merging #326 (e001b0a) into master (f016043) will decrease coverage by 0.00%. The diff coverage is 86.66%.
@@ Coverage Diff @@
## master #326 +/- ##
==========================================
- Coverage 80.14% 80.14% -0.01%
==========================================
Files 134 135 +1
Lines 11195 11220 +25
==========================================
+ Hits 8972 8992 +20
- Misses 2223 2228 +5
Impacted Files | Coverage Δ
---|---
texar/torch/data/data/data_base.py | 82.97% <ø> (ø)
texar/torch/distributed/__init__.py | 80.00% <86.66%> (ø)
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f016043...e001b0a.
This PR adds a distributed adaptive API to Texar-PyTorch with the help of AdaptDL. Specifically:

- `tx.distributed.AdaptiveDataParallel`, which is similar to `torch.nn.parallel.DistributedDataParallel`, should be used to wrap the model, optimizer, and lr_scheduler at the beginning of the code. This makes them distributed, restart-safe, and adaptive.
- `tx.distributed.AdaptiveDataIterator` mimics `tx.data.DataIterator` but provides an adaptive, distributed version of it. With it, iterators become restart-safe (able to reposition correctly after training is scaled up or down) and distributed (data is automatically partitioned across replicas).

Using these APIs and a few others taken directly from the `adaptdl.torch` API, Texar-PyTorch models can be trained on an AdaptDL cluster in an elastic, distributed data-parallel fashion.

`examples/bert/bert_classifier_adaptive.py` is the adaptive version of `examples/bert/bert_classifier_main.py` and demonstrates the use of the above API. It can be trained on a cluster by running `examples/bert/run_bert_adaptive.sh` after setting up an AdaptDL Kubernetes cluster or a MicroK8s environment.

The API is also compatible with standalone training on a single-node machine or a multi-node (non-k8s) cluster. This mode does not support elasticity, but it can be used for testing.
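
To make "restart-safe" and "distributed" concrete, here is a minimal pure-Python sketch of the idea behind an adaptive iterator. It is not the code in this PR and does not use Texar or AdaptDL; the names `shard` and `run_steps` are hypothetical, and the only checkpointed state assumed is a global count of consumed samples. Each replica takes a strided share of the unconsumed samples, and after a rescale the remaining samples are re-partitioned across the new replica count.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def shard(data: Sequence[int], num_replicas: int, rank: int,
          consumed: int = 0) -> Iterator[int]:
    """Yield this replica's share of the samples not yet consumed.

    `consumed` is the number of samples already processed globally
    (summed over all replicas) before the current (re)start.
    Hypothetical helper for illustration only.
    """
    for index in range(consumed + rank, len(data), num_replicas):
        yield data[index]


def run_steps(data: Sequence[int], num_replicas: int, consumed: int,
              steps: Optional[int] = None) -> Tuple[List[int], int]:
    """Simulate all replicas stepping in lockstep.

    Runs for `steps` steps (or until the data is exhausted) and returns
    the samples seen plus the updated global `consumed` counter, which
    is the only state a checkpoint would need to store here.
    """
    iterators = [shard(data, num_replicas, r, consumed)
                 for r in range(num_replicas)]
    seen: List[int] = []
    step = 0
    while steps is None or step < steps:
        batch = [next(it, None) for it in iterators]
        batch = [x for x in batch if x is not None]
        if not batch:
            break  # every replica has run out of samples
        seen.extend(batch)
        step += 1
    return seen, consumed + len(seen)


data = list(range(10))

# Start with 2 replicas and stop after 2 steps (simulating a rescale).
first, consumed = run_steps(data, num_replicas=2, consumed=0, steps=2)

# Restart with 3 replicas; the iterator repositions past consumed samples.
second, consumed_after = run_steps(data, num_replicas=3, consumed=consumed)

print(first)   # samples seen before the rescale
print(second)  # samples seen after the rescale
```

In this sketch every sample is visited exactly once across the scale-up, which is the property the PR description attributes to `tx.distributed.AdaptiveDataIterator`. How the real iterator tracks and restores its position is not described in this PR text.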