asyml / texar-pytorch

Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
https://asyml.io
Apache License 2.0

Introduce distributed Adaptive API #326

Closed. odp closed this 3 years ago.

odp commented 3 years ago

This PR adds distributed Adaptive API to Texar-PyTorch with the help of AdaptDL.

Specifically:

  1. tx.distributed.AdaptiveDataParallel, which is similar to torch.nn.parallel.DistributedDataParallel, should be used to wrap the model, optimizer, and lr_scheduler at the beginning of the code. This makes them distributed, restart-safe, and adaptive.

  2. tx.distributed.AdaptiveDataIterator, which mimics tx.data.DataIterator, provides an adaptive, distributed version of it. With it, iterators become restart-safe (able to reposition correctly after a scale-up or scale-down of training) and distributed (data is automatically partitioned across replicas).
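To illustrate what "restart-safe" and "distributed" mean for an iterator, here is a minimal pure-Python sketch. It is not the AdaptiveDataIterator implementation: the class name ResumablePartitionedIterator, its parameters, and the strided partitioning scheme are all assumptions made for illustration only.

```python
from typing import Dict, Iterator, Sequence


class ResumablePartitionedIterator:
    """Hypothetical illustration; not part of Texar-PyTorch or AdaptDL."""

    def __init__(self, data: Sequence[int], rank: int,
                 num_replicas: int, start: int = 0) -> None:
        self.data = data
        self.rank = rank
        self.num_replicas = num_replicas
        # Global position: examples consumed so far across ALL replicas.
        self.position = start

    def __iter__(self) -> Iterator[int]:
        # Each step, this replica takes the element at offset `rank`
        # within the current window of `num_replicas` elements.
        while self.position + self.rank < len(self.data):
            item = self.data[self.position + self.rank]
            self.position += self.num_replicas
            yield item

    def state_dict(self) -> Dict[str, int]:
        # The global position is all that is needed to resume later,
        # even with a different number of replicas.
        return {"position": self.position}


data = list(range(10))
seen = []

# Phase 1: two replicas each consume two examples.
before = [ResumablePartitionedIterator(data, r, 2) for r in range(2)]
for it in before:
    stream = iter(it)
    seen += [next(stream), next(stream)]
checkpoint = before[0].state_dict()

# Phase 2: scale up to three replicas and resume from the checkpoint.
after = [
    ResumablePartitionedIterator(data, r, 3, start=checkpoint["position"])
    for r in range(3)
]
for it in after:
    seen += list(it)

# Every example is visited exactly once across the rescale.
print(sorted(seen) == data)
```

The sketch stores only a global position, so the number of replicas can change between the save and the resume without skipping or repeating data.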

Using these APIs and a few others taken directly from the adaptdl.torch API, Texar-PyTorch models can be trained on an AdaptDL cluster in an elastic, distributed data-parallel fashion. examples/bert/bert_classifier_adaptive.py is the adaptive version of examples/bert/bert_classifier_main.py and demonstrates the use of the above API. It can be trained on a cluster by running examples/bert/run_bert_adaptive.sh after setting up an AdaptDL Kubernetes cluster or a MicroK8s environment.

The API is also compatible with standalone training on a single-node machine or a multi-node (non-k8s) cluster. This mode does not support elasticity, but it can be used for testing.

codecov[bot] commented 3 years ago

Codecov Report

Merging #326 (e001b0a) into master (f016043) will decrease coverage by 0.00%. The diff coverage is 86.66%.


@@            Coverage Diff             @@
##           master     #326      +/-   ##
==========================================
- Coverage   80.14%   80.14%   -0.01%     
==========================================
  Files         134      135       +1     
  Lines       11195    11220      +25     
==========================================
+ Hits         8972     8992      +20     
- Misses       2223     2228       +5     
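The coverage percentages above can be recomputed from the Hits and Lines rows. The short sketch below does this to show why both columns display 80.14% even though the report flags a decrease. The helper name coverage_pct is hypothetical, and the sketch does not reproduce how Codecov itself rounds the delta.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage: covered lines over total lines."""
    return 100.0 * hits / lines


# Values taken from the Hits and Lines rows of the coverage diff.
before = coverage_pct(8972, 11195)  # master
after = coverage_pct(8992, 11220)   # this PR
delta = after - before

# Extra decimal places expose a difference hidden by two-decimal rounding.
print(f"before={before:.4f}% after={after:.4f}% delta={delta:+.4f}")
```

Both values round to 80.14%, and the true change is negative but well under 0.01 percentage points.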
Impacted Files                          Coverage Δ
texar/torch/data/data/data_base.py      82.97% <ø> (ø)
texar/torch/distributed/__init__.py     80.00% <86.66%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f016043...e001b0a.