-
## Integrating DeepSpeed with PyTorch Lightning
Integrating DeepSpeed with PyTorch Lightning can significantly enhance training efficiency and scalability, especially for large models and distributed training.
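As a minimal sketch (assuming PyTorch Lightning 2.x with the `deepspeed` package installed; `TinyModule` below is a hypothetical stand-in for a real model), enabling DeepSpeed is mostly a matter of choosing a strategy string on the `Trainer`:

```python
import torch
import lightning.pytorch as pl  # `import pytorch_lightning as pl` on 1.x

class TinyModule(pl.LightningModule):
    """Hypothetical stand-in for a real model."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="deepspeed_stage_2",  # ZeRO stage 2: shard optimizer state and gradients
    precision="16-mixed",          # mixed precision pairs well with DeepSpeed
)
# trainer.fit(TinyModule(), train_dataloaders=...)  # supply your own DataLoader
```

Higher ZeRO stages ("deepspeed_stage_3") shard the model parameters as well, trading communication for memory.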
-
### Issue Type
Documentation Feature Request
### Source
source
### Keras Version
Keras 2.13.1
### Custom Code
Yes
### OS Platform and Distribution
Linux Ubuntu 22.04
### Python version
3.9
…
-
Hi,
I have two RTX A6000 GPUs available for training (device IDs 0 and 1).
I run the GDRN training as: "./core/gdrn_modeling/train_gdrn.sh 0,1". The training starts as usual, but it is much slower…
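Not a GDRN-specific fix, but a quick sanity check (plain PyTorch, nothing from the GDRN repo assumed) that both devices are actually visible before blaming the script:

```python
import torch

# Both A6000s should show up before the launcher is even involved;
# if CUDA_VISIBLE_DEVICES is mis-set, one of them will be missing here.
print("visible GPUs:", torch.cuda.device_count())   # expect 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
print("NCCL available:", torch.distributed.is_nccl_available())
```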
-
I followed the [step-by-step-tutorial](https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md) to run distributed training with MXNet and TensorFlow, and both hang.
I have 3 nodes…
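For what it's worth, BytePS reads its cluster layout from the ps-lite `DMLC_*` environment variables, and a mismatch across nodes is a common cause of silent hangs; a small check like the sketch below (plain Python, assuming only the documented variable names) can be run on each node:

```python
import os

# Scheduler, servers, and workers must all agree on these values;
# DMLC_PS_ROOT_URI/PORT must point at the scheduler from every node.
for name in ("DMLC_ROLE", "DMLC_NUM_WORKER", "DMLC_NUM_SERVER",
             "DMLC_PS_ROOT_URI", "DMLC_PS_ROOT_PORT"):
    print(f"{name}={os.environ.get(name, '<unset>')}")
```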
-
Hello,
I saved all the files (config.json, metadata.list) in UTF-8 without BOM format, but when I run the training script
`bash train.sh ./data/example/config.json 1`
it always reports the
…
-
Why write `parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter, do not modify')` here? If I want to use DDP, should I change the default to 0?
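For context, the usual pattern (a sketch of the common DDP idiom, not necessarily this repo's exact code) is that `torch.distributed.launch` injects `--local_rank` into each spawned process, so the default of `-1` just means "not running under the launcher" and should not be edited by hand:

```python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# Filled in automatically by torch.distributed.launch for every process;
# -1 signals single-GPU / non-DDP mode.
parser.add_argument('--local_rank', type=int, default=-1,
                    help='DDP parameter, do not modify')
args = parser.parse_args()

if args.local_rank != -1:           # launched with torch.distributed.launch
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl')
```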
-
You use NCCL in the distributed training. My question is: do you use the NCCL that comes with PyTorch, or do you install NCCL separately? And how do you set your environment variables? I am quite confused about it. Thanks…
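As far as I know, the PyTorch binaries ship with NCCL bundled, so no separate install is usually needed; a quick check (a sketch using only standard PyTorch calls):

```python
import torch

# The pip/conda wheels bundle NCCL; a separate system install is optional.
print(torch.distributed.is_nccl_available())  # True if this build has NCCL
print(torch.cuda.nccl.version())              # version of the bundled NCCL

# Useful environment variables (set before launching, shown for reference):
#   NCCL_DEBUG=INFO           verbose NCCL logging
#   NCCL_SOCKET_IFNAME=eth0   pin NCCL to a specific network interface
```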
-
Hello,
Any plans to have a script for training XLNet on distributed GPUs?
Maybe with Horovod or MultiWorkerMirroredStrategy?
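Not an official XLNet script, but a minimal sketch of how `MultiWorkerMirroredStrategy` is usually wired up in TF 2.x (the toy `Dense` model stands in for XLNet; each worker's role comes from the `TF_CONFIG` environment variable):

```python
import tensorflow as tf

# Each worker reads its cluster role from the TF_CONFIG env variable.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created in scope are mirrored and kept in sync across workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset)  # each worker trains on its own shard of the data
```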
-
Current state:
https://gist.github.com/louis030195/9a5cf53415989d8191508a796e00f754
-
### Description
Hello everyone,
I'm a newbie with t2t and TensorFlow. I tried to use t2t to run the transformer_moe model on 2 machines, but it failed. Each machine has only one GPU. Hope you guys could help…
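I can't speak to t2t's own flags, but at the TensorFlow level a two-machine cluster is usually described with `TF_CONFIG`; a hypothetical sketch (hosts and ports are placeholders):

```python
import json
import os

# Hypothetical two-machine layout; replace the hosts with your own.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second machine
})
```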