-
## Description
The CUJ (critical user journey) looks like:
```
envd run --image xx --replicas 20
```
Then there is a single interactive shell; any command the user types runs on all replicas.
Then …
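To make the journey concrete, here is a rough sketch of the fan-out behaviour described above — my own illustration, not envd's implementation. The pod names and the `kubectl exec` transport are assumptions:
```python
# Hypothetical fan-out shell: broadcast each typed command to every
# replica. Pod names and the kubectl-based transport are assumptions.
import subprocess
from concurrent.futures import ThreadPoolExecutor

REPLICAS = [f"envd-replica-{i}" for i in range(20)]  # placeholder names

def run_on_replica(pod: str, command: str) -> str:
    result = subprocess.run(
        ["kubectl", "exec", pod, "--", "sh", "-c", command],
        capture_output=True, text=True,
    )
    return f"[{pod}] {(result.stdout or result.stderr).strip()}"

while True:
    cmd = input("envd> ")
    if cmd in ("exit", "quit"):
        break
    with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        for line in pool.map(lambda pod: run_on_replica(pod, cmd), REPLICAS):
            print(line)
```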
-
Hi,
I have two RTX A6000 GPUs available for training (device IDs 0 and 1).
I run the GDRN training as `./core/gdrn_modeling/train_gdrn.sh 0,1`. The training starts as usual, but it is much slowe…
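Not part of the original report, but a quick sanity check worth running first (a sketch, assuming PyTorch DDP underneath) to confirm both A6000s are visible and that one process per GPU was actually spawned:
```python
# Sketch: confirm both GPUs are visible and that the DDP launcher
# spawned one process per GPU (it sets WORLD_SIZE/RANK in the env).
import os
import torch

print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
print("WORLD_SIZE =", os.environ.get("WORLD_SIZE"),
      "| RANK =", os.environ.get("RANK"))
```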
-
### Describe the Bug
Error message:
We are training an XPU task over a container network and hit this error:
```
Traceback (most recent call last):
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/finetune.py", line 13, in <module>
  File "…
```
-
Why is it written as `parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter, do not modify')`? If I want to use DDP, should I change the default to 0?
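For context, a minimal sketch of how that default interacts with the launcher (the script name `train.py` is a placeholder): `torch.distributed.launch` injects `--local_rank=<rank>` into each process it spawns, so the default of -1 simply means "started without the launcher" and should not be changed by hand.
```python
# Minimal sketch (train.py is a placeholder name). The launcher, not
# the user, supplies --local_rank; -1 means "not running under DDP".
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1,
                    help='DDP parameter, do not modify')
args = parser.parse_args()

if args.local_rank != -1:
    # Launched via: python -m torch.distributed.launch --nproc_per_node=N train.py
    # Each spawned process receives its own --local_rank (0..N-1).
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl')
```
So to use DDP you leave the default alone and launch with `python -m torch.distributed.launch --nproc_per_node=2 train.py`; each process then sees `local_rank` 0 or 1 automatically.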
-
Hello, I hope this message finds you well. I am reaching out to inquire about the best practices for debugging distributed training setups, especially when deploying to Kubernetes with Docker. Could y…
-
I followed the [step-by-step-tutorial](https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md) to run distributed training with MXNet and TensorFlow; both hang.
I have 3 nodes…
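One thing worth checking (my own suggestion, not from the tutorial): ps-lite based setups such as BytePS tend to hang silently when a worker cannot reach the scheduler, so verify that the `DMLC_PS_ROOT_URI`:`DMLC_PS_ROOT_PORT` endpoint is reachable from every node. A sketch with placeholder address and port:
```python
# Sketch: verify the scheduler endpoint (DMLC_PS_ROOT_URI /
# DMLC_PS_ROOT_PORT) is reachable; host and port are placeholders.
import socket

SCHEDULER = ("10.0.0.1", 1234)  # replace with your DMLC_PS_ROOT_URI / PORT
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    try:
        s.connect(SCHEDULER)
        print("scheduler port reachable")
    except OSError as exc:
        print("cannot reach scheduler:", exc)
```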
-
### System Info
```shell
accelerate 1.1.1
neuronx-cc 2.14.227.0+2d4f85be
neuronx-distributed 0.8.0
neuronx-distributed-training 1.0.0
optimum …
-
Hello author.
The following code and options were used for training (the code was rewritten to work with that option, otherwise unchanged):
`python3 -m torch.distributed.launch --nproc_per_node=1 tra…
-
Hi, can this project run distributed training across multiple nodes?
-
[TensorFlow v0.8](http://www.theregister.co.uk/2016/04/14/tensorflow_08_google_release/) offers a [way to train in parallel](http://googleresearch.blogspot.com/2016/04/announcing-tensorflow-08-now-wit…
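For reference, the distributed runtime that shipped with that release centered on `tf.train.ClusterSpec` and `tf.train.Server`; a minimal sketch with placeholder host:port values (each task runs the same cluster description with its own `job_name`/`task_index`):
```python
# Sketch of the TF 0.8-era distributed setup; addresses are placeholders.
import tensorflow as tf

# Every task in the cluster gets the same cluster description.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts one server for its own role.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # a parameter server just serves variables; workers build the graph instead
```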