-
Thanks for your excellent work!
But I encountered some problems when training on the KITTI dataset. I used two NVIDIA GeForce 2080 Ti GPUs for training, and set --multiprocessing_distributed==True, --do_ onli…
-
Communication initialization fails if the number of nodes is set to 3.
This occurs in [get_group](https://github.com/microsoft/SuperScaler/blob/fa80ad02c1dc855ca85b591fb689a09598d2cb7e/runtime…
-
Hi, when I run the latest v1.3 code for fine-tuning, training fails every time the program tries to save the checkpoint, as shown below. I have never encountered this issue when running the previo…
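For reference, a common pattern for checkpointing under torch.distributed (a generic sketch, not the v1.3 code; the function and argument names are illustrative) is to write from rank 0 only and synchronize afterwards:
```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, path, rank):
    # Only rank 0 writes to disk; concurrent writes from several
    # ranks to the same path are a common source of save failures.
    if rank == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)
    # All ranks wait here so nobody races ahead of the save.
    dist.barrier()
```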
-
### 🐛 Describe the bug
PyTorch deadlocks when using distributed training.
### To Reproduce
```python
import argparse
import os
import torch
import torch.distributed as dist
import torch.multiproces…
```
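The repro above is cut off; a minimal self-contained script in the same spirit looks like this (a sketch, assuming the gloo backend on a single machine; the world size, address, and port are placeholders):
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Every process joins the same group; mismatched collectives
    # (one rank skipping an all_reduce) are a classic deadlock cause.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    t = torch.ones(1)
    dist.all_reduce(t)  # blocks until every rank has called it
    print(f"rank {rank}: {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```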
-
Hi, FutureXiang
Thanks for your code! When I'm training on CIFAR-10, I encounter an error during distributed training:
```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local…
```
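Exit code 2 is what argparse returns on a usage error, so the worker is most likely rejecting its command-line arguments before training starts; a minimal illustration (a generic parser, not this project's):
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, required=True)
# Launching without --epochs (or with a misspelled flag) makes
# argparse print a usage message and call sys.exit(2), which
# torch.distributed.elastic then reports as "exitcode: 2".
args = parser.parse_args()
```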
-
Machines
- dual 4090 ada
- dual A4500
- single A6000
- single A4000
- single 3500 Ada
Concentrate on the A6000 and A4000 with 10 Gbps networking (a minimal multi-worker sketch follows the link below)
- https://www.tensorflow.org/guide/distributed_trai…
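Per the linked guide, multi-worker training across the A6000 and A4000 boxes would use tf.distribute.MultiWorkerMirroredStrategy; a minimal sketch (host addresses, port, and model are placeholders):
```python
import json
import os

import tensorflow as tf

# Assumed two-host cluster (e.g. the A6000 box and the A4000 box on
# the 10 Gbps link); addresses and port are illustrative.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second host
})

# TF_CONFIG must be set before the strategy is created.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```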
-
If you see the following error when building a Dockerfile:
```
sh: 1: Bad substitution
```
It's likely caused by your Dockerfile running `sh` rather than `bash`; `sh` doesn't support variables wit…
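A minimal sketch of the fix (the base image, build arg, and substitution are illustrative): switch the build shell to bash with the SHELL instruction so bash-only expansions work in RUN steps.
```dockerfile
FROM ubuntu:22.04

# RUN defaults to /bin/sh, which rejects bash-only expansions such
# as ${VAR//./_} with "Bad substitution". Switch the shell to bash.
SHELL ["/bin/bash", "-c"]

ARG VERSION=1.2.3
RUN echo "underscored: ${VERSION//./_}"
```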
-
### Willingness to contribute
Yes. I can contribute this feature independently.
### Proposal Summary
LLMs and other models are trained by running over multiple nodes with multiple GPUs spanning …
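For context on the scenario, a common convention in multi-node jobs is to log to the tracking server from rank 0 only (a generic sketch, assuming torchrun-style RANK/WORLD_SIZE environment variables; not the proposed API):
```python
import os

import mlflow

# Only the rank-0 process creates and writes to the run, so one
# training job maps to one tracked run instead of one per GPU.
if int(os.environ.get("RANK", "0")) == 0:
    with mlflow.start_run():
        mlflow.log_param("world_size", os.environ.get("WORLD_SIZE", "1"))
        mlflow.log_metric("loss", 0.42, step=1)
```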
-
### 🐛 Describe the bug
code:
```python
from torchtext.vocab import build_vocab_from_iterator
import torchtext
from typing import Iterable, List
import random
import os
import torch
from tqdm …
```
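The snippet above is cut off; for reference, a minimal build_vocab_from_iterator example (the toy corpus and specials are illustrative):
```python
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(lines):
    # build_vocab_from_iterator expects an iterator over token lists.
    for line in lines:
        yield line.split()

corpus = ["hello world", "hello torchtext"]
vocab = build_vocab_from_iterator(yield_tokens(corpus), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
print(vocab["hello"], vocab["missing"])  # "missing" falls back to <unk>
```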
-
### Software environment
```Markdown
- paddlepaddle:
- paddlepaddle-gpu: 3.0.0b1
- paddlenlp: https://github.com/ZHUI/PaddleNLP/tree/sci/benchmark
```
### Duplicate issues
- [X] I have searched the existing issues
### Error descr…