-
I am using two machines, each with 4 GPUs. I run the command: accelerate launch --dynamo_backend no --machine_rank 0 --main_process_ip 192.168.68.249 --main_process_port 27828 --mixed_precision no --multi_gpu --num_machines 2 --num_processe…
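For reference, a minimal connectivity-check sketch (a hypothetical script, not train_decoder.py) that can be launched with the same command on both machines, changing only `--machine_rank` to 1 on the second node; each process reports its rank so you can confirm all 8 processes (2 machines × 4 GPUs) joined one group.
```python
# Hypothetical sanity-check script; assumes the same `accelerate launch ...` command is
# run on both machines, with `--machine_rank 1` on the second one.
from accelerate import Accelerator

accelerator = Accelerator()

# With 2 machines x 4 GPUs, num_processes should report 8 once both nodes connect.
print(
    f"global rank {accelerator.process_index}/{accelerator.num_processes}, "
    f"local GPU {accelerator.local_process_index}"
)
accelerator.wait_for_everyone()
accelerator.print("all processes reached the barrier")  # printed on the main process only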
-
Encountered the following error while trying to run train_decoder.py
```sh
[2024-11-05 06:09:22,154] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
…
-
### Bug description
The DDP training gets stuck at the 1st iteration, and it is always waiting on a pid:
![image](https://github.com/user-attachments/assets/e85d5e39-a24e-41e0-8bea-bcaa004a3473)
os.waitpid()…
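A diagnostic sketch (not the reporter's configuration): turning on NCCL and torch.distributed debug output and setting an explicit timeout so a hang like the one above surfaces as an error instead of blocking indefinitely in os.waitpid().
```python
# Hypothetical diagnostic settings; the env vars must be set before the process group is created.
import datetime
import os

import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log NCCL transport/topology setup
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # report mismatched/missing collectives

dist.init_process_group(
    backend="nccl",
    # With an explicit timeout, a stuck collective raises instead of hanging forever
    # (for NCCL this relies on async error handling, enabled by default in recent PyTorch).
    timeout=datetime.timedelta(minutes=10),
)
```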
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.31
-…
-
Hi, I'm using the tutorial https://github.com/pytorch/tutorials/blob/master/intermediate_source/ddp_tutorial.rst for DDP training, with 4 GPUs in my own code, following the Basic Use Case. But when I …
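A condensed sketch in the spirit of the tutorial's Basic Use Case for the 4-GPU setup described above; the toy model and random data are placeholders for the reporter's own code.
```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def demo_basic(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced across the 4 ranks

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss_fn(outputs, torch.randn(20, 10).to(rank)).backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # one process per GPU
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size)
```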
-
**Describe the bug**
After training a TFT with the `ddp_spawn` strategy on multiple GPUs in Amazon SageMaker, the prediction returned by the trainer is None, leading to a `TypeError: 'NoneType' object is …
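One hedged workaround sketch (a tiny placeholder LightningModule stands in for the TFT; only the Trainer wiring matters): fit under `ddp_spawn`, then run prediction in a fresh single-process trainer so the result is produced in the parent process rather than in a spawned worker that cannot return it.
```python
# Hypothetical minimal example; TinyModel and the random dataset are placeholders.
import pytorch_lightning as pl
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    """Placeholder for the TFT."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def predict_step(self, batch, batch_idx):
        x, _ = batch
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
    model = TinyModel()

    # Fit under ddp_spawn across the SageMaker GPUs ...
    pl.Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn", max_epochs=1).fit(model, loader)

    # ... then predict in a single-process trainer, so predict() returns the batches
    # in the parent process instead of None.
    predictions = pl.Trainer(accelerator="gpu", devices=1).predict(model, dataloaders=loader)
    print(predictions[0].shape)
```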
-
Torch splits TensorDatasets across processes when using a distributed data-parallel strategy (default with multiple CUDA-enabled GPUs)
https://pytorch.org/docs/stable/data.html#loading-batched-and-no…
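A minimal sketch of that splitting behaviour via `DistributedSampler` (the `num_replicas`/`rank` values are hard-coded here for illustration; in a real DDP run they are inferred from the initialized process group).
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100, dtype=torch.float32).unsqueeze(1))

# Each rank sees a distinct shard of the 100 samples; with 4 processes that is 25 per rank.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0)  # normally derived from dist.get_rank()
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

print(len(sampler), "of", len(dataset), "samples assigned to this rank")  # 25 of 100
```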
-
Hey everyone,
We wanted to let you know that we are considering the deprecation of the DataParallel (`torch.nn.DataParallel`, a.k.a. DP) module with the upcoming v1.11 release of PyTorch. Our plan is t…
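For context, a hedged migration sketch (a hypothetical single-node script, launched with `torchrun --nproc_per_node=<num_gpus>`) showing the wrapper change that moving from DP to DDP implies.
```python
# Hypothetical migration example: replace the single-process nn.DataParallel wrapper
# with one DistributedDataParallel process per GPU.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Before (DP): model = nn.DataParallel(nn.Linear(10, 10).cuda())

# After (DDP), launched via `torchrun --nproc_per_node=<num_gpus> script.py`:
dist.init_process_group("nccl")             # reads RANK/WORLD_SIZE/MASTER_* set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])  # also set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradient all-reduce across processes
```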
-
## Bug Description
I am running a distributed Linear model (20 parameters) across 2 GPU nodes, each node having 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating…
-
When I run this example of [training on multiple GPUs using Distributed Data Parallel (DDP)](https://docs.lightly.ai/self-supervised-learning/examples/simclr.html) on AWS SageMaker with 4 GPUs and …