-
### Bug description
When training a model using DDP and pl.callbacks.BackboneFinetuning, the model weights seem to start getting out of sync across processes after the backbone is unfrozen. Prio…
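For context, a minimal sketch of the kind of setup being described; the module, data, and unfreeze epoch below are illustrative assumptions, not the reporter's actual code:

```python
# Hedged repro sketch: BackboneFinetuning + DDP (all values are placeholders).
import torch
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning.callbacks import BackboneFinetuning
from torch.utils.data import DataLoader, TensorDataset


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # BackboneFinetuning expects the LightningModule to expose `self.backbone`.
        self.backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.head = nn.Linear(64, 10)

    def forward(self, x):
        return self.head(self.backbone(x))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        # Only the head is optimized at first; the callback adds the backbone
        # as a new param group when it unfreezes it.
        return torch.optim.SGD(self.head.parameters(), lr=1e-2)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,
        strategy="ddp",
        max_epochs=5,
        callbacks=[BackboneFinetuning(unfreeze_backbone_at_epoch=2)],
    )
    trainer.fit(LitModel(), train_dataloaders=DataLoader(data, batch_size=32))
```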
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
[2024-07-12 02:22:28,334] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda…
-
## Bug Description
I am running a distributed Linear model (20 parameters) across 2 GPU nodes, each with 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating…
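A minimal sketch of that setup: a 20-parameter `nn.Linear(4, 4)` (16 weights + 4 biases) wrapped in DistributedDataParallel. The data and hyperparameters are placeholders, and each node would launch it with something like `torchrun --nnodes=2 --nproc_per_node=2 train.py`:

```python
# Hedged sketch of a tiny linear model trained with DDP across 2 nodes x 2 GPUs.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun provides RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(4, 4).cuda(local_rank)   # 4*4 weights + 4 biases = 20 params
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):
        x = torch.randn(8, 4, device=local_rank)   # synthetic batch per rank
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                            # DDP all-reduces gradients here
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```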
-
I am encountering issues when using non-element-wise optimizers such as Adam-mini with DeepSpeed.
The documentation reads:
> The FP16 Optimizer is designed to maximize the achievable…
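The general pattern for handing a client-built optimizer to DeepSpeed (rather than letting it construct one from the JSON config) looks roughly like the sketch below. `torch.optim.Adam` stands in for Adam-mini, whose exact constructor I have not verified, and the config values are placeholders; `deepspeed.initialize()` with a client optimizer is the standard API.

```python
# Hedged sketch: passing a client-defined (e.g. non-element-wise) optimizer to DeepSpeed.
import deepspeed
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)

# Stand-in optimizer; Adam-mini would be constructed here instead (its import
# path and arguments are not shown because I have not verified them).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},   # routes training through the FP16 optimizer wrapper quoted above
}

# DeepSpeed wraps the client optimizer instead of building one from the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)
```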
-
### Bug description
I'm trying to run a job with several GPUs. My script immediately gets stuck after outputting:
```
python /home/negroni/deeponet-fno/src/burgers/pytorch_deeponet.py --ngpus 3…
```
-
## ❓ Questions
Hi, when trying to reproduce the training based on your released code, I ran into an issue when using multiple GPUs to train: I find that [https://github.com/facebo…
-
Hi, thank you for your work. I tried to use 4 GPUs to reproduce the results and set the number of epochs to 1 for debugging. However, the process hangs when computing prototypes after the base step's training, and …
-
### Search before asking
- [X] I have searched the Ultralytics YOLO [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/ultralytics/ultralytics/discussion…
-
Hi, great job! I really appreciate your amazing work.
However, we have several 4080s cards that we are trying to accelerate training with. We just tested your fast cross-entropy kernel, but we are encoun…
-
### 🐛 Describe the bug
How can I use DDP training in diffusion? I saw train_ddp.yaml, but there is nothing different from train_colossalai.yaml. How do I set the number of GPUs and nodes, or t…
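In plain PyTorch DDP (independent of this repo's launcher, which I have not checked), the GPU and node counts are usually set by the launch command rather than in the YAML config. A generic sketch, with an illustrative `torchrun` line in the comments:

```python
# Hedged sketch: with torchrun, the world size comes from the launcher, e.g.
#   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
#            --master_addr=10.0.0.1 --master_port=29500 \
#            train.py --config train_ddp.yaml
# (script name, config path, and addresses are placeholders).
import os
import torch.distributed as dist

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()           # nnodes * nproc_per_node
local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
print(f"rank {dist.get_rank()} / {world_size}, local GPU {local_rank}")
dist.destroy_process_group()
```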