-
## ❓ Questions and Help
When starting GPU SPMD training with `torchrun`, why does it need to be compiled once per machine, even though the resulting graph is the same? Is there any way to avoid this?
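For reference, a minimal sketch of enabling the persistent compilation cache, assuming a recent torch_xla that exposes `xr.initialize_cache`; the cache path is a placeholder, and it would have to live on storage every machine can reach for compiled graphs to be reused across machines:
```python
# Minimal sketch (assumption: a recent torch_xla with the persistent
# compilation cache API). The path below is a placeholder; a machine-local
# path only avoids recompiling on reruns of the same machine, while a shared
# filesystem path would be needed for machines to reuse each other's graphs.
import torch_xla.runtime as xr

xr.initialize_cache("/tmp/xla_compile_cache", readonly=False)

# ... set up SPMD sharding and run the training loop as usual ...
```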
-
Hi, I am trying distributed training with 2 machines, with 4 GPUs in each machine.
On the master machine, I run:
```Shell
python -u tools/run_net.py \
  --cfg configs/Kinetics/SLOWFAST_8x8_R50.yaml \
  --…
```
-
### System Info
```Shell
- `Accelerate` version: 0.33.0
- `accelerate` bash location: /miniconda3/envs/SDXL/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.24.4
- PyTorch version (…
```
-
This has been on my TODO list for a while. Putting it here in case I forget.
-
When I run this example [runs on multiple GPUs using Distributed Data Parallel (DDP) training](https://docs.lightly.ai/self-supervised-learning/examples/simclr.html) on AWS SageMaker with 4 GPUs and …
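For reference, a minimal, self-contained sketch of the kind of 4-GPU DDP launch that example performs, assuming the PyTorch Lightning `Trainer` the Lightly tutorial is built on; the tiny module and random dataset below are placeholders for the SimCLR model and real data:
```python
# Self-contained sketch of a 4-GPU DDP run with PyTorch Lightning (assumption:
# this is the stack the linked example uses). The module and dataset are
# placeholders standing in for the SimCLR model and real dataloader.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    loader = DataLoader(data, batch_size=32)
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="gpu",
        devices=4,          # one DDP process per GPU
        strategy="ddp",     # Distributed Data Parallel
    )
    trainer.fit(TinyModule(), loader)
```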
-
After reading some of the code, it's hard to fully understand how distributed training works in it. I guess 'Experiments' is a wrapper that handles the distributed learning, but I'm not sure…
-
Hi, I appreciate your repos. I've been using the clip-iqa model in your repo for study purposes.
It worked well in a single-GPU setting when I followed your simple training scripts.
I want to use distri…
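In case it helps frame the question, this is the generic single-node multi-GPU pattern with `torch.distributed`/DDP I have in mind (just a sketch, not the repository's own training script; the linear layer stands in for however the CLIP-IQA model is actually built):
```python
# Generic DDP sketch (not the repository's own script). Launch with, e.g.:
#   torchrun --nproc_per_node=4 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets LOCAL_RANK and the rendezvous env vars for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model; in practice this would be the CLIP-IQA model.
    model = torch.nn.Linear(16, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: use a DistributedSampler so each rank sees its own shard ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```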
-
Hi, when I run the latest v1.3 code for fine-tuning, training fails every time the program tries to save the checkpoint, as shown below. I never hit this issue when running the previo…
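For what it's worth, the usual checkpointing pattern I know of under DDP is to save from rank 0 only and synchronize the other ranks with a barrier, roughly like this sketch (generic, not the project's own code):
```python
# Generic sketch of rank-0-only checkpointing under torch.distributed
# (not the project's code). Assumes the process group is already initialized,
# e.g. by torchrun.
import torch
import torch.distributed as dist


def save_checkpoint(model, path):
    if dist.get_rank() == 0:
        # For a DDP-wrapped model, save model.module.state_dict() so the
        # checkpoint stays loadable without the DDP wrapper.
        state = model.module.state_dict() if hasattr(model, "module") else model.state_dict()
        torch.save(state, path)
    dist.barrier()  # keep all ranks in sync around the save
```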
-
## 🐛 Bug
I am trying to use the Aim remote server to track experiments. I'm able to use the Aim remote server without any issues when training with a single GPU, but I get an RPC error when using distrib…
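For context, the pattern I have seen for combining a tracking server with multi-GPU training is to create the client on rank 0 only, so a single process opens the RPC connection; a rough sketch of that idea (my assumption, not a confirmed fix, and the `aim://` address is a placeholder):
```python
# Rough sketch (assumption, not a confirmed fix): create the Aim Run on rank 0
# only, so a single process talks to the remote tracking server. Assumes the
# torch.distributed process group is already initialized (e.g. via torchrun);
# the aim:// address is a placeholder.
import torch.distributed as dist
from aim import Run

run = Run(repo="aim://tracking-host:53800") if dist.get_rank() == 0 else None

loss = 0.123  # placeholder metric
if run is not None:
    run.track(loss, name="loss", step=0)
```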
-
### Description
Hi,
This is the only Terraform script from @sfloresk I could find that gives a complete example of running distributed training workloads with GPUs on AWS:
https://github.com/aw…