-
This has been on my TODO list for a while; putting it here in case I forget.
-
### Bug description
I'm working on a Slurm cluster with 8 AMD MI100 GPUs distributed across 2 nodes, 4 GPUs per node. I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds…
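Not part of the report, but for context: with Lightning's DDP strategy, a 2-node x 4-GPU SLURM job is normally submitted with one task per GPU, and Lightning reads the `SLURM_*` environment itself when launched via `srun`. A minimal sketch — the script name `train.py` and the Trainer settings shown in the comment are assumptions, not taken from the issue:

```shell
#!/bin/bash
#SBATCH --nodes=2               # two nodes
#SBATCH --ntasks-per-node=4     # one task per GPU
#SBATCH --gpus-per-node=4       # 4 MI100s on each node

# Assumed Trainer config inside train.py:
#   Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy="ddp")
# Lightning picks up the SLURM environment when started with srun.
srun python train.py
```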
-
**System information**
- Have I written custom code: YES
- OS Platform and Distribution: CentOS 7.3
- TensorFlow installed from: pip
- TensorFlow version: 2.3.0
- Python version: 3.7.7
- CPU ON…
-
After reading some of the code, it's hard to fully understand how distributed training works. I guess `Experiments` is a wrapper that handles the distributed learning, but I'm not sure…
-
Thanks for your excellent work!
But I encountered some problems in training the KITTI dataset. I used two NVIDIA Gerforce 2080ti for training, and set --multiprocessing_distributed==True, --do_ onli…
-
I'm getting this any time I run any command (post-training merge):
```
[2024-06-25 20:02:11,013] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to mps (auto detect)
W0625 …
-
### Search before asking
- [X] I have searched the YOLOv8 [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/ultralytics/ultralytics/discussions) and fou…
-
My machines used for multi-node training do not allow an ssh service.
How can I launch multi-node training using the most basic torchrun command (torch.distributed.launch)?
The servers which I use …
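For what it's worth, `torchrun` itself does not need ssh: you run the same command on every node by hand (or via your scheduler), pointing all of them at one rendezvous host. A sketch, assuming 2 nodes with 4 GPUs each, a rendezvous address of 10.0.0.1, and a training script `train.py` — all of which are assumptions, not details from the question:

```shell
# On node 0 (also acting as the rendezvous host, assumed to be 10.0.0.1):
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
         --master_addr=10.0.0.1 --master_port=29500 train.py

# On node 1, the identical command except for --node_rank:
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 \
         --master_addr=10.0.0.1 --master_port=29500 train.py
```

The two invocations rendezvous over TCP on the given address and port, so no ssh connection between the nodes is ever opened by the launcher.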
-
```
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-13-21-05 (UTC+0000), pid 2003528, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/t…
-
I want to fine-tune the Pythia-6.9B language model on a dataset. Training requires about 90GB of vRAM, so I need more than 1 GPU. (I use 3 A100 GPUs, each with 40GB vRAM.) I am trying to do th…
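As a back-of-envelope check (my numbers, not from the issue): plain mixed-precision Adam fine-tuning needs roughly 16 bytes per parameter — 2 for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master copy plus the two Adam moments — before activations are counted:

```python
def finetune_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory estimate for mixed-precision Adam fine-tuning.

    bytes_per_param = 2 (fp16 weights) + 2 (fp16 grads)
                      + 12 (fp32 master copy + Adam m and v states).
    Activations and framework overhead are NOT included.
    """
    return n_params * bytes_per_param / 1e9

total = finetune_memory_gb(6.9e9)
print(f"~{total:.0f} GB for weights/grads/optimizer")          # ~110 GB
print(f"~{total / 3:.0f} GB per GPU if fully sharded across 3 A100s")
```

So the ~90GB figure is in the right ballpark for states alone, and the 3x40GB budget is only plausible if those states are sharded across GPUs (e.g. DeepSpeed ZeRO or PyTorch FSDP); plain DDP replicates the full ~110GB on every GPU and cannot fit.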