-
Thank you for your excellent work. You used a single V100 GPU for training. Will the programme support distributed training? We are trying to use multiple 4090 GPUs on the same machine to repeat the e…
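For what it's worth, a minimal single-machine multi-GPU DDP sketch (the model, data, and script name in the launch command are placeholders, not the repository's actual code) looks roughly like this:

```python
# Minimal single-machine DDP sketch; launch with:
#   torchrun --standalone --nproc_per_node=NUM_GPUS train_ddp.py   (script name is hypothetical)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):                        # placeholder training loop
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()                           # DDP all-reduces gradients here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```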
-
Thanks for your code!
Could you share the scripts for DDP training?
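In case it is useful while waiting for official scripts, here is a sketch of the data-loading side that a DDP script typically adds (the dataset is a placeholder; run under torchrun so rank and world size are set):

```python
# Data-loading side of a DDP script: DistributedSampler gives each rank a disjoint shard.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))  # placeholder data
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(5):
    sampler.set_epoch(epoch)   # reshuffle consistently across ranks each epoch
    for x, y in loader:
        ...                    # forward/backward with the DDP-wrapped model as usual
```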
-
I'd like to propose a feature for implementing fail-safe mechanisms and partial redundancy in FSDP2 (at which point it is arguably no longer plain FSDP, but more like HSDP) to allow for more robust training on unreliable compute …
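For context, a sketch of how HSDP-style sharding (shard within a node, replicate across nodes) is expressed with the existing FSDP API; this is only the baseline such a fail-safe/redundancy feature would extend, not the proposed feature itself, and the model here is a placeholder:

```python
# HSDP-style baseline via FSDP's HYBRID_SHARD strategy (sketch).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()  # placeholder model
# HYBRID_SHARD keeps a full parameter replica on every node while sharding within the
# node, which is the kind of partial redundancy the issue refers to.
model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```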
-
Hello author, would it be possible to add features for recovery (resumable) training and DDP training? Also, can training from scratch reach your reported level of accuracy?
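As an interim workaround for the recovery part, a generic checkpoint save/resume sketch (the file name and state layout are placeholders, not the author's format):

```python
# Generic "recovery training" pattern: save a checkpoint periodically, resume after failure.
import os
import torch

CKPT = "checkpoint.pt"   # placeholder path

def save_checkpoint(model, optimizer, epoch):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0                                # nothing to resume from, start at epoch 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1                   # resume from the next epoch
```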
-
### Background
In distributed training scenarios, RNG initialization matters for ensuring correct model initialization and, in some cases, also for controlling random ops during training (e.g. dropout)…
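As a concrete illustration of those two concerns, a common seeding pattern in DDP scripts (a sketch of the convention, not the policy this issue proposes) uses one shared seed for parameter initialization and a rank-offset seed for training-time randomness such as dropout:

```python
# Shared seed for init, rank-offset seed for training-time randomness (sketch).
import torch

def build_model_with_seeds(base_seed: int, rank: int):
    # rank would normally come from dist.get_rank() or the RANK env var
    torch.manual_seed(base_seed)               # identical weights on every rank at init
    model = torch.nn.Linear(128, 10)           # placeholder model built under the shared seed
    torch.manual_seed(base_seed + rank)        # diverge afterwards so dropout masks differ per rank
    torch.cuda.manual_seed_all(base_seed + rank)
    return model
```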
-
### Search before asking
- [x] I have searched the YOLOv8 [issues](https://github.com/ultralytics/ultralytics/issues) and found no similar bug report.
### YOLOv8 Component
Train, Multi-GPU
### Bu…
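For reference, multi-GPU YOLOv8 training is normally launched as below (the model, dataset, and GPU indices are placeholders); whether this reproduces the reported bug depends on the truncated details above.

```python
# Passing a device list makes Ultralytics spawn one DDP process per GPU.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                     # placeholder model weights
model.train(data="coco128.yaml", epochs=3, imgsz=640, device=[0, 1])
```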
-
Hi Torch Team,
I am currently experimenting with native torch float8 distributed training using the delayed scaling recipe on GPT 1.5B with DDP at batch=12 seq=1024 on an HGX 8xH100 (700W H100 SXM …
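A rough sketch of that setup, assuming torchao's float8 training entry point (`convert_to_float8_training`); the delayed-scaling recipe configuration is omitted here and the model is a small stand-in for GPT 1.5B:

```python
# Float8 training + DDP sketch (assumes torchao's float8 API; config details omitted).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchao.float8 import convert_to_float8_training

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in model; real experiment uses GPT 1.5B, batch=12, seq=1024.
model = torch.nn.Sequential(*[torch.nn.Linear(2048, 2048) for _ in range(4)]).cuda()
convert_to_float8_training(model)              # swap nn.Linear modules for float8 linears
model = DDP(model, device_ids=[local_rank])
```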
-
pytorch: 1.3.1
python: 3.6
system: ubuntu 16
cuda: 10.0
When I run the ImageNet main.py across multiple nodes, there is an error like the following (a single node runs fine):
Use GPU: 1 for training
Use GPU: 0 for training
…
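For multi-node runs, every process needs the same MASTER_ADDR/MASTER_PORT and a unique global rank; a rough sketch of env:// initialization (the address, port, and sizes are placeholders):

```python
# env:// initialization for a multi-node run (sketch; values are placeholders).
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "10.0.0.1")   # IP of the rank-0 node
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=int(os.environ["WORLD_SIZE"]),      # total number of processes
    rank=int(os.environ["RANK"]),                  # unique global rank per process
)
```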
-
### Bug description
I am using the default configs, code, and data to train a model within the BioNeMo framework. The timeout occurs in the middle of training.
### What version are you seeing the p…
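One common stopgap for mid-training collective timeouts is to raise the process-group timeout when initializing distributed training; whether that helps here depends on the actual root cause of the hang. A minimal sketch (the two-hour value is arbitrary):

```python
# Raise the NCCL process-group timeout as a stopgap for mid-training collective timeouts.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```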
-
Thank you for sharing this fantastic work.
Since I do not have a SLURM cluster, is there DDP training code available?
Or can anyone help?
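In case it helps: DDP itself does not require SLURM; on a single machine torchrun can spawn and wire up the processes. A minimal sketch (the script name and GPU count are placeholders):

```python
# Launching DDP without SLURM: torchrun spawns the processes and sets RANK,
# LOCAL_RANK, and WORLD_SIZE, so no scheduler is needed.
#   torchrun --standalone --nproc_per_node=4 train.py    # train.py is a placeholder name
import os
import torch.distributed as dist

dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} / {dist.get_world_size()} on LOCAL_RANK={os.environ['LOCAL_RANK']}")
```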