constan1 opened 1 year ago
Hi, how long did it take to train this model? I am currently training my own implementation on a DGX cluster of 4 V100s with DeepSpeed integrated, using gradient accumulation of 4 over a micro-batch size of 10, for an effective batch size of 160. It is still taking very, very long.

In my case it was 2 weeks on 2 RTX 3090s; your 4 V100 cluster should be powerful enough to finish training from scratch in comparable time.

Is there a specific training loss/validation loss you used as a benchmark for convergence?
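For reference, the batch-size arithmetic above can be sketched as a minimal DeepSpeed batch configuration. This is an assumption-laden sketch, not the poster's actual config: the key names (`train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `train_batch_size`) are standard DeepSpeed config fields, and the values are taken from the numbers quoted in the question (micro-batch 10, accumulation 4, 4 GPUs).

```python
# Sketch of the batch-size portion of a DeepSpeed config matching the
# numbers in the thread. DeepSpeed enforces:
#   train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size
micro_batch_per_gpu = 10  # micro batch size per GPU (from the question)
grad_accum_steps = 4      # gradient accumulation steps (from the question)
num_gpus = 4              # 4 V100s (from the question)

effective_batch = micro_batch_per_gpu * grad_accum_steps * num_gpus

ds_config = {
    "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    "gradient_accumulation_steps": grad_accum_steps,
    "train_batch_size": effective_batch,  # 10 * 4 * 4 = 160
}

print(effective_batch)  # → 160
```

If these three values are ever inconsistent, DeepSpeed raises an error at initialization, so it is worth checking this product matches the intended effective batch size.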