Saltb0xApps opened 12 months ago
Curious how it will go. What's the approximate training time estimate for a small/medium model with 200k files on 16xA100s?
@Zlikster it's currently taking 2 weeks to reach epoch 256 on a small 437M param model with 200k files on 4xA100 80GB.
Going faster with 16xA100 80GB is pretty expensive, roughly $50/h for 16xA100. Let me know how it went :)
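A back-of-envelope sketch of what the 16xA100 run might cost, assuming the 2-week/4-GPU baseline above and near-linear data-parallel scaling (real runs rarely scale perfectly, so treat this as a lower bound on wall-clock time):

```python
# Rough cost estimate for scaling the run from 4 to 16 A100s.
# Assumes near-linear data-parallel scaling, which is optimistic.

def scaled_cost(baseline_hours, baseline_gpus, target_gpus, price_per_hour):
    """Return (wall_clock_hours, total_cost) under ideal linear scaling."""
    wall_hours = baseline_hours * baseline_gpus / target_gpus
    return wall_hours, wall_hours * price_per_hour

# 2 weeks on 4xA100 -> 16xA100 at ~$50/h for the whole 16-GPU node
hours, cost = scaled_cost(baseline_hours=14 * 24, baseline_gpus=4,
                          target_gpus=16, price_per_hour=50)
print(f"~{hours:.0f} h wall clock, ~${cost:.0f} total")  # ~84 h, ~$4200
```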
Hi there, have you tried training medium/large models yet? I wonder how many A100 cards you used to train a large model.
Hi, I'm currently retraining MusicGen with 200k audio files. I have access to 4xA100 80GB GPUs, which is enough to train a 420.37M param model (small config) out of the box with this command -
dora run -d solver=musicgen/musicgen_base_32khz
My GPU memory usage currently looks like this (screenshot omitted) -
If I increase the model size any further, i.e., try the medium/large configs, the GPUs run out of memory and I get a CUDA out-of-memory crash.
I've tried using the
fsdp.use=true autocast=false
overrides, but that results in the reported model size becoming 105.09M params instead of 420.37M. I'm going to get access to 16xA100 80GB GPUs soon and would like to train models with more than 400M parameters. Is there anything I can do to fit larger models on the existing hardware?
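One possible explanation for the smaller number (an assumption, not confirmed by the source): FSDP shards parameters across ranks, so each GPU may report only its local shard rather than the full model. With 4 GPUs, the arithmetic matches exactly:

```python
# Sanity check: if FSDP shards the 420.37M-param model evenly across 4 ranks,
# each rank holds roughly 1/4 of the parameters -- matching the 105.09M reported.
total_params_m = 420.37  # full model size (small config), in millions
world_size = 4           # number of A100 GPUs

per_rank_m = total_params_m / world_size
print(f"{per_rank_m:.2f}M params per rank")  # 105.09M params per rank
```

If that's what's happening, the model itself is unchanged, and FSDP (plus activation checkpointing) is exactly the kind of technique that lets the medium/large configs fit on the same hardware.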