Saltb0xApps opened 12 months ago
Curious how it will go. What's the approximate training time estimate for a small/medium model with 200k files on 16xA100s?
@Zlikster it's currently taking 2 weeks to reach epoch 256 on a small 437M param model with 200k files on 4xA100 80GB.
Going faster with 16xA100 80GB is pretty expensive, roughly $50/h for 16xA100. Let me know how it went :)
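A back-of-envelope sketch of what the 16xA100 run might cost, assuming the 2-week/4-GPU baseline above and near-linear data-parallel scaling (real runs rarely scale perfectly, so treat this as a lower bound on wall-clock time):

```python
# Rough cost estimate for scaling the run from 4 to 16 A100s.
# Assumes near-linear data-parallel scaling, which is optimistic.

def scaled_cost(baseline_hours, baseline_gpus, target_gpus, price_per_hour):
    """Return (wall_clock_hours, total_cost) under ideal linear scaling."""
    wall_hours = baseline_hours * baseline_gpus / target_gpus
    return wall_hours, wall_hours * price_per_hour

# 2 weeks on 4xA100 -> 16xA100 at ~$50/h for the whole 16-GPU node
hours, cost = scaled_cost(baseline_hours=14 * 24, baseline_gpus=4,
                          target_gpus=16, price_per_hour=50)
print(f"~{hours:.0f} h wall clock, ~${cost:.0f} total")  # ~84 h, ~$4200
```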
Hi there, have you tried training medium/large models yet? I wonder how many A100 cards you used to train a large model.
Hi, I'm currently retraining MusicGen with 200k audio files. I have access to 4xA100 80GB GPUs, which is enough to train a 420.37M param model (small config) out of the box with this command -
dora run -d solver=musicgen/musicgen_base_32khz
My GPU memory usage currently looks like this (screenshot omitted) -
If I increase the model size any further, i.e., try the medium/large configs, the GPUs run out of memory and I get a CUDA out-of-memory crash.
I've tried using the
fsdp.use=true autocast=false
overrides, but that results in the reported model size becoming 105.09M params instead of 420.37M. I'm going to get access to 16xA100 80GB GPUs soon and would like to train models with more than 400M parameters. Is there anything I can do to fit larger models on the existing hardware?
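One possible explanation for the smaller number (an assumption, not confirmed by the source): FSDP shards parameters across ranks, so each GPU may report only its local shard rather than the full model. With 4 GPUs, the arithmetic matches exactly:

```python
# Sanity check: if FSDP shards the 420.37M-param model evenly across 4 ranks,
# each rank holds roughly 1/4 of the parameters -- matching the 105.09M reported.
total_params_m = 420.37  # full model size (small config), in millions
world_size = 4           # number of A100 GPUs

per_rank_m = total_params_m / world_size
print(f"{per_rank_m:.2f}M params per rank")  # 105.09M params per rank
```

If that's what's happening, the model itself is unchanged, and FSDP (plus activation checkpointing) is exactly the kind of technique that lets the medium/large configs fit on the same hardware.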