facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

Training medium/large complexity models on 4xA100 80GB GPUs #357

Open Saltb0xApps opened 7 months ago

Saltb0xApps commented 7 months ago

Hi, I'm currently retraining MusicGen with 200k audio files. I have access to 4xA100 80GB GPUs, which is enough to train a 420.37M param model (small config) out of the box with this command: `dora run -d solver=musicgen/musicgen_base_32khz`

My GPU memory usage currently looks like this -

(screenshot of current GPU memory usage, 2023-11-26)

If I increase the model complexity beyond this, i.e. switch to the medium/large configs, the GPUs run out of memory and training crashes with a CUDA out-of-memory error.
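
For reference, the medium/large runs are just the same command with the LM scale overridden (I'm going by the `model/lm/model_scale` override described in the MusicGen training docs, so treat the exact flags below as a sketch):

```bash
# Small config (this fits on 4xA100 80GB):
dora run -d solver=musicgen/musicgen_base_32khz

# Medium / large: same solver with the LM scale overridden
# (model/lm/model_scale per the MusicGen training docs).
# Both of these runs OOM on 4xA100 80GB for me:
dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=medium
dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=large
```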

I've tried the `fsdp.use=true autocast=false` params, but that results in the reported model size becoming 105.09M params instead of 420.37M.

I'll be getting access to 16xA100 80GB GPUs soon and would like to train models with more than 400M parameters. Is there anything I can do to fit larger models on the existing hardware?
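
In case it helps anyone else hitting this, here's roughly what I'm planning to try next. The `dataset.batch_size` key and its default are taken from the base solver config as I understand it, so treat the exact values as guesses rather than a recipe:

```bash
# FSDP shards the parameters across GPUs, so the 105.09M reported with
# fsdp.use=true looks like a per-shard count (420.37M / 4 GPUs ≈ 105.09M)
# rather than the model actually shrinking.
# Sharding plus a smaller global batch size is the obvious lever to try:
dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=medium \
    fsdp.use=true autocast=false \
    dataset.batch_size=64  # base config default is 192, total across GPUs
```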

Zlikster commented 7 months ago

Curious how it will go. What's the approximate training time for a small/medium model with 200k files on 16xA100s?

Saltb0xApps commented 7 months ago

@Zlikster it's currently taking 2 weeks to get to epoch 256 on a small 437M param model with 200k files on 4xA100 80GB.

Zlikster commented 7 months ago

Pretty expensive to go faster with 16xA100 80GB. Approx. $50/h is the price for 16xA100s. Let me know how it went :)

hheavenknowss commented 6 months ago

Hi there, have you tried training a medium/large model yet? I wonder how many A100 cards you used to train a large model.