facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Absurdly slow training? #388

Open DEBIHOOD opened 5 months ago

DEBIHOOD commented 5 months ago

I've been playing around with training my own model: unconditional, initialized from scratch (dim 512, num_heads 8, num_layers 8: 33.57 M params total), with a context size of 6 seconds (300 tokens), on a dataset that I collected (133 hours of music). But the model progresses absurdly slowly. I've been training for the past 20 hours, and the generated samples sound like it's learning almost nothing (or at least they sound awful).
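For context on the scale involved, here's a back-of-the-envelope sketch of what those numbers mean in tokens. It assumes the 32 kHz EnCodec tokenizer runs at 50 frames per second with 4 codebooks, which is what the "6 seconds = 300 tokens" figure implies, so treat the constants as assumptions rather than measured values:

```python
# Back-of-the-envelope token math (assumed: 50 frames/s and 4 codebooks for the 32 kHz tokenizer).
FRAME_RATE = 50          # EnCodec frames per second of audio (assumption)
NUM_CODEBOOKS = 4        # parallel codebooks per frame (assumption)

segment_seconds = 6
context_frames = segment_seconds * FRAME_RATE          # 300 frames -> the "300 tokens" context
tokens_per_segment = context_frames * NUM_CODEBOOKS    # 1200 codebook tokens predicted per segment

dataset_hours = 133
dataset_frames = dataset_hours * 3600 * FRAME_RATE     # ~24M frames of training audio

print(context_frames, tokens_per_segment, dataset_frames)  # 300 1200 23940000
```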

I need to clarify that I don't have much experience training transformers, but in Andrej Karpathy's nanoGPT he claims that within just 3 minutes of training a newly initialized character-level language model he was able to get some coherent-looking text: "we're training a GPT with a context size of up to 256 characters, 384 feature channels, and it is a 6-layer Transformer with 6 heads in each layer. On one A100 GPU this training run takes about 3 minutes and the best validation loss is 1.4697"

Besides that, I've also trained an LM with LLaMA's architecture using the popular portable tool llama.cpp. It has a train-text-from-scratch example that lets you train (on CPU) a newly initialized model with your own settings (dim, heads, layers) on a .txt dataset file. I set up the same architecture (dim 512, num_heads 8, num_layers 8, ctx 300), and after 20 minutes of training on CPU(!) I got a basic model that imitated the style of the text it was trained on (I just used a random JSON file as the dataset).

Here's the command I'm using to start the training: dora run solver=musicgen/musicgen_base_32khz model/lm/model_scale=my_nano dset=fullen/fullen133h dataset.batch_size=2 conditioner=none optim.ema.use=false dataset.num_workers=4 checkpoint.save_every=10 optim.updates_per_epoch=2000 dataset.valid.num_samples=2 dataset.evaluate.num_samples=2 dataset.generate.num_samples=10 generate.every=5 autocast=false dataset.min_segment_ratio=1.0 dataset.segment_duration=6 schedule.lr_scheduler=null optim.optimizer=adamw optim.lr=3e-4
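Plugging those flags into the same kind of arithmetic gives a rough sense of how much data each epoch actually covers. This is only a sketch: it assumes every update consumes batch_size full 6-second segments and ignores anything the solver does beyond that, which may not match its exact behaviour:

```python
# Rough per-epoch coverage implied by the command-line flags (assumptions noted inline).
FRAME_RATE = 50                     # assumed EnCodec frame rate (frames/s)
batch_size = 2                      # dataset.batch_size=2
updates_per_epoch = 2000            # optim.updates_per_epoch=2000
segment_frames = 6 * FRAME_RATE     # dataset.segment_duration=6 -> 300 frames per sample

frames_per_epoch = batch_size * updates_per_epoch * segment_frames   # 1.2M frames per epoch
audio_hours_per_epoch = frames_per_epoch / FRAME_RATE / 3600         # ~6.7 hours of audio per epoch

epochs = 120
total_hours_seen = epochs * audio_hours_per_epoch                    # ~800 hours, i.e. ~6 passes over 133 h

print(frames_per_epoch, round(audio_hours_per_epoch, 1), round(total_hours_seen))
# 1200000 6.7 800
```

Under those assumptions, 120 epochs at batch size 2 amounts to only a handful of passes over the dataset, and batch size 2 is far smaller than what large-scale music LM training runs typically use.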

And these are the samples it's producing after 120 epochs, alongside some of the dataset examples: generated_samples_epoch120.zip. The GPU I'm training on is a GTX 1060 6GB. (I know, but still, are 20 hours of training for this small model "just not enough"?)

I just want to know why it evolves so slowly, when the other two examples from my limited experience of training transformers were making good progress after just ~30 minutes, yet with MusicGen I'm getting nowhere after 20 hours.

The graphs seem to look fine; as I said, it's just super slow. Tensorboard

tomhaydn commented 1 month ago

Hey @DEBIHOOD, I'm just reaching out as a stab in the dark to ask how you structured your dataset. The docs aren't very clear, and you seem to have managed to get it training on your own dataset. If you have a few minutes, could you please assist me with my issue:

https://github.com/facebookresearch/audiocraft/issues/462

DEBIHOOD commented 1 month ago

@tomhaydn Sure thing! I think I still have the dataset files on my PC, or at least the config files for the dataset, so I'll respond with a proper explanation once I dig them up. Regarding my own problem with training being slow: I haven't done many more tests since then, but I guess that's just how transformers are in general, and even this config, "about the size of an insect's brain" (34M params), is going to be slow to train without a few high-end GPUs. The success of the other things I mentioned in my issue comes down to the fact that there a single word is tokenized into maybe 1-2 tokens, so it's easier to get something coherent at that token scale, whereas MusicGen operates at 50 tokens per second of generated audio, so it has to learn much more structured, long-range relationships between tokens before it starts to sound anything like music/your dataset.
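To make that gap concrete, here's a rough illustration of how much "content" the same 300-position context window covers for a subword text LM versus a 50-frames-per-second audio LM. The tokens-per-word figure is an assumed ballpark, not a measurement:

```python
# Illustrative comparison of what a 300-position context covers (assumed, rough numbers).
context_positions = 300

# Text LM: assume ~1.3 subword tokens per word on average (typical BPE ballpark, assumption).
words_covered = context_positions / 1.3          # ~230 words, i.e. several sentences of structure

# Audio LM at 50 frames per second: the same context is only a few seconds of music.
seconds_covered = context_positions / 50         # 6 seconds

print(round(words_covered), seconds_covered)     # 231 6.0
```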

Just out of curiosity, what hardware are you planning to train it on, and what about your dataset? (How many hours of data? Conditional or not?)

tomhaydn commented 1 month ago

Hi @DEBIHOOD, thanks so much for your response. I have paid access to >500GB of VRAM. I'm initially testing on 100 hours of music, then plan to increase it to 800 hours of open-source music plus 200 hours of licensed music that I have exclusive access to.

I'm not concerned with full conditional generation (i.e. free-form prompt-based generation), but I do need conditioning in some respect, similar to your requirements, i.e. using some combination of artist name, key, and tempo so that I can prompt with those when generating.
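For that kind of conditioning, the per-track metadata sidecar files are the natural place to put artist/key/tempo. Below is a sketch of what one such .json next to an audio file might contain; the field names follow my recollection of the example dataset shipped with the repo, so double-check them against egs/example before relying on them (the file name and values are hypothetical):

```python
# Hypothetical per-track metadata sidecar, written next to "track_0001.wav" as "track_0001.json".
# Field names mirror (as best I recall) the repo's example dataset; verify against egs/example.
import json

metadata = {
    "title": "My Track",
    "artist": "Some Artist",        # artist-name conditioning
    "key": "C minor",               # musical key
    "bpm": 124,                     # tempo
    "genre": "electronic",
    "moods": ["dark", "driving"],
    "description": "dark driving electronic track in C minor at 124 bpm",
    "keywords": "electronic, dark, 124bpm",
    "instrument": "Mix",
    "sample_rate": 32000,
    "duration": 180.0,
    "name": "track_0001",
    "file_extension": "wav",
}

with open("track_0001.json", "w") as f:
    json.dump(metadata, f)
```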

Since your other response I've managed to get the model training (yet to test the output), but it is incredibly slow. The docs suggest ~30 minutes per epoch is standard, which, if they completed the full 500 epochs, would have taken roughly ten days of training.

I'd love to connect and discuss some of my ideas if you're up for it!

DEBIHOOD commented 1 month ago

@tomhaydn

I'd love to connect and discuss some of my ideas if you're up for it!

Sure! Do you think we should create another issue for that, or do you have something else in mind? There's some stuff about training that I've discovered in my own experiments that will probably be helpful.