enhuiz / vall-e

An unofficial PyTorch implementation of the audio LM VALL-E
MIT License

How to Pretrain on LibriTTS #24

Open raikarsagar opened 1 year ago

raikarsagar commented 1 year ago

Hi,

It's great to see an implementation of such recent work. I was able to set up training with custom data for a single speaker. Here are some of my queries:

  1. What sample rate is supported for the training set? The synthesized audio seems to be 24 kHz and single channel? In emb/qnt.py only the first channel is chosen, but there isn't any check on the sample rate.
  2. For pretraining on multi-speaker LibriTTS data, what number of epochs and batch size do you recommend?
  3. Can we pretrain on a single-speaker dataset, e.g. LJSpeech?
  4. Is direct fine-tuning on limited single-speaker data recommended? Any suggestions here would help.

Thanks in advance, Sagar

enhuiz commented 1 year ago

Hi Sagar,

  1. I'm using the 24k EnCodec model, which is a mono-channel model.

https://github.com/enhuiz/vall-e/blob/3476d393d2133fa9b50d5ad999ca13b95fc22060/vall_e/emb/qnt.py#L22

All audio will be resampled to the model's sampling rate (i.e., 24 kHz):

https://github.com/enhuiz/vall-e/blob/3476d393d2133fa9b50d5ad999ca13b95fc22060/vall_e/emb/qnt.py#L63
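For illustration, the preparation roughly amounts to the following (a minimal sketch using the encodec and torchaudio packages, not the repo's exact code; the file path is a placeholder):

```python
# Minimal sketch (not the repo's exact code): prepare a clip for the 24 kHz
# mono EnCodec model -- keep the first channel and resample to 24 kHz.
import torch
import torchaudio
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()  # mono, 24 kHz
model.set_target_bandwidth(6.0)             # e.g. 6 kbps -> 8 codebooks

wav, sr = torchaudio.load("sample.wav")     # placeholder path; shape (channels, samples)
wav = wav[:1]                               # keep only the first channel
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))       # list of (codes, scale) frames
codes = torch.cat([c for c, _ in frames], dim=-1)  # (1, n_quantizers, T)
```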

  2. It's not clear to me yet. The authors suggest the following in their paper:

> The models are trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6k acoustic tokens per GPU for 800k steps. We optimize the models with the AdamW optimizer, warm up the learning rate for the first 32k updates to a peak of 5 × 10⁻⁴, and then linear decay it.

If you have enough computational power, it's better to use their configuration.
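For reference, that warm-up-then-linear-decay schedule could be sketched in PyTorch roughly as follows (an illustration of the quoted hyperparameters only, not this repo's training loop):

```python
# Illustration only: AdamW with a 32k-step linear warm-up to 5e-4, then a
# linear decay to 0 over 800k total steps, as quoted from the paper.
import torch

peak_lr, warmup_steps, total_steps = 5e-4, 32_000, 800_000

model = torch.nn.Linear(1024, 1024)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps  # linear warm-up
    # linear decay from the peak down to 0 at total_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```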

3 & 4. It's hard to give suggestions, as I haven't tried what you describe. But for cloning a speaker's voice, a dataset with more speakers sounds better, since you want the model to learn from the prompt.

raikarsagar commented 1 year ago

Hi @enhuiz ,

Thanks for the input. Have you tried any experiments fine-tuning on a particular speaker? And are there plans to release a pretrained model?

TechyChan commented 1 year ago

@enhuiz Thanks for this amazing implementation! Regarding "16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6k acoustic tokens per GPU", does this mean we should specify batch_size: 6000 in the ar.yml file? I tried that, but even on an NVIDIA A100, I don't think there's enough memory for it...

Can you share the config you've had the most luck with when training on LibriTTS so far? I trained to 200k steps with a smaller batch_size, but the results are not very good.
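For scale, here is a back-of-the-envelope reading of the paper's figure, assuming the 6k tokens are counted as frames of the 24 kHz EnCodec model, which runs at 75 frames per second (the paper doesn't say whether it counts frames or individual codebook entries). If batch_size in ar.yml counts utterances rather than tokens, 6000 would indeed be far too large.

```python
# Back-of-the-envelope only, assuming "acoustic tokens" means EnCodec frames.
frames_per_second = 75                 # frame rate of the 24 kHz EnCodec model
tokens_per_gpu = 6_000                 # paper's per-GPU batch size in tokens
seconds_per_gpu = tokens_per_gpu / frames_per_second          # 80.0 s of audio
avg_utterance_seconds = 10             # assumed average clip length
utterances_per_gpu = seconds_per_gpu / avg_utterance_seconds  # ~8 clips per GPU
print(seconds_per_gpu, utterances_per_gpu)
```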

tfriedel commented 1 year ago

How long would training take, for example in the 16x V100 case?

raikarsagar commented 1 year ago

Hi, I was able to train the AR and NAR models on the LibriTTS dataset for 180k steps each. But the audio quality is way off in terms of pronunciation. Could you please point out any major configuration changes we need to check? I'm not sure what is being missed.

Thanks in advance, Sagar

davtoro commented 1 year ago

@raikarsagar

You are missing around 59.5k hours of training data. Read the paper: the main thing about VALL-E, besides using an audio codec, is the amount of data it was trained on, which is around 100x more than previous architectures. And you have no guarantee that the code in this repository will replicate VALL-E's performance, as this is an unofficial implementation that could be missing "magic sauce" omitted from the paper.

MarkWuNLP commented 1 year ago

> How long would the training take, for example in the 16x V100 case?

It takes 4 days with 16x 32G V100. I did not train it with LibriTTS, but LibriLight-small with 6000 hours of data seems good for VALLE training.

rpowalski commented 1 year ago

@MarkWuNLP is there any chance you could share the weights for the model you trained on LibriLight-small?

d-warfield commented 1 year ago

@MarkWuNLP it would be so awesome if you could share the weights and your code - just want to test and see the quality

MisakaMikoto96 commented 1 year ago

Hi, I tried training on a small single-speaker dataset (about 4 hours). I found it very difficult to fit the data and could not get intelligible output.

JaejinCho commented 1 year ago

> How long would the training take, for example in the 16x V100 case?
>
> It takes 4 days with 16x 32G V100. I did not train it with LibriTTS, but LibriLight-small with 6000 hours of data seems good for VALLE training.

@MarkWuNLP Thanks for sharing the time estimate for training. BTW, was that 4 days of training with 6k hours of data or with the whole 60k hours of Libri-Light?

airpdev commented 1 year ago

Hello, I think this is a very useful implementation. But the model I trained on some data generates only noise, not voice. Could you let me know what the reason might be? Looking forward to hearing from you. Regards, Petar

agupta54 commented 1 year ago

The LibriLight-small data is only 577 hours. And if it is clipped according to the voiced-chunk timestamps given with the dataset, it reduces to 183 hours. Am I missing something here?

CopyNinja1999 commented 1 year ago

How long does loading the data usually take for you? I have approximately 3000 hours of data, and loading it takes more than an hour. Any ideas to speed it up?

DonkeyRats commented 1 year ago

@MarkWuNLP would also be very nice if you can share your weights and code. Not everyone has access to the ridiculous amount of computational power required to train such models

DamascusGit commented 1 year ago

> @MarkWuNLP would also be very nice if you can share your weights and code. Not everyone has access to the ridiculous amount of computational power required to train such models

i have the compute, just not the data. happy to host GPUs to train this with anyone interested.

jonathanrbarney commented 1 year ago

> @MarkWuNLP would also be very nice if you can share your weights and code. Not everyone has access to the ridiculous amount of computational power required to train such models
>
> i have the compute, just not the data. happy to host GPUs to train this with anyone interested.

Are you looking for the LibriTTS dataset? You can find it here: https://www.openslr.org/60/

DamascusGit commented 1 year ago

> @MarkWuNLP would also be very nice if you can share your weights and code. Not everyone has access to the ridiculous amount of computational power required to train such models
>
> i have the compute, just not the data. happy to host GPUs to train this with anyone interested.
>
> Are you looking for the Libri TTS dataset? You can find there here: https://www.openslr.org/60/

Looking for a dataset with many more hours than this; I believe VALL-E used 60k hours. Of course, I don't imagine anyone can just throw in a link to a 60k-hour dataset, but something far larger than ~600 hours would be great.

cparks1 commented 1 year ago

> @MarkWuNLP would also be very nice if you can share your weights and code. Not everyone has access to the ridiculous amount of computational power required to train such models
>
> i have the compute, just not the data. happy to host GPUs to train this with anyone interested.
>
> Are you looking for the Libri TTS dataset? You can find there here: https://www.openslr.org/60/
>
> Looking for a dataset with many more hours than this, i believe VALL-E was 60k hours. of course i don't imagine anyone can just throw in a link with a 60k hour dataset, but something far larger than ~600 hours would be great.

@DamascusGit https://github.com/facebookresearch/libri-light/tree/main/data_preparation Libri-Light large is very close to 60k hours.

cparks1 commented 1 year ago

Does anyone know how long training VALL-E with LibriLight Small (577 audio hrs, 183 audio hrs if clipped to timestamps, 35 GB data) might take if you were training with 8x NVIDIA A100 40 GB VRAM?

I'm trying to determine what it would cost to run training on the cloud.

Using Lambda Labs' Tacotron 2 training benchmarks, the V100 has a speedup factor of 7.37 at 8x and 3.57 at 4x. It seems that doubling the GPU count approximately doubles the speedup factor, so 16x V100 would have a speedup factor of approximately 15.17.

The A100 x8 has a speedup factor of 13.75.

According to @MarkWuNLP it took 4 days to train on LibriLight Small with x16 V100.

Using the speedup factors provided in Lambda Labs' benchmarks, the A100 x8 would run at 13.75/15.17 = 0.90 times the speed of x16 V100, meaning it would take approximately 4.4 days, or approximately 105.9 hours.

I was hoping someone more experienced with ML could tell me whether using these benchmarks for a rough estimate like this is valid.
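A quick script that just reproduces the arithmetic above, using only the numbers already quoted in this thread:

```python
# Reproduces the rough estimate above; all figures come from this thread.
v100_x16_speedup = 15.17   # extrapolated 16x V100 speedup factor
a100_x8_speedup = 13.75    # Lambda Labs 8x A100 speedup factor
baseline_days = 4          # 16x V100 training time reported by @MarkWuNLP

relative_speed = a100_x8_speedup / v100_x16_speedup  # ~0.9 of the 16x V100 speed
estimated_days = baseline_days / relative_speed      # ~4.4 days
print(relative_speed, estimated_days, estimated_days * 24)  # ~106 hours
```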

caleb272 commented 1 year ago

Does anyone have a public model?

cantabile-kwok commented 1 year ago

I also cannot train this implementation effectively on LibriTTS, despite much effort modifying the code and experimenting. 10+ days on multiple GPUs training a quarter-size model on the LibriTTS-clean subset is still not enough to obtain reasonable performance. The generated audio is just babbling, and none of the words are pronounced clearly.

Usually a medium-sized model should reach reasonable performance on LibriTTS-clean, and the training should not be that hard. It is indeed strange that so many people are having trouble training.

anuraagvaidya commented 1 year ago

Can we crowdfund the training of this? I can pitch in a few hundred. On Vast, an 8x A100 is $8.746/hr.

DamascusGit commented 1 year ago

> Can we crowdfund the training of this? I can pitch in a few hundred. On Vast 8X A100 is $8.746/hr.

Yes, happily. Happy to coordinate on Discord.

thisserand commented 1 year ago

Just found this one and thought it could be of interest for all of you as well: https://github.com/lifeiteng/vall-e/issues/58

p-w-rs commented 1 year ago

I am a professor at a university (MSOE) and would be willing to train this for free (I am not an expert in the deep learning domain, though) if our cluster can handle it in a reasonable amount of time, given some conditions:

  1. The model weights are given freely for others to use.
  2. MSOE is attributed for the "donation" of compute.
  3. We do some sort of smaller-scale training first so that we have high confidence we will achieve good results (I don't want to take a couple of days of cluster time and end up with nothing).
  4. We are able to monitor the training to be reasonably confident it is actually improving over time.
  5. I would have to run the code myself, do any installation of programs, etc., since I can't give cluster access to anyone else.

I think it would be really great to have readily usable open-source implementations, since most people will never have access to large amounts of compute. Here are our cluster details: https://msoe.dev/#/about

thisserand commented 1 year ago

> I am a professor at a university (MSOE) and would be willing to train this for free [...]

@p-w-rs Thanks so much for offering the compute to provide a pre-trained version of the VALL-E model to the community. There is another repository that has implemented the VALL-E model, and it already has a checkpoint file that was trained for 100 epochs on the LibriTTS dataset and achieved good results. However, its text-to-speech generation capabilities are still worse than those of the official VALL-E model. Personally, I think that further fine-tuning the model (from that checkpoint file) using part or all of the Libri-Light dataset would be a better use of the offered compute, since a checkpoint trained on LibriTTS already exists. The corresponding thread and checkpoint files can be found here: https://github.com/lifeiteng/vall-e/issues/58

p-w-rs commented 1 year ago

@thisserand Sorry, what I meant is that I would be willing to run training for someone's codebase using PyTorch on the full Libri-Light (60k hours) data. I don't know how long that would take, and I would be fine starting from scratch or fine-tuning a checkpoint like lifeiteng's.

I can also do TensorFlow but personally use PyTorch. One of the nodes has 8x Tesla V100-SXM2 GPUs. We could also use more than one node if anyone has experience training across nodes. I know it is possible, but I haven't done it myself, and I don't know how much it helps.

thisserand commented 1 year ago

@p-w-rs Unfortunately, I lack the time to put all the pieces together, but maybe someone else is willing to contribute :-) As guidance, I would recommend having a look at the Libri-Light dataset setup (https://github.com/facebookresearch/libri-light/tree/main/data_preparation). The other VALL-E repository also already has examples for setting up the LibriTTS and LJSpeech datasets (and for training a model on them), which might help further. If I find some time, I will definitely take a look at it, but maybe someone else is willing to contribute in the meantime :-)