facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

fine-tuning MMS TTS models #5184

Open taalua opened 1 year ago

taalua commented 1 year ago

Hi,

How do I fine-tune the MMS TTS models? I used the default VITS code; however, I hit an issue when resuming from the existing optimizer state dict:

    in adamw
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    RuntimeError: The size of tensor a (38) must match the size of tensor b (178) at non-singleton dimension 0

Please help. Thanks.

chevalierNoir commented 1 year ago

@taalua This is probably due to a mismatch in vocabulary between the original VITS code and ours. The vocabulary VITS uses is hard-coded here and is used to build the symbol-to-id mapping, while we use a different vocabulary per language, specified in vocab.txt. You can use this to get the text mapping for MMS TTS models and use its id_to_symbol instead.
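A minimal sketch of building such a mapping, assuming vocab.txt holds one symbol per line and the line index is the id (the exact file layout may differ):

```python
def load_vocab(path):
    """Build symbol<->id mappings from a vocab file with one symbol per line.

    Only trailing newlines are stripped, so a line containing a single
    space (a legitimate symbol) is preserved.
    """
    with open(path, encoding="utf-8") as f:
        symbols = [line.rstrip("\n") for line in f]
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    id_to_symbol = {i: s for s, i in symbol_to_id.items()}
    return symbol_to_id, id_to_symbol
```

If the fine-tuning code uses this per-language mapping instead of the hard-coded VITS symbols, the embedding sizes should line up with the pre-trained checkpoint.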

patrickvonplaten commented 1 year ago

BTW, we're working on making this very easy in transformers. You can check:

ravsau commented 1 year ago

Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has language ID and ASR but no TTS.

chevalierNoir commented 1 year ago

> Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has language ID and ASR but no TTS.

Most of the VITS code remains unchanged. You only need to define the vocabulary of the new language (i.e., a list of characters used in the new language) and use that as the symbols here.
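As a hedged sketch (not the official tooling), one way to derive such a symbol list is to collect the distinct characters from the new language's training transcripts:

```python
def build_symbols(transcripts):
    """Derive a sorted character vocabulary from an iterable of transcript lines.

    The result can serve as the VITS `symbols` list for a new language;
    deduplication and cleanup rules are left to the user.
    """
    return sorted({c for line in transcripts for c in line})

# e.g.: symbols = build_symbols(open("train.txt", encoding="utf-8"))
```

Sorting makes the symbol order deterministic across runs, which matters because token ids are derived from list positions.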

CopyNinja1999 commented 1 year ago

> Is there a guide on adding a TTS language? I'm thinking of adding Nepali, which has language ID and ASR but no TTS.
>
> Most of the VITS code remains unchanged. You only need to define the vocabulary of the new language (i.e., a list of characters used in the new language) and use that as the symbols here.

@chevalierNoir The eng model works with a random discriminator checkpoint, but I hit this error when fine-tuning the kor model:

packages/bitsandbytes/optim/optimizer.py", line 455, in update_step
    if state["state1"].dtype == torch.float:
KeyError: 'state1'

I can't figure out why the two models don't behave the same way. The main difference, from my perspective, is whether the checkpoint has pre-trained optimizer states or not.
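One possible workaround is to strip the saved optimizer states from the checkpoint so the wrapped optimizer re-initializes from scratch. This is only a sketch; the `"optimizer"` key name is an assumption about the checkpoint layout, and discarding optimizer state changes the fine-tuning dynamics:

```python
import torch

def strip_optimizer_state(in_path, out_path):
    """Save a copy of a checkpoint without its optimizer states.

    Assumes the checkpoint is a dict with an "optimizer" key; if the key
    is absent, the checkpoint is copied unchanged.
    """
    ckpt = torch.load(in_path, map_location="cpu")
    ckpt.pop("optimizer", None)
    torch.save(ckpt, out_path)
```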

chevalierNoir commented 1 year ago

@CopyNinja1999 Did you download the full model checkpoint (including generator, discriminator, and optimizer states) for fine-tuning, as suggested here? The eng and kor checkpoints should have the same format.

CopyNinja1999 commented 1 year ago

@chevalierNoir Thanks for your reply! I found out that this error was caused by the bnb optimizer wrapper from this repo: https://github.com/nivibilla/efficient-vits-finetuning. As for the full model checkpoint, yes, I tested it yesterday using the romanizer https://github.com/osori/korean-romanizer and this dataset: https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset. The synthesized audio became pure noise after fine-tuning (though fine-tuning works for the eng model). Do you have any hint why?

CopyNinja1999 commented 1 year ago

BTW, what romanizer do you use for all the languages?

CopyNinja1999 commented 1 year ago

Update: resampling the audio from 44 kHz to 22.05 kHz fixed it.
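For reference, a minimal resampling sketch using scipy (file paths and the wav layout are placeholders; tools like sox or ffmpeg work just as well):

```python
from math import gcd

from scipy.io import wavfile
from scipy.signal import resample_poly

def resample_wav(in_path, out_path, target_sr=22050):
    """Resample a wav file to target_sr using polyphase filtering.

    The up/down ratio is reduced by the gcd, so 44100 -> 22050 becomes
    a clean 1:2 decimation. Output is cast back to the input dtype.
    """
    sr, wav = wavfile.read(in_path)
    g = gcd(target_sr, sr)
    out = resample_poly(wav, up=target_sr // g, down=sr // g, axis=0)
    wavfile.write(out_path, target_sr, out.astype(wav.dtype))
```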

chevalierNoir commented 1 year ago

> BTW, what romanizer do you use for all the languages?

In case it's needed, our romanizer is this. Note that we only apply uromanization for ~5 languages with large character vocabularies; otherwise, using raw characters achieves slightly better performance.

patrickvonplaten commented 1 year ago

We now have a super simple fine-tuning script in Transformers: https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#connectionist-temporal-classification-with-adapters

andergisomon commented 1 year ago

> We now have a super simple fine-tuning script in Transformers: https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#connectionist-temporal-classification-with-adapters

How about TTS?

patrickvonplaten commented 1 year ago

Working on it cc @sanchit-gandhi

sanchit-gandhi commented 1 year ago

Adding the models first to the library in https://github.com/huggingface/transformers/pull/24085, then will add training functionality in a second step 🤗

qunash commented 1 year ago

> Adding the models first to the library in huggingface/transformers#24085, then will add training functionality in a second step 🤗

Can't wait! 😀

taalua commented 1 year ago

Hi @chevalierNoir @patrickvonplaten

For English, I see '1', '5', and '6' in the vocabulary list. What does each of them mean? Also, what's the difference between '–' (en dash) and '_'?

FYI, English vocabulary: ['k', "'", 'z', 'y', 'u', 'd', 'h', 'e', 's', 'w', '–', '3', 'c', 'p', '-', '1', 'j', 'm', 'i', ' ', 'f', 'l', 'o', '0', 'b', 'r', 'a', '4', '2', 'n', '_', 'x', 'v', 't', 'q', '5', '6', 'g']
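For illustration, here is a symbol-to-id mapping built from the list quoted above (the authoritative ordering comes from the model's vocab.txt, and `text_to_ids` is a hypothetical helper, not fairseq API):

```python
# English MMS vocabulary as quoted above; ids follow list positions.
symbols = ['k', "'", 'z', 'y', 'u', 'd', 'h', 'e', 's', 'w', '–', '3', 'c',
           'p', '-', '1', 'j', 'm', 'i', ' ', 'f', 'l', 'o', '0', 'b', 'r',
           'a', '4', '2', 'n', '_', 'x', 'v', 't', 'q', '5', '6', 'g']
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def text_to_ids(text):
    """Map lowercase text to ids, dropping characters not in the vocabulary."""
    return [symbol_to_id[c] for c in text.lower() if c in symbol_to_id]
```

Note that the vocabulary treats '–' (en dash) and '-' (hyphen) as distinct symbols, so input text must be normalized consistently with whatever the model was trained on.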

Thanks

kdcyberdude commented 1 year ago

Any update on this?

Salama1429 commented 1 year ago

Please post an update about adding training functionality @sanchit-gandhi

arbianqx commented 11 months ago

Any update on this? @sanchit-gandhi

owos commented 11 months ago

Just stopping by for an update. Could anyone please point me to a TTS fine-tuning codebase?

ylacombe commented 9 months ago

I've released a repository that enables VITS/MMS fine-tuning with transformers compatibility: https://github.com/ylacombe/finetune-hf-vits. Feel free to check it out :hugs: