Delightful TTS implementation

erogol commented 2 years ago

👑 @loganhart420 is going to do the heavy lifting !!!

We can discuss here how we want to go about it.

dunky11 commented 2 years ago

Thanks for opening the issue! Pre-training the model will take me roughly one more week. Afterward, I will refactor the code, and get the project into a usable state, then I will implement this into coqui, so it will probably take me 6+ weeks.

Some info about the model: It's based upon DelightfulTTS with some modifications. Many components of the model weren't fully explained in the paper. They especially didn't get into details about both the phoneme and utterance level prosody encoders as well as hyperparameters used, so their implementations were heavily influenced by Comprehensive-Transformer-TTS. The model also uses a different scheme to provide both language and speaker embeddings. The scheme of DelightfulTTS may have worked for the blizzard challenge but didn't work when using more speakers/languages.

For the G2P model, I used DeepPhonemizer, which implements Transformer-based Grapheme-to-Phoneme Conversion, and increased the parameter count to ~23M. A single G2P model is trained on the global phone set of Montreal Forced Aligner in the following languages:

English
German
French
Castilian Spanish
Russian
Polish
Bulgarian
Czech
Croatian
European Portuguese
Swedish
Thai
Turkish
Ukrainian

Also, I increased the parameter count of DelightfulTTS to ~120M, otherwise, it would underfit the dataset. The dataset is ~20% stuff from public datasets like LibriTTS (100h and 360h split) and VCTK and ~80% stuff crawled by me. If you want to see some statistics about the dataset, you can click here.

The purpose of the model is to be fine-tuned on smaller datasets. It should provide a way to create TTS models in languages with limited data. It can also be used to code-switch: Since the model was pre-trained in English, German, French, Spanish, Russian and Polish voices it can be used to fine-tune the model on an English voice and then make it speak the other languages.

The parameter count may seem intimidating for a TTS model, but it can be fine-tuned without a problem on 6GB of VRAM using gradient accumulation. Also since the architecture is FastSpeech-based, and avoids autoregression, both training and inference are relatively quick and stable.

UnivNet is used as vocoder, but any vocoder that shares its STFT configuration should do its job.

In the future, I will further increase the size of the dataset, especially for the languages which contain no data yet. I also plan on further increasing the size of the model, since the current model still underfits the dataset. Then I will try to create a smaller model using knowledge distillation.

I really hope the model turns out fine, I will probably fine-tune it tomorrow on an English single-speaker dataset to check how well it speaks in the other languages, even though it hasn't fully pre-trained, since I'm always impatient :)

If you are interested in the progress, check out VoiceSmith (still a WIP) which provides a GUI to fine-tune the multilingual model and preprocess multilingual TTS datasets.

dunky11 commented 2 years ago

I fine-tuned the model on the voices of Twilight Sparkle (~6000 samples, My Little Pony) and Demoman (~500 samples, Team Fortress 2) now. There is definitely still a lot of work that needs to be done but I think it shows that it's possible to pre-train a model in a bunch of languages and then make it speak languages not seen in the fine-tuning dataset.

Original: Twilight: https://vocaroo.com/1dI8o6NZqi9e Demoman: https://vocaroo.com/1dokyJI1Kt0u

English (in pre-training dataset, in fine-tuning dataset): Twilight: https://vocaroo.com/19M1jbIR8aDW Demoman: https://vocaroo.com/1g0pnFJ3jqNK

German (in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/15jWTR3wpSiA Demoman: https://vocaroo.com/1aInIEAfiJgb

French (in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/16cc8wjg8UKn Demoman: https://vocaroo.com/1czkTPaQm8ir

Spanish (in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1oUORqiRiJOY Demoman: https://vocaroo.com/1iXKvd5GaXlf

Russian (in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1h6k4mUdMQ0u Demoman: https://vocaroo.com/1doXkCsPyoM5

Polish (in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1eklCtLEuqPg Demoman: https://vocaroo.com/12om7gCZH5Zd

The languages below should not work since the model has not seen them in both pre-training and fine-tuning, I will include them anyway.

Bulgarian (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1dC5qGhPn4op Demoman: https://vocaroo.com/1epSvuv04Tug

Czech (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1f8yMHYHvFwo Demoman: https://vocaroo.com/19zHsVku5My1

Croatian (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1dAlyJqd6pNj Demoman: https://vocaroo.com/1o0qrTSGiMwb

European Portuguese (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1d9K6XegabUO Demoman: https://vocaroo.com/1h7jCNhbuAoG

Swedish (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/15ma7yF5OhZa Demoman: https://vocaroo.com/1cP4I6JOsfqX

Thai (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1lIZW5rT8sbc Demoman: https://vocaroo.com/12YFcNHHnDzH

Turkish (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/1h56wternaDp Demoman: https://vocaroo.com/1gIOQTY6mPne

Ukrainian (not in pre-training dataset, not in fine-tuning dataset): Twilight: https://vocaroo.com/150bIrTIDLaa Demoman: https://vocaroo.com/14PmdpuAHOKj

dunky11 commented 2 years ago

I also noticed the multilingual G2P stuff and unusual phone set (phone set from Montreal Forced Aligner) will probably make it a pain to implement this into coqui. It's probably better to implement the English-only version which was trained on ARPABET, I will develop that one alongside the multilingual one anyway. I have a colab for inference of that one here.

erogol commented 2 years ago

Samples above are impressive. TR samples are like someone German speaking TR :)

Can't we just use espeak for G2P? What are the benefits of using a neural G2P model?

dunky11 commented 2 years ago

Just noticed coqui already has support for multiple languages, that is nice. It doesn't really matter which G2P model we use, we just need a way to extract the phoneme durations using a forced aligner for example. I see you implemented FastSpeech and FastPitch. How did you extract phoneme durations? Or did you differ from the original implementation using unsupervised durations (since you implemented https://arxiv.org/pdf/2108.10447.pdf)?

erogol commented 2 years ago

We learn durations unsupervised in different ways one of which is that paper. It is called AlignerNet in 🐸TTS.

lucasjinreal commented 2 years ago

Hello, guys. I saw there so many languages, but just not Chinese, does there any plan to support Chinese?

neurlang commented 2 years ago

Czech is understandable.

erogol commented 2 years ago

👑 @loganhart420 will continue implementing this.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

lucasjinreal commented 2 years ago

Does it avaiable for now

loganhart02 commented 2 years ago

Does it avaiable for now

I'm currently training the pretrained models, the pr to follow along is here: https://github.com/coqui-ai/TTS/pull/2095

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

erogol commented 1 year ago

Still WIP by @loganhart420

agilebean commented 1 year ago

Can someone please summarize the differences between Voicesmith by @dunky11 and the current repo of TTS by coqui-ai?

ludwikbukowski commented 1 year ago

Polish is awesome! When would it be ready? Can I download the model even if its WIP ?

loganhart02 commented 1 year ago

Polish is awesome! When would it be ready? Can I download the model even if its WIP ?

you can clone the branch right now and train from scratch. I've got a working ljspeech and vctk model working so it should work for single and multi speaker datasets.