DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

What all needs to be done to disable grapheme to phoneme conversions? #5

Closed · michael-conrad closed this issue 2 years ago

michael-conrad commented 2 years ago

What all needs to be done to disable grapheme to phoneme conversions?

Flux9665 commented 2 years ago

So you want to use Character Embeddings rather than Phoneme Embeddings? In that case you can change this line https://github.com/DigitalPhonetics/IMS-Toucan/blob/931e4ce63a4cc675cb15b72474a3c3619632a07b/Preprocessing/TextFrontend.py#L131 to the following:

phones = text

And exchange the phones in https://github.com/DigitalPhonetics/IMS-Toucan/blob/931e4ce63a4cc675cb15b72474a3c3619632a07b/Preprocessing/ipa_list.txt for all the characters that you want to accept as input. The first line needs to remain the padding token, though.
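For illustration, a character-based version of ipa_list.txt might then look roughly like this, with one accepted input character per line (the padding symbol shown here is just a stand-in for whatever the original first line contains):

<pad>
a
b
c
d
e
...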

And then when you instantiate a model, you need to specify the new number of possible inputs as the model's idim. For that you can take the number of lines in the ipa_list.txt file + 1.
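A minimal sketch of that step in Python (the path is relative to the repo root, and the model class name below is a placeholder, not the exact IMS-Toucan class):

# count the accepted input symbols and add 1 for the idim, as described above
with open("Preprocessing/ipa_list.txt", encoding="utf8") as symbol_file:
    number_of_symbols = len(symbol_file.readlines())

idim = number_of_symbols + 1

# model = SomeToucanAcousticModel(idim=idim, ...)  # hypothetical instantiation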

Generally, phoneme embeddings work much better than pure character embeddings, so I wouldn't recommend switching to characters. If you want to change it to make it work for a different language, consider that espeak supports phonemizing in 126 languages at this point. You just need to exchange the shorthand of the phonemizer for that of the new language. A bunch of them are already added, but there are a lot more you can look up here: https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md
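For example, switching to another espeak-ng language shorthand could look roughly like this with the phonemizer package (German "de" is used purely as an illustration; the exact call site inside TextFrontend.py may differ):

from phonemizer import phonemize

# phonemize text with a different espeak-ng language shorthand
ipa = phonemize("guten Morgen", language="de", backend="espeak")
print(ipa)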

michael-conrad commented 2 years ago

Thanks for the help.

espeak-ng does not correctly handle Cherokee. Even if I were to try again to fix the issues it has with improper stress marks, I do not think I would be able to get it to produce IPA for the tonal portion of the audio. I generally consider espeak-ng broken for non-Latin languages, especially for those which don't use stress, such as Cherokee.

The orthography I plan on using is called the "Modified Community" orthography and properly indicates cadence (vowel length) and tone.

Do you think it would be possible to fine tune an existing model and keep the previous voice embeddings intact while adding new voices?

Flux9665 commented 2 years ago

I see, very interesting language. The Modified Community orthography should work; adding lexical tone markers as characters in the sequence has worked for Chinese with the Pinyin writing system.

Maybe this could be an alternative to espeak for non-Latin languages that still uses IPA? https://aclanthology.org/P16-1038.pdf

Do you think it would be possible to fine tune an existing model and keep the previous voice embeddings intact while adding new voices?

It should be possible to fine-tune an existing model, but if you fine-tune across writing systems you'd need to either map the new sounds to their closest counterparts in IPA (just place them in the corresponding lines in the ipa_list.txt file), or load the encoder and decoder parts of an existing model but replace the embedding layer with a new one and train that one from scratch.
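A rough PyTorch sketch of that second option, assuming the checkpoint is a plain state dict and that the input-embedding parameters start with "encoder.embed" (both the key prefix and the model class below are placeholders, not the exact IMS-Toucan names):

import torch

checkpoint = torch.load("pretrained_tts.pt", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)

# drop the old input-embedding weights so a fresh embedding layer is trained
filtered = {key: value for key, value in state_dict.items() if not key.startswith("encoder.embed")}

# new_model = ToucanTTSModel(idim=new_symbol_count)   # hypothetical class and argument
# new_model.load_state_dict(filtered, strict=False)   # everything else is loaded, the embedding trains from scratch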

Do you mean voice embeddings as in multiple voices in multispeaker TTS? In the setup we use here, we provide the voice embedding as input for multispeaker models, so you would just fine-tune the conditioning. Multispeaker is, however, really challenging, and we are moving away from multispeaker entirely in a future update. 30 minutes of single-speaker data is enough to fine-tune a high-quality single-speaker TTS, so multispeaker TTS doesn't seem that necessary: https://arxiv.org/abs/2110.05798

michael-conrad commented 2 years ago

So, having multiple voices available would require multiple "donor" models?

Flux9665 commented 2 years ago

Yes, one model per voice. The quality is much better that way, so I think it's worth it.

michael-conrad commented 2 years ago

Do you have a guesstimate on when the new approach will be published?

Flux9665 commented 2 years ago

Unfortunately not; we have to wait for an anonymity period to run out. But you can do the same with the current state of the toolkit. The update just removes the multispeaker stuff and changes some things about the way inputs are handled, which will be incompatible with what you want to do anyway. The current state of the toolkit will remain in another branch for backwards compatibility, and the main branch will change to contain the new additions.

So if you want to train a Cherokee model, I suggest you train a model with the Modified Community orthography on a related language for which you have a single-speaker dataset with at least 5 hours available. Then you take this model as a starting point and fine-tune on the most data you have from a single speaker in Cherokee. If you have around 30 minutes it should probably work pretty well. It should also work with less, but it gets trickier the less you have.