Takaaki-Saeki / zm-text-tts

[IJCAI'23] Learning to Speak from Text for Low-Resource TTS
Apache License 2.0
64 stars 2 forks

Interpretation of Monolingual CER #2

Closed iamanigeeit closed 10 months ago

iamanigeeit commented 1 year ago

Hello,

Thanks for uploading the pretrained model!

I have a method to manipulate TTS output using inference tricks to produce unseen languages, and I'm applying it to English-to-Mandarin. It is difficult to compare results because I use no text or audio data in the target language, only phoneme conversion and pitch/duration editing. The source model is also monolingual (English only).

How should I interpret the "Monolingual" CER in your paper? I am wondering why the CER is so high, given that this setting is just standard TTS training from IPA input to audio output. My manipulated English model achieves a better Mandarin CER (computed on pinyin) than all of the "IPA monolingual" languages except de.
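For context, here is a minimal sketch of how a pinyin-based CER could be computed from a reference text and an ASR transcript of the synthesized audio. This is only an illustration of my setup, not the paper's evaluation code; it assumes the `pypinyin` and `jiwer` packages and uses toneless pinyin for simplicity:

```python
# Illustrative pinyin-based CER computation (not the paper's evaluation script).
# Assumes the pypinyin and jiwer packages; tones are dropped for simplicity.
from pypinyin import lazy_pinyin
import jiwer

def to_pinyin(text: str) -> str:
    """Convert Hanzi to a space-separated string of toneless pinyin syllables."""
    return " ".join(lazy_pinyin(text))

def pinyin_cer(reference: str, hypothesis: str) -> float:
    """CER over pinyin letters, so homophones do not count as errors."""
    ref = to_pinyin(reference).replace(" ", "")
    hyp = to_pinyin(hypothesis).replace(" ", "")
    return jiwer.cer(ref, hyp)

# Example: reference text vs. an ASR transcript of the synthesized audio.
print(pinyin_cer("你好世界", "你好世间"))  # ~0.18: only the last syllable differs
```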

iamanigeeit commented 1 year ago

I suspect Tacotron2 is hard to train on limited data (<20 hrs of audio for most CSS10 languages), which causes output words to be skipped or repeated.

Takaaki-Saeki commented 1 year ago

Hi, thanks for the comments. As you mentioned, I also think the amount (and quality) of the training data is not sufficient to train the Tacotron2 model. If you are using a non-autoregressive model (e.g. FastSpeech 2 or VITS), your model might well show a better Mandarin CER than our monolingual IPA models, so it would be better to compare your results using the same model and the same experimental settings.