KevinMIN95 / StyleSpeech

Official implementation of Meta-StyleSpeech and StyleSpeech
MIT License

Cannot Reproduce quality of pretrained model #18

Closed chazo1994 closed 2 years ago

chazo1994 commented 2 years ago

I trained a StyleSpeech model on LibriTTS, but the quality is far worse than the author's pretrained StyleSpeech model. I used the default config and parameters and trained the model for 100k steps. The loss curves are shown below: (two images attached)

I have also uploaded audio samples, synthesized from the same text as the demo page, in the folder Train_LibriTTS_StyleSpeech of the attached file. There are always strange sounds at the end of each audio file, which I can't explain. meta_stylespeech_results.zip

KevinMIN95 commented 2 years ago

I think you should train the model for more steps. In my experiments, the mel loss decreased to 0.22~0.23.

chazo1994 commented 2 years ago

> I think you should train the model for more steps. In my experiments, the mel loss decreased to 0.22~0.23.

Thanks, I will continue training to 400k steps and report the results. But I am still confused: I used the same dataset and default parameters, and trained to 100k steps like the pretrained model, yet the loss and quality are worse.

chazo1994 commented 2 years ago

I found that the problem comes from Montreal Forced Aligner version 2.0.0a22 or newer, which does not put "sp" or "sil" in the phone tier. To fix this, just add the "--disable_textgrid_cleanup" flag during the alignment step.
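For anyone who wants to check their own alignments before retraining, here is a rough sketch of how one might test whether a TextGrid still carries silence labels. It is not part of this repo. It assumes the aligner wrote long-format TextGrid files, that the tier is named "phones", and that silence is marked with "sp" or "sil" as described above; `read_tier`, `has_silence_labels` and `SAMPLE` are made-up names for illustration.

```python
import re

# Silence labels named in this thread; adjust if your aligner uses others.
SILENCE_LABELS = {"sp", "sil"}

# Matches one interval entry in a long-format TextGrid.
_INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"'
)


def read_tier(textgrid_text, tier_name):
    """Return [(start, end, label), ...] for the named tier, or [] if absent."""
    # Each tier begins with a header such as `item [1]:`; split on those headers.
    for section in re.split(r"item \[\d+\]:", textgrid_text):
        name = re.search(r'name = "([^"]*)"', section)
        if name and name.group(1) == tier_name:
            return [
                (float(start), float(end), label)
                for start, end, label in _INTERVAL_RE.findall(section)
            ]
    return []


def has_silence_labels(textgrid_text, tier_name="phones"):
    """True if any interval in the tier is labeled as silence."""
    return any(
        label in SILENCE_LABELS for _, _, label in read_tier(textgrid_text, tier_name)
    )


# Tiny hand-written TextGrid used only to exercise the functions above.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.75
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "phones"
        xmin = 0
        xmax = 0.75
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.25
            text = "sil"
        intervals [2]:
            xmin = 0.25
            xmax = 0.5
            text = "HH"
        intervals [3]:
            xmin = 0.5
            xmax = 0.75
            text = "AY1"
'''

if __name__ == "__main__":
    print(has_silence_labels(SAMPLE))
```

If this returns False for most files in the aligned corpus, the alignments were probably produced without the flag and need to be regenerated before preprocessing.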

KevinMIN95 commented 2 years ago

Oh, thanks for finding the problem! I should update the repo to work with newer versions of the libraries.