Mispronunciation during adaptation

Hello,

I have been trying to adapt tacotron2 to a smaller dataset after pretraining it on a larger one. The baseline model (before adaptation) synthesizes speech with the right pronunciation. I then ran adaptation on the smaller dataset (about 60 minutes of audio) with a lower learning rate. After adaptation, the style of the adaptation data is replicated well, but certain words are not pronounced correctly.

To narrow this down, I moved the mispronounced words to different positions in the sentence to check whether the surrounding context was causing the mispronunciation. For example, when synthesizing the sentence "It's more than just a sandy waste", the word "waste" is pronounced as "laste". I then synthesized another sentence where "waste" occurs at a different position: "Tons of waste covered the sandy beach". Here the word "waste" is pronounced correctly. Next, I synthesized several more sentences in which "waste" appears at the end. In all of these cases, the word was mispronounced.
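In case it helps anyone reproduce the test, here is a rough sketch of how the probe sentences can be grouped by where the target word falls, so each group can be synthesized and compared by ear. The helper names `word_position` and `group_by_position` are made up for illustration, the third sentence is a hypothetical addition, and the synthesis step is left out since it depends on your own inference setup.

```python
import re

def word_position(sentence, word):
    """Return 'start', 'middle', or 'end' for the first occurrence of word, else None."""
    # Lowercase and keep apostrophes so "It's" stays a single token.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if word not in tokens:
        return None
    idx = tokens.index(word)
    if idx == len(tokens) - 1:
        return "end"
    if idx == 0:
        return "start"
    return "middle"

def group_by_position(sentences, word):
    """Bucket sentences by the position of the target word."""
    groups = {"start": [], "middle": [], "end": []}
    for sentence in sentences:
        position = word_position(sentence, word)
        if position is not None:
            groups[position].append(sentence)
    return groups

sentences = [
    "It's more than just a sandy waste",      # from the test above
    "Tons of waste covered the sandy beach",  # from the test above
    "Waste covered the sandy beach",          # hypothetical extra probe
]

groups = group_by_position(sentences, "waste")
for position, items in groups.items():
    print(position, items)
```

Synthesizing each group separately should make it easier to tell whether the error depends on the word being sentence-final rather than on the neighboring words.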

I can provide more details if you are interested. Any insights into what might be causing this issue would be greatly appreciated :)

Thank you!

NVIDIA / tacotron2