CookiePPP / cookietts

[Last Updated 2021] TTS from Cookie. Messy and experimental!
BSD 3-Clause "New" or "Revised" License

Guided attention usage while training #34

Closed kannadaraj closed 3 years ago

kannadaraj commented 3 years ago

Hi..

I see that you have implemented guided attention loss by forcing the alignment toward a diagonal. Isn't this a lossy way of performing alignment? Instead, wouldn't it be better to pre-generate alignments using forced-alignment information and compute the loss against that ground-truth alignment graph? For example, as in the paper https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8703406, "Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis".

Does this approach result in better alignment learning, since it is simpler and more robust than the diagonal-forcing-based approach? I have used pre-aligned attention before, and I could add a pre-aligned-attention-based approach to this repo instead of diagonal forcing. What are your thoughts?
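For context, the diagonal-forcing loss being discussed can be sketched as below. This is a minimal NumPy sketch following the standard guided-attention formulation (Tachibana et al.); the function names are illustrative and not taken from this repo:

```python
import numpy as np

def guided_attention_weight(N, T, g=0.2):
    """Penalty matrix W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).
    Near-zero on the diagonal, approaching 1 far from it."""
    n = np.arange(N)[:, None] / N   # encoder (text) positions, normalized
    t = np.arange(T)[None, :] / T   # decoder (spectrogram) positions, normalized
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_loss(attention, g=0.2):
    """Mean penalty over an (N, T) attention matrix:
    off-diagonal attention mass is punished, diagonal mass is (nearly) free."""
    N, T = attention.shape
    return float(np.mean(attention * guided_attention_weight(N, T, g)))
```

A perfectly diagonal alignment scores a loss of ~0, while an anti-diagonal one is penalized heavily, which is the "forcing" behavior the question refers to.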

CookiePPP commented 3 years ago

Pre-Aligned attention is definitely more effective than Diagonal attention loss for producing a stable model.

There are 3(?) reasons I haven't rushed to use PAG:

1. I want to use graphemes as well as phonemes, and all the forced-alignment systems I know of use phonemes, for obvious reasons.
2. Some of my datasets are noisy, and it turns out that letting Tacotron naturally attend to multiple tokens lets you use the alignment as an indicator of background noise in the spectrogram. Pretty useful for automatically reducing output noise.
3. It's really easy to just take multiple attempts at the same text and pick which ones are stable; much less effort than trying to interface with a forced aligner written in another language and probably not built for highly emotive speakers 😅

https://github.com/CookiePPP/cookietts/blob/master/CookieTTS/_5_infer/t2s_server/text2speech.py#L598 (here I use the alignment strength to pick which spectrogram has the least noise)
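That selection step could look something like the following. This is an illustrative sketch only; the linked `text2speech.py` may compute its score differently, and the names here are assumptions:

```python
import numpy as np

def alignment_strength(attention):
    """Average of the per-frame max attention weights for a
    (decoder_steps, encoder_tokens) matrix. Close to 1.0 when each
    frame attends sharply to a single token; lower when attention
    is smeared across tokens (e.g. due to background noise)."""
    return float(np.mean(attention.max(axis=1)))

def pick_best_attempt(attempts):
    """attempts: list of (spectrogram, attention) pairs from repeated
    inference on the same text. Returns the spectrogram whose alignment
    is sharpest."""
    return max(attempts, key=lambda a: alignment_strength(a[1]))[0]
```

The key idea is that repeated sampling is cheap at inference time, so ranking attempts by a scalar alignment score sidesteps the need for an external aligner entirely.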

kannadaraj commented 3 years ago

Cool, thanks for the detailed reply.
Just to understand: when you use both graphemes and phonemes, is each input sentence entirely graphemes or entirely phonemes, or do you also have mixed grapheme+phoneme representations within a single utterance? Like:
utt1: "this is great"
utt2: "DH IH S IH Z G R EY T"

Do you also have hybrid representations like:
utt3: "DH IH S _ IH Z _ great"

Also, in your experience, do lexical stress levels help, or would you suggest neglecting them?

CookiePPP commented 3 years ago

I use hybrid representations whenever the word is not in the pronunciation dictionary. https://github.com/CookiePPP/cookietts/tree/master/CookieTTS/dict


> Also, in your experience, do lexical stress levels help, or would you suggest neglecting them?

I have not tested this, not really sure.

kannadaraj commented 3 years ago

Thanks for your comments and suggestions.