as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

Confused about duration extraction #64

Closed · thanhlong1997 closed 2 years ago

thanhlong1997 commented 2 years ago

Hi, thank you for the great repo. I am trying to implement a multi-speaker version, since I already have a multi-speaker Tacotron 2 for alignment extraction. But when I extract the alignments and durations with `python train_tacotron.py --force_align`, I am confused: `mel_len` is the length of the ground-truth mel spectrogram, but the alignment matrix is the relation between the input text and the mel spectrogram *predicted* by Tacotron, so this `mel_len` will mismatch the alignment attention matrix. I saw that you use the line `path_probs = 1.-att[:mel_len, :]` to bring the attention matrix to the same length as `mel_len`, and this confuses me. Please explain why this `mel_len` is not the length of the predicted mel spectrogram?

Thank you, sir.

cschaefer26 commented 2 years ago

Hi, first of all good luck with the multi-speaker implementation. I have actually played around with it (branch `multispeaker`, very old). Regarding your question: the restriction to `mel_len` is usually there in case one wants to do batched inference (to remove the padding). For batch size = 1 the lengths should match.
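
For illustration, here is a minimal sketch of that trimming, assuming `att` is a batch-padded attention matrix of shape `(max_mel_len, text_len)` and `mel_len` is the true length of the ground-truth mel (variable names other than `path_probs` are illustrative, not the repo's exact code):

```python
import numpy as np

# Hypothetical attention matrix from batched teacher-forced decoding:
# rows index mel frames (padded to the batch max), columns index text tokens.
max_mel_len, text_len = 600, 80
mel_len = 512  # true length of this utterance's ground-truth mel
att = np.random.rand(max_mel_len, text_len)

# Frames beyond mel_len are padding; slicing them off makes the
# alignment matrix match the ground-truth mel length again.
path_probs = 1. - att[:mel_len, :]
assert path_probs.shape[0] == mel_len
```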

thanhlong1997 commented 2 years ago

Oh, I see. Your code extracts the alignment from Tacotron with a teacher-forced forward pass, not by running inference. That is why the alignment matrix has to be trimmed to `mel_len` elements, to remove the padding values. When I replaced your Tacotron model, I thought the alignment matrix had to be extracted by running inference, and I got the warning "Sum of durations did not match mel length". Now I know why that happens. Thank you for the explanation.
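
For anyone reading along, the consistency that warning checks can be seen in a rough argmax-based duration extraction (simpler than the repo's actual best-path extraction): with a teacher-forced attention matrix, every mel frame is assigned to exactly one text token, so the durations must sum to `mel_len`:

```python
import numpy as np

def durations_from_attention(att: np.ndarray) -> np.ndarray:
    """att: (mel_len, text_len) attention from a teacher-forced forward pass.
    Returns per-token durations measured in mel frames."""
    mel_len, text_len = att.shape
    # Assign each mel frame to the text token it attends to most strongly.
    assigned = att.argmax(axis=1)
    durations = np.bincount(assigned, minlength=text_len)
    # Each frame is counted exactly once, so the durations sum to mel_len.
    assert durations.sum() == mel_len
    return durations
```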

thanhlong1997 commented 2 years ago

Can I leave the issue open until I finish the implementation?

Thank you, sir.

cschaefer26 commented 2 years ago

Sure, keep me updated!

thanhlong1997 commented 2 years ago

Hi, I have successfully implemented a multi-speaker version of ForwardTacotron, and the results sound good. We are still using a pretrained Tacotron (in my version, Mellotron) to extract the alignment between characters and mel spectrograms, but I see that FastPitch and FastSpeech now use the Montreal Forced Aligner (MFA) to extract alignments. Have you tried it before? Would it be better or worse than extracting with Tacotron? I have tried using it in FastPitch, but the results are still poor.

cschaefer26 commented 2 years ago

Hi, I tried using the MFA for duration extraction before, and I found it to be slightly worse; there was also quite some fiddling involved in mapping the phonemes. It wasn't totally bad though.

thanhlong1997 commented 2 years ago

Thank you, sir. One more question: right now I am using graphemes to encode the text, since I have to work with multilingual data. Would using phonemes be better or not? In FastPitch too, I saw they use both graphemes and phonemes.

cschaefer26 commented 2 years ago

In my experience phonemes trump graphemes big time, because the bijective nature of phonemes is easier for the TTS model to learn. The result is much more stable pronunciation and prosody. For multilingual data, simply phonemize your metafile upfront and set `use_phonemes=False` in the config. You could check out https://github.com/as-ideas/DeepPhonemizer for a stable phonemizer (you might need to train your own phonemizer model for your needs, but it's probably worth it).
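
A sketch of phonemizing a metafile upfront with DeepPhonemizer, loosely following its README (the checkpoint name, language code, and pipe-separated metafile format are assumptions to adapt to your setup):

```python
from dp.phonemizer import Phonemizer

# Pretrained checkpoint from the DeepPhonemizer releases; for other
# languages you would train your own model.
phonemizer = Phonemizer.from_checkpoint('en_us_cmudict_ipa_forward.pt')

# Assumed LJSpeech-style metafile: "file_id|transcript" per line.
with open('metadata.csv', encoding='utf-8') as f_in, \
     open('metadata_phonemized.csv', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        file_id, text = line.strip().split('|', 1)
        f_out.write(f'{file_id}|{phonemizer(text, lang="en_us")}\n')
```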

thanhlong1997 commented 2 years ago

Thank you for your advice, sir. Since I switched from graphemes to phonemes, the results are promising. Now I would like to deploy this model to devices like mobile phones or cameras, but the model is large, so I have to compress it, and after compression the results are not good. Have you done this before? What can I do to deploy to a device and still keep the quality? Thank you, sir.
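
For what it's worth, one common first step for shrinking a PyTorch TTS model is dynamic quantization of its linear layers; a minimal sketch (the file names are placeholders, and whether quality survives has to be verified by listening tests):

```python
import torch

# Placeholder checkpoint; adapt the loading to your own saving format.
model = torch.load('forward_tacotron.pt', map_location='cpu')
model.eval()

# Convert Linear-layer weights to int8 (activations stay float32),
# which shrinks those layers roughly 4x on disk.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized, 'forward_tacotron_int8.pt')
```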

thanhlong1997 commented 2 years ago

Do you think non-autoregressive models like ForwardTacotron or FastSpeech 2 are simply not as good as autoregressive models like Tacotron 2? When I tried both ForwardTacotron and FastSpeech on the same dataset, I found that the generated audio is not as good as Tacotron 2's, even though they outperform Tacotron 2 in speed. I am trying to improve your model and FastSpeech 2 to be comparable to Tacotron 2, but it seems very hard to do.