as-ideas / TransformerTTS

🤖💬 Transformer TTS: Implementation of a non-autoregressive Transformer-based neural network for text-to-speech.
https://as-ideas.github.io/TransformerTTS/

Audio Alignment #36

Open · aayushkubb opened this issue 4 years ago

aayushkubb commented 4 years ago

Hey, what steps should we follow to align the audios (non-English)? I see there is a step called "Compute alignment dataset" which you use for the forward model.

What exactly does that help with? Also, there are two types of mels: one is predicted and the other is GT. If we are training from scratch, I assume we should add use_GT when running extract_durations.py?

cfrancesco commented 4 years ago

What do you mean exactly by aligning the audios? With the script extract_durations.py you generate a dataset for the forward model using the predictions of the autoregressive model. If you add the use_GT flag, you will use the ground-truth mels (extracted directly from the wavs) as targets for training the forward model; otherwise (recommended) you will use the predictions of the autoregressive model as targets. Hope this helps.
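For intuition, the core idea behind duration extraction from attention can be sketched as below. This is a minimal illustration of the general technique, not the repo's actual extract_durations.py implementation; the array shapes and the argmax-counting heuristic are assumptions.

```python
import numpy as np

def durations_from_attention(attn: np.ndarray, n_phonemes: int) -> np.ndarray:
    """Minimal sketch: derive per-phoneme durations from an attention map.

    attn: (mel_steps, text_steps) attention weights from the autoregressive
          model under teacher forcing, e.g. one head of the last decoder layer.
    Returns an integer array of length n_phonemes whose values sum to mel_steps.
    """
    # Assign each mel frame to the phoneme it attends to most strongly.
    assignments = attn.argmax(axis=1)                      # (mel_steps,)
    durations = np.bincount(assignments, minlength=n_phonemes)
    return durations

# Toy usage: 8 mel frames attending over 3 phonemes.
rng = np.random.default_rng(0)
attn = rng.random((8, 3))
attn /= attn.sum(axis=1, keepdims=True)                    # normalize rows
print(durations_from_attention(attn, n_phonemes=3))
```

The resulting per-phoneme durations are what the forward (non-autoregressive) model is trained to predict, which is why the quality of the attention alignment matters.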

aayushkubb commented 4 years ago

Hey, thanks for your reply, just one more check: if my autoregressive model is not good, how much impact will that have on the forward model?

In that case, what is more advisable to use: GT mels or predicted mels?

cfrancesco commented 4 years ago

Hi, to evaluate your autoregressive model FOR the alignment extraction, you have to look at the last-layer attention heads on your TRAINING SET. If these do not show significant jumps or collapses, then it will be OK, regardless of how good your out-of-set predictions are (because the training-set alignments are obtained with teacher forcing). According to the literature, predicted mels (which I believe is the equivalent of sequence-level knowledge distillation) are to be preferred.
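As a rough illustration, here is one way "jumps" and "collapses" in a teacher-forced attention map could be quantified. This is a hypothetical diagnostic, not code from the repo; the thresholds (max_jump, the 0.5 collapse fraction) are arbitrary assumptions for the sketch.

```python
import numpy as np

def attention_diagnostics(attn: np.ndarray, max_jump: int = 3) -> dict:
    """Minimal sketch: flag jumps and collapses in a teacher-forced
    attention map of shape (mel_steps, text_steps)."""
    path = attn.argmax(axis=1)                       # most-attended position per frame
    steps = np.diff(path)
    jumps = int((np.abs(steps) > max_jump).sum())    # large skips along the path
    # Collapse: a large fraction of frames stuck on a single text position.
    most_common_frac = np.bincount(path).max() / len(path)
    return {"jumps": jumps, "collapsed": bool(most_common_frac > 0.5)}

# Toy check on a roughly diagonal (healthy-looking) attention map.
mel_steps, text_steps = 40, 10
attn = np.zeros((mel_steps, text_steps))
attn[np.arange(mel_steps), np.minimum(np.arange(mel_steps) // 4, text_steps - 1)] = 1.0
print(attention_diagnostics(attn))                   # {'jumps': 0, 'collapsed': False}
```

In practice one would eyeball the plotted attention maps as well; a healthy alignment under teacher forcing looks roughly diagonal and monotonic, while jumps or a collapsed column indicate the extracted durations will be unreliable.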