Open · aayushkubb opened this issue 4 years ago

Hey, what steps should we use to align the audios (non-English)? I see there is something called "Compute alignment dataset" which you use for the forward model. What exactly does that help with? Also, there are two types of mels, predicted and GT. If we are training from scratch, I assume we should add use_GT when running extract_durations.py?

What do you mean exactly by aligning the audios? With the script extract_durations.py you generate a dataset for the forward model using the predictions of the autoregressive model. If you add the use_GT flag, the ground-truth mels (extracted directly from the wavs) are used as the target for training the forward model; otherwise (recommended) the predictions of the autoregressive model are used as the target. Hope this helps.
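To make the two options concrete, here is a minimal sketch of the choice being described. The function and argument names are hypothetical, chosen only to illustrate the logic, not the actual code of extract_durations.py:

```python
import numpy as np

def select_forward_target(gt_mel, predict_teacher_forced, use_gt=False):
    """Return the mel target used to train the forward model.

    gt_mel: ground-truth mel extracted from the wav, shape (frames, n_mels).
    predict_teacher_forced: callable giving the autoregressive model's
        teacher-forced prediction for the same utterance (hypothetical stand-in).
    """
    if use_gt:
        return gt_mel  # train the forward model on ground-truth mels
    # recommended: train on the autoregressive model's predictions
    # (sequence-level knowledge distillation)
    return predict_teacher_forced(gt_mel)

# Toy usage: a stand-in "model" that just blurs the GT mels a little.
gt = np.random.rand(120, 80)
fake_model = lambda mel: 0.5 * (mel + np.roll(mel, 1, axis=0))
target = select_forward_target(gt, fake_model, use_gt=False)
print(target.shape)  # (120, 80)
```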
Hey, thanks for your reply, just one more check: if my autoregressive model is not good, how much impact will that have on the forward model?
In that case, what is more advisable to use: GT mels or predicted mels?
Hi, to evaluate your autoregressive model FOR the alignment extraction, you have to look at the last-layer attention heads on your TRAINING SET. If these do not show significant jumps or collapses, then it will be OK, regardless of how good your out-of-set predictions are (because the training-set alignments are obtained with teacher forcing). According to the literature, predicted mels (which I believe corresponds to sequence-level knowledge distillation) are to be preferred.
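To illustrate what "significant jumps or collapses" can mean in practice, here is a rough heuristic one could run on a teacher-forced attention matrix from a training utterance. This is not code from the repo; the threshold and the specific checks are arbitrary assumptions:

```python
import numpy as np

def alignment_looks_usable(attn, max_jump=3):
    """Heuristic sanity check of one attention head for duration extraction.

    attn: attention weights of shape (decoder_frames, encoder_steps),
          taken from the last layer under teacher forcing.
    max_jump: arbitrary tolerance for how far the attended text position
          may move between consecutive frames.
    """
    path = attn.argmax(axis=1)            # most-attended text position per frame
    steps = np.diff(path)
    monotonic = np.all(steps >= 0)        # the path should move forward in the text
    no_big_jumps = np.all(np.abs(steps) <= max_jump)
    # "collapse": the head keeps attending to only a few positions
    covers_input = len(np.unique(path)) > attn.shape[1] // 2
    return bool(monotonic and no_big_jumps and covers_input)

# Toy check: a clean near-diagonal alignment passes, a collapsed one fails.
good = np.zeros((100, 50)); good[np.arange(100), np.arange(100) // 2] = 1.0
bad = np.zeros((100, 50)); bad[:, 0] = 1.0
print(alignment_looks_usable(good), alignment_looks_usable(bad))  # True False
```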