chazo1994 opened this issue 4 years ago
@chazo1994 I tried to follow the idea. Here is my repo, and the working(?) Colab file.
I take the alignments from the LJSpeech pretrained model. In the future I will try the Montreal Forced Aligner or Mellotron to generate alignments, since the LJSpeech model does not produce very accurate alignments and will decrease quality when used with different speakers.
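For concreteness, this is roughly how the alignments can be pulled out; a minimal sketch assuming the NVIDIA tacotron2 repo layout, where a teacher-forced forward pass returns the attention matrix as its last output:

```python
import torch

@torch.no_grad()
def extract_alignment(model, batch):
    # Teacher-forced pass through a pretrained Tacotron2; in the NVIDIA repo
    # the model returns (mel_out, mel_out_postnet, gate_out, alignments).
    model.eval()
    x, _ = model.parse_batch(batch)   # same batch format as in train.py
    _, _, _, alignments = model(x)
    return alignments.cpu().numpy()   # (batch, mel_frames, text_len)
```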
Edit: The paper also uses L1 loss on the decoder. Do you have any idea why that is, or whether it would affect results? I'll leave it as MSE for now, but it may be worth testing another time.
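If anyone wants to test it, swapping the losses is a small change; a hedged sketch modeled on the repo's Tacotron2Loss (the class name here is hypothetical):

```python
import torch.nn as nn

class Tacotron2L1Loss(nn.Module):
    # Hypothetical variant of the repo's Tacotron2Loss with L1 instead of MSE
    # on the two mel predictions; the gate loss is unchanged.
    def __init__(self):
        super().__init__()
        self.l1 = nn.L1Loss()
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, model_output, targets):
        mel_target, gate_target = targets
        mel_out, mel_out_postnet, gate_out, _ = model_output
        mel_loss = self.l1(mel_out, mel_target) + self.l1(mel_out_postnet, mel_target)
        gate_loss = self.bce(gate_out.view(-1, 1), gate_target.view(-1, 1))
        return mel_loss + gate_loss
```

The usual argument for L1 is that it is less sensitive to outlier frames than MSE, which tends to give slightly sharper spectrograms.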
You could try L1 loss, but I think MSE is fine. Using LJSpeech is not a good idea, because the LJSpeech dataset only has short audio clips, so the alignments for long sentences are not good. You should try to follow the method of this paper.
@CookiePPP I think even if you use the Montreal Forced Aligner, you will still face the whitespace problem.
@CookiePPP Thanks for sharing. It is a good idea to use the Montreal Forced Aligner. Once we get the phoneme segmentation information, how do you generate the alignment information out of it to give as input during training?
@kannadaraj I've not run the Montreal Forced Aligner before, so I'll get back to you if/when I use it. I'd like to look into Mellotron first, since I wouldn't have to modify much and Mellotron can already align both graphemes and phonemes. edit: I'm distracted looking at Jukebox atm.
@chazo1994 I tried MFA and encountered the problem with whitespace as well as with missing punctuation, so I just wrote a script to restore the missing parts. I compare the word-level representation from MFA with my text, so I am able to guess where the missing symbols have to be put. In my implementation the script takes some time from the nearest symbol to restore the missing one. But I'm not sure it's a good idea anyway; I haven't tested it yet.
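The restoration logic is roughly as below; a simplified sketch of the idea only (the names and the 20% share are my own choices, and it assumes the utterance does not start with a dropped symbol):

```python
PAUSE_SHARE = 0.2  # fraction of the neighbor's duration given to a restored symbol

def restore_symbols(intervals, text_tokens):
    """intervals: ordered (token, start, end) triples from MFA;
    text_tokens: the full token sequence from the original text."""
    restored, it = [], iter(intervals)
    cur = next(it, None)
    for tok in text_tokens:
        if cur is not None and tok == cur[0]:
            restored.append(cur)          # MFA aligned this token, keep as-is
            cur = next(it, None)
        else:
            # Missing symbol (space/punctuation): carve time off the previous token.
            prev_tok, prev_start, prev_end = restored[-1]
            cut = (prev_end - prev_start) * PAUSE_SHARE
            restored[-1] = (prev_tok, prev_start, prev_end - cut)
            restored.append((tok, prev_end - cut, prev_end))
    return restored
```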
@kannadaraj I used this code
```python
import numpy as np


def convert_to_matrix(phonemes, duration, sample_rate, window, hop):
    """Build a (mel_frames, num_phonemes) alignment matrix from forced-aligner
    phoneme intervals, where start/end are in seconds."""
    samples = int(duration * sample_rate)
    alignments = np.zeros((samples, len(phonemes)), dtype=np.float32)
    for i, phoneme in enumerate(phonemes):
        start = int(sample_rate * phoneme.start)
        end = int(sample_rate * phoneme.end)
        alignments[start:end, i] = 1  # sample-level occupancy per phoneme
    return align_as_spect(alignments, hop, window)


def align_as_spect(alignments, hop, window):
    """Pool the sample-level matrix down to STFT frames by averaging each
    analysis window, mirroring a centered STFT (hence the zero padding)."""
    samples, phonemes = alignments.shape
    pad = window // 2
    # Frame count of a centered STFT with this window/hop.
    mel_length = (samples + 2 * pad - (window - 1) - 1) // hop + 1
    padded = np.pad(alignments, ((pad, pad), (0, 0)))  # match STFT centering
    mel_alignments = np.zeros((mel_length, phonemes), dtype=np.float32)
    for i in range(mel_length):
        frame = padded[i * hop:i * hop + window]
        mel_alignments[i] = frame.mean(axis=0)  # fraction of each window per phoneme
    return mel_alignments
```
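A hypothetical usage example (the `Phoneme` container is a stand-in for whatever the aligner produces; hop 256 / window 1024 at 22050 Hz match the repo's LJSpeech defaults):

```python
from collections import namedtuple

Phoneme = namedtuple("Phoneme", ["symbol", "start", "end"])  # times in seconds

phones = [Phoneme("HH", 0.00, 0.08), Phoneme("AH", 0.08, 0.21)]
align = convert_to_matrix(phones, duration=0.21, sample_rate=22050,
                          window=1024, hop=256)
print(align.shape)  # (mel_frames, 2)
```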
UPD: @kannadaraj I'm sorry, but the previous code was not quite correct. I changed my code above, so don't use the old one.
@Dekakhrone thanks a lot.. I will try it out..
Would love to hear some samples using pre-alignment guided attention!
Hi @rafaelvalle! First of all, thanks for this repo!
Unfortunately, my samples are not representative, for the following reasons:
@Dekakhrone @CookiePPP I would like to thank you both for the forced-alignment help. It helped to train the alignments in a more robust manner. The generated outputs sound pretty good.
@Dekakhrone As in the discussion above, I used the alignment procedure. But to learn the pauses better, I built a simple linear embedding that represents the pauses in each line and concatenated it to the text encoder output embedding. That helped the system learn the alignment and the pauses simultaneously.
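A sketch of how I read this description (every name and dimension here is an assumption, and the encoder is assumed to return `(B, T_text, enc_dim)`): a small per-token embedding marking pause/no-pause, concatenated onto the encoder output before attention.

```python
import torch
import torch.nn as nn

class PauseAugmentedEncoder(nn.Module):
    def __init__(self, encoder, n_pause_classes=4, pause_dim=16):
        super().__init__()
        self.encoder = encoder                            # existing text encoder
        self.pause_emb = nn.Embedding(n_pause_classes, pause_dim)

    def forward(self, text_inputs, input_lengths, pause_labels):
        enc = self.encoder(text_inputs, input_lengths)    # (B, T_text, enc_dim)
        pause = self.pause_emb(pause_labels)              # (B, T_text, pause_dim)
        return torch.cat([enc, pause], dim=-1)            # attention sees both
```

Note that the attention/decoder memory dimension would have to grow by `pause_dim` for this to plug in.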
@kannadaraj I'm glad to hear that the information was helpful to you! But I'm afraid I didn't get your approach; could you tell me more about it?
Has anyone tried to implement this paper (https://ieeexplore.ieee.org/document/8703406), which uses a pre-aligned posterior to guide attention? I tried to implement this method, but I don't know how to map the whitespace characters in the alignment matrix (phoneme representation) to the pre-aligned posterior, because the forced alignment task doesn't align any whitespace.
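For reference, one simple way to apply such a pre-aligned target (a sketch only; the paper's exact loss may differ, and the whitespace masking is one possible workaround for the question above, not something the paper prescribes): penalize the decoder attention for deviating from the forced-alignment matrix, and exclude the whitespace columns the aligner never produced.

```python
import torch

def prealignment_guided_loss(att_weights, target_align, ws_mask=None, weight=1.0):
    """att_weights, target_align: (B, mel_frames, text_len), rows summing to 1.
    ws_mask: optional (B, text_len) bool tensor, True at whitespace positions
    that the forced aligner never saw; those columns are ignored."""
    diff = (att_weights - target_align) ** 2
    if ws_mask is not None:
        diff = diff * (~ws_mask).unsqueeze(1).float()  # zero out whitespace columns
    return weight * diff.mean()
```

In practice `weight` would typically be annealed toward zero over training, so the model can eventually deviate from the imperfect forced alignments.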