NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License
5.06k stars 1.38k forks

Implement pre-alignment guided attention #342

Open chazo1994 opened 4 years ago

chazo1994 commented 4 years ago

Has anyone tried to implement this paper https://ieeexplore.ieee.org/document/8703406, which uses pre-aligned posteriors to guide attention? I tried to implement the method, but I don't know how to map the whitespace characters in the alignment matrix (phoneme representation) to the pre-aligned posterior, because forced alignment doesn't align whitespace at all.

CookiePPP commented 4 years ago

@chazo1994 I tried to follow the idea. Here is my repo and the working(?) Colab file.

I take the alignments from the LJSpeech pretrained model. In the future I will try the Montreal Forced Aligner or Mellotron to generate alignments, since the LJSpeech model does not produce very accurate alignments and will decrease quality when used with different speakers.


Edit: The paper also uses L1 loss on the decoder. Do you have any idea why that is, or whether it would affect results? I'll leave it as MSE for now, but it may be worth testing another time.
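For reference, swapping between the two decoder reconstruction losses is a drop-in change; here is a minimal numpy sketch of how the two behave on the same residuals (toy values, not taken from the repo):

```python
import numpy as np

# Toy stand-ins for a decoder output and its target mel-spectrogram,
# shaped (n_mel_channels, n_frames); the values are illustrative only.
mel_out = np.array([[0.0, 1.0], [2.0, 3.0]])
mel_target = np.array([[0.5, 1.0], [1.0, 3.0]])

mse = np.mean((mel_out - mel_target) ** 2)  # loss used in this repo
l1 = np.mean(np.abs(mel_out - mel_target))  # loss the paper uses

# MSE squares the residuals, so the one large error (|2 - 1| = 1)
# dominates; L1 weights every residual linearly, which is sometimes
# argued to give less over-smoothed spectrograms.
print(mse, l1)  # 0.3125 0.375
```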

chazo1994 commented 4 years ago

> @chazo1994 I tried to follow the idea. Here is my repo and the working(?) Colab file.
>
> I take the alignments from the LJSpeech pretrained model. In the future I will try the Montreal Forced Aligner or Mellotron to generate alignments, since the LJSpeech model does not produce very accurate alignments and will decrease quality when used with different speakers.
>
> Edit: The paper also uses L1 loss on the decoder. Do you have any idea why that is, or whether it would affect results? I'll leave it as MSE for now, but it may be worth testing another time.

You could try L1 loss, but I think MSE is fine. Using LJSpeech is not a good idea, because the LJSpeech dataset only has short audio clips, so the alignments for long sentences are not good. You should try to follow the method of the paper.

chazo1994 commented 4 years ago

@CookiePPP I think even if you use the Montreal Forced Aligner, you will still face the whitespace problem.

kannadaraj commented 4 years ago

@CookiePPP thanks for sharing. It is a good idea to use the Montreal Forced Aligner. Once we get the phoneme segmentation information, how do you generate the alignment information out of that to give as input during training?

CookiePPP commented 4 years ago

@kannadaraj I've not run the Montreal Forced Aligner before, so I'll get back to you if/when I use it. I'd like to look into Mellotron first, since I wouldn't have to modify much and Mellotron can already align both graphemes and phonemes. Edit: I'm distracted looking at Jukebox at the moment.

Dekakhrone commented 4 years ago

@chazo1994 I tried MFA and encountered the problem with whitespace, as well as with missing punctuation, so I wrote a script to restore the missing parts. I compare the word-level representation from MFA with my text, so I am able to guess where the missing symbols have to be put. In my implementation the script takes some time from the nearest symbol to restore the missing one. But I'm not sure it's a good idea anyway; I haven't tested it yet.
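To make the "take some time from the nearest symbol" idea concrete, here is my own rough sketch (not @Dekakhrone's actual script) of inserting a symbol the aligner skipped, with times in integer milliseconds; `insert_missing` and all values are invented for illustration:

```python
def insert_missing(aligned, index, symbol, borrow=20):
    """Insert `symbol` before position `index` in a list of
    (symbol, start_ms, end_ms) intervals, borrowing `borrow` ms from
    the end of the previous interval (or from the start of the next
    one when inserting at the front)."""
    aligned = list(aligned)
    if index > 0:
        sym, start, end = aligned[index - 1]
        split = max(start, end - borrow)      # shrink the left neighbour
        aligned[index - 1] = (sym, start, split)
        aligned.insert(index, (symbol, split, end))
    else:
        sym, start, end = aligned[0]
        split = min(end, start + borrow)      # shrink the right neighbour
        aligned[0] = (sym, split, end)
        aligned.insert(0, (symbol, start, split))
    return aligned

words = [("hello", 0, 400), ("world", 400, 850)]
# Suppose the aligner dropped the space between the words; give it
# the last 20 ms of "hello":
restored = insert_missing(words, 1, " ")
print(restored)
# [('hello', 0, 380), (' ', 380, 400), ('world', 400, 850)]
```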

Dekakhrone commented 4 years ago

@kannadaraj I used this code

import numpy as np


def convert_to_matrix(phonemes, duration, sample_rate, window, hop):
    """Build a sample-level 0/1 alignment matrix from phoneme timestamps,
    then downsample it to mel-spectrogram frame resolution."""
    samples = int(duration * sample_rate)
    alignments = np.zeros((samples, len(phonemes)), dtype=np.int32)

    for i, phoneme in enumerate(phonemes):
        # phoneme.start / phoneme.end are in seconds; convert to samples.
        start = int(sample_rate * phoneme.start)
        end = int(sample_rate * phoneme.end)

        alignments[start:end, i] = 1

    spect_alignments = align_as_spect(alignments, hop, window)

    return spect_alignments


def align_as_spect(alignments, hop, window):
    """Average the sample-level alignment over STFT windows so each row
    corresponds to one mel frame (assuming a centre-padded STFT)."""
    samples, phonemes = alignments.shape

    pad = window // 2
    mel_length = (samples + 2 * pad - (window - 1) - 1) // hop + 1
    mel_alignments = np.zeros((mel_length, phonemes), dtype=np.float32)

    for i in range(mel_length):
        for j in range(phonemes):
            # Fraction of this frame's window covered by phoneme j
            # (an empty slice past the last sample counts as silence).
            segment = alignments[i * hop:i * hop + window, j]
            mel_alignments[i, j] = segment.mean() if segment.size else 0.0

    return mel_alignments

UPD: @kannadaraj I'm sorry, but the previous code was not quite correct. I've changed the code above, so don't use the old one.
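As a side note, with `pad = window // 2` the `mel_length` formula above simplifies to `samples // hop + 1` for an even window, which is the frame count of a centre-padded STFT; a quick sanity check of just that formula:

```python
def mel_frames(samples, window, hop):
    # Same frame-count expression as in align_as_spect above.
    pad = window // 2
    return (samples + 2 * pad - (window - 1) - 1) // hop + 1

# For an even window, samples + 2 * (window // 2) - window == samples,
# so the expression collapses to samples // hop + 1.
for samples in (22050, 48000, 12345):
    assert mel_frames(samples, window=1024, hop=256) == samples // 256 + 1
```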

kannadaraj commented 4 years ago

@Dekakhrone thanks a lot. I will try it out.

rafaelvalle commented 4 years ago

Would love to hear some samples using pre-alignment guided attention!

Dekakhrone commented 4 years ago

Hi @rafaelvalle! First of all, thanks for this repo!

Unfortunately, my samples will not be representative for you, for the following reasons:

  1. I trained my tacotron modification on a Russian dataset;
  2. As I mentioned above, MFA quite often produced timestamps without taking spaces and punctuation marks into account, so I wrote code to fix this.

After that, I launched three trainings simultaneously: conventional Tacotron 2, Tacotron 2 with diagonal guided attention (DGA), and Tacotron 2 with pre-aligned guided attention (PGA). In the end the best result was shown by the model with DGA and, surprisingly, the worst by the one with PGA. The poor PGA results were probably triggered by my manipulations with the timestamps, but I don't know the exact reason.

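For context, the DGA variant mentioned above is usually implemented as a soft penalty on attention mass far from the diagonal (Tachibana et al.'s guided attention loss); a minimal numpy sketch, with parameter names of my own choosing:

```python
import numpy as np

def guided_attention_weight(n_text, n_mel, g=0.2):
    """Penalty mask W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)):
    zero on the diagonal, approaching one far away from it."""
    t = np.arange(n_mel)[:, None] / n_mel
    n = np.arange(n_text)[None, :] / n_text
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

# attn would be the (n_mel_frames, n_text) attention matrix produced
# by the model; here a toy uniform one.
attn = np.full((4, 4), 0.25)
w = guided_attention_weight(4, 4)
dga_loss = np.mean(attn * w)  # added on top of the usual Tacotron 2 loss
```

The Gaussian width `g` controls how strictly attention is pushed toward the diagonal; 0.2 is the value from the original guided-attention paper, not necessarily what was used here.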
kannadaraj commented 4 years ago

@Dekakhrone @CookiePPP I would like to thank you both for the forced alignment help. It helped to train the alignments in a more robust manner. The generated outputs sound pretty good.

@Dekakhrone as in the discussion above, I used the alignment procedure. But to learn the pauses better, I built a simple linear embedding that represents the pauses in each line and concatenated it to the text encoder output embedding. That helped the system learn the alignment and the pauses simultaneously.
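If I've read the description right, this amounts to an embedding table indexed by a per-token pause label, concatenated channel-wise onto the encoder outputs; here is a numpy sketch of the shapes involved (all names, sizes, and labels are my guesses, not @kannadaraj's code):

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, enc_dim, pause_dim = 6, 512, 8
encoder_out = rng.standard_normal((n_tokens, enc_dim))

# Per-token pause labels, e.g. 0 = no pause after the token,
# 1 = short pause, 2 = long pause.
pause_labels = np.array([0, 0, 1, 0, 0, 2])

# A learnable embedding table (here just random) with one row per class;
# indexing it with the labels is what nn.Embedding does in PyTorch.
pause_table = rng.standard_normal((3, pause_dim))
pause_emb = pause_table[pause_labels]           # (n_tokens, pause_dim)

# Concatenate onto the encoder output, so the decoder attends over
# vectors carrying both text content and pause information.
enc_with_pauses = np.concatenate([encoder_out, pause_emb], axis=-1)
```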

Dekakhrone commented 4 years ago

@kannadaraj I'm glad to hear that the information was helpful to you! But I'm afraid I didn't quite get your approach; could you tell me more about it?