NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Something wrong with text padding #96

Closed arijitx closed 3 years ago

arijitx commented 3 years ago

The original tacotron2 code in https://github.com/NVIDIA/tacotron2/blob/master/text/symbols.py#L18 defines

symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet

whereas in this repo

symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet

Since zero padding is applied to the text ids, the padded portion of each text is effectively filled with the first symbol. I suspect this is the reason people get strange attention weights when training on a different symbol set.
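A minimal sketch of the difference, with the symbol lists truncated here for illustration (the full definitions live in each repo's text/symbols.py):

# Sketch only: truncated symbol lists to show where index 0 ends up.
# Original tacotron2: a dedicated pad symbol comes first, so id 0 is reserved for padding.
_pad = '_'
tacotron2_symbols = [_pad] + list('-') + list("!'(),.:;? ") + list('abc')   # truncated

# This repo: punctuation comes first, so id 0 is a real token.
mellotron_symbols = list("!'\",.:;? " + '#%&*+-/[]()' + 'abc')              # truncated

print(tacotron2_symbols[0])  # '_' -> padding with 0 collides with nothing meaningful
print(mellotron_symbols[0])  # '!' -> padding with 0 is indistinguishable from '!'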

rafaelvalle commented 3 years ago

where did you find code that zero pads the text? this operation symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet is only creating a set of symbols, here represented as a list.

arijitx commented 3 years ago

Hi, sorry for not stating it properly. In https://github.com/NVIDIA/mellotron/blob/master/data_utils.py#L124 you pad the batch of text ids with zeros, whereas in your symbol set the 0th symbol is the "!" punctuation mark, so text id 0 is used both for padding and for "!". It may not be a huge problem in the default case, but if someone modifies the symbol set with their own tokens, e.g. phones or non-English characters, it can become a problem, since the symbols[0] token may be a commonly used phone or character.
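As a concrete illustration of that case (this phone list is made up, not a real config):

# Hypothetical custom symbol set: a very common phone ends up at index 0.
symbols = ['@AH', '@N', '@T', '@S', '@IH']     # illustration only
symbol_to_id = {s: i for i, s in enumerate(symbols)}

# every occurrence of '@AH' becomes id 0 ...
seq = [symbol_to_id[s] for s in ['@S', '@AH', '@N']]
print(seq)            # [3, 0, 1]

# ... which is exactly the value the collate function later uses to pad short
# sequences, so the encoder sees padding and '@AH' as the same input id
print(seq + [0, 0])   # [3, 0, 1, 0, 0] -- the trailing pads look like two extra '@AH's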

rafaelvalle commented 3 years ago

~i see your point. this can have an impact on the encoder's text encoding (https://github.com/NVIDIA/mellotron/blob/master/model.py#L186)

i've replaced the text encoder's forward to handle padding differently.~

actually, now i really see your point. i think you're mixing up text id zero and zero padding. the character at id zero is the first entry in the token embedding. it is a vector, very likely not all zeros. zero padding will pad the sequence of token embeddings with all-zero vectors, very likely different from anything in the token embedding lookup table, layer if you will...
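a quick sketch of what i mean, with a random embedding standing in for the model's symbol embedding layer:

import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # stand-in for the symbol embedding

ids = torch.LongTensor([0])      # text id 0 ('!' in this repo's symbol set)
print(embedding(ids))            # a learned vector, almost surely not all zeros

# padding the *embedded* sequence with zeros is a different thing:
zero_vec = torch.zeros(1, 4)
print(torch.allclose(embedding(ids), zero_vec))  # False, unless padding_idx=0 were set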

the changes i recently made to mellotron fix a potential issue with batch norm statistics when the inputs are padded and very different in length.
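roughly the kind of effect i mean, as a toy example (not the actual mellotron code): when lengths differ a lot, the padded frames enter the batch norm statistics unless they are masked out:

import torch
import torch.nn as nn

torch.manual_seed(0)

# toy batch: two sequences of very different length, zero-padded to length 100
x = torch.randn(2, 8, 100)                               # (batch, channels, time)
lengths = torch.tensor([100, 10])
mask = torch.arange(100)[None, :] < lengths[:, None]     # (batch, time) validity mask
x = x * mask[:, None, :].float()                         # zero out the padded frames

bn = nn.BatchNorm1d(num_features=8, momentum=1.0)        # momentum=1.0: running stats = this batch's stats
bn.train()
_ = bn(x)

valid_frames = mask.sum().float()                        # 110 valid frames out of 200
masked_mean = x.sum(dim=(0, 2)) / valid_frames           # per-channel mean over valid frames only
print(bn.running_mean)                                   # computed over all 200 frames, padding included
print(masked_mean)                                       # noticeably different when padding dominates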

arijitx commented 3 years ago

Sorry, I probably misread your code. Are you saying that the zero padding for the encoder input is done on the embeddings, not the raw text_ids? If so, that makes sense, but here https://github.com/NVIDIA/mellotron/blob/master/data_utils.py#L125 it seems to pad the text ids, not the embeddings:

text_padded = torch.LongTensor(len(batch), max_input_len)
text_padded.zero_()

It initializes a tensor of all zeros and then stores the text_ids for each entry in the batch, so there is still a chance for the model to confuse symbols[0], which is "!" in your case, with padding. For example, for a batch like ["! hello !", "hi!"], with the symbols mapped as ! = 0, h = 1, e = 2, l = 3, o = 4, i = 5, the input to the model will be

0 1 2 3 3 4 0
1 5 0 0 0 0 0

where the trailing zeros in the second row are padding but look identical to the "!" token.
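To make it concrete, a small sketch that pads the toy batch above the same way the collate function in data_utils.py does and decodes it back through the symbol table:

import torch

symbols = ['!', 'h', 'e', 'l', 'o', 'i']                 # toy symbol table from the example above
symbol_to_id = {s: i for i, s in enumerate(symbols)}

batch = ['! hello !', 'hi!']
seqs = [[symbol_to_id[c] for c in text if c != ' '] for text in batch]

# zero-pad the text ids to the longest sequence in the batch
max_len = max(len(s) for s in seqs)
text_padded = torch.LongTensor(len(seqs), max_len)
text_padded.zero_()
for i, s in enumerate(seqs):
    text_padded[i, :len(s)] = torch.LongTensor(s)

print(text_padded)
# tensor([[0, 1, 2, 3, 3, 4, 0],
#         [1, 5, 0, 0, 0, 0, 0]])

# decoding back shows the ambiguity: padding reads as extra '!' tokens
print([''.join(symbols[i] for i in row) for row in text_padded.tolist()])
# ['!hello!', 'hi!!!!!']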

Please let me know if my understanding is wrong.

arijitx commented 3 years ago

I just saw your fix https://github.com/NVIDIA/mellotron/commit/d5362ccae23984f323e3cb024a01ec1de0493aff, I think it fixes the issue :) Thanks :)