NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License
854 stars 187 forks source link

Replace split word_tokenize to deal with punctuations. #75

Closed sungjae-cho closed 3 years ago

sungjae-cho commented 4 years ago

The previous line uses split by splitting blanks, to get a list of words from a character string. With this approach, words concatenated with punctuation are considered a word, which results in the out-of-vocabulary problem for the CMU dictionary. I got the following sentence below with the previous implementation.

"{L AO1 NG} {N EH1 R OW0} {R UW1 M Z} -- {W AH1 N} thirty-six feet, {S IH1 K S} {T W EH1 N T IY0 TH R IY2} feet, {AH0 N D} {DH AH0} {EY1 T TH} eighteen,"

So, I suggest to use the NLTK tokenizer to split punctuation from words.

Jeevesh8 commented 3 years ago

Yes, please do merge. This is needed.

rafaelvalle commented 3 years ago

How about an approach like the one below that doesn't require additional dependencies?

_words_re = re.compile(r"([a-zA-ZÀ-ž]+['][a-zA-ZÀ-ž]{1,2}|[a-zA-ZÀ-ž]+)|([{][^}]+[}]|[^a-zA-ZÀ-ž{}]+)")
words = _words_re.findall(text)
clean_text = [get_arpabet(word[0], dictionary)
              if ((word[0] != '') and random.random() < p_arpabet) else word[1]
              for word in words]
Jeevesh8 commented 3 years ago

@rafaelvalle In my limited knowledge, the nltk approach would be easier to modify when training on a different language, though(Just adding language='language_name' as an argument to word_tokenize() ). But this one is better, otherwise.

rafaelvalle commented 3 years ago

Right, one problem with NLTK is that it will break if arpabet is passed as input, e.g.: text = "{L AO1 NG} {N EH1 R OW0} Rooms -- One thirty-six feet, Six Twenty Three feet,"

Jeevesh8 commented 3 years ago

@rafaelvalle The point where the modification is made, is entered only when there is no ARPAbet remaining in the text variable; I guess. NLTK internally uses regexes only,if there's only a single line in input, I think. See here and here. But those regexes are language agnostic. And cover more general cases. So we can make your regex language-agnostic too.

Jeevesh8 commented 3 years ago

Also, I think, it should have been

sequence += text_to_sequence(m.group(1), cleaner_names, dictionary, p_arpabet)

here.

Since we probably need to find pronunciation of the words in 1-st group, from the ARPAbet dictionary too. Shall I open a PR/Issue ?

Jeevesh8 commented 3 years ago

@rafaelvalle

rafaelvalle commented 3 years ago

I've recently updated the repo to account for the cases discussed here, including arpabet input in text and text surrounded by punctuations. https://github.com/NVIDIA/mellotron/commit/b44dc8a172206722eeac4dd7e169e0537af7f7a3

Jeevesh8 commented 3 years ago

@rafaelvalle I still don't think that the following code

word = _words_re.findall(_clean_text(m.group(1), cleaner_names))
    if len(word):
        sequence += _symbols_to_sequence(word[0][1])
    sequence += _arpabet_to_sequence(m.group(2))

is searching for words in the first group, in the dictionary. Am I wrong ? What am I missing ?

rafaelvalle commented 3 years ago

Yes, thank your for observing that. I'll update the code soon. Code is updated.

Jeevesh8 commented 3 years ago

Thank you so much @rafaelvalle . All your work is very beautiful :)

sungjae-cho commented 3 years ago

@Jeevesh8 Thank you for leading this discussion! The problem has been solved. Thus, I close this issue.

rafaelvalle commented 3 years ago

@Jeevesh8 thank you for your kind words. Please share your work if you use any of our work!

Jeevesh8 commented 3 years ago

@sungjae-cho Thank you for opening the pr first ! :smile: Yes, I sure will @rafaelvalle . :+1: