Closed sungjae-cho closed 3 years ago
Yes, please do merge. This is needed.
How about an approach like the one below that doesn't require additional dependencies?
_words_re = re.compile(r"([a-zA-ZÀ-ž]+['][a-zA-ZÀ-ž]{1,2}|[a-zA-ZÀ-ž]+)|([{][^}]+[}]|[^a-zA-ZÀ-ž{}]+)")
words = _words_re.findall(text)
clean_text = [get_arpabet(word[0], dictionary)
if ((word[0] != '') and random.random() < p_arpabet) else word[1]
for word in words]
@rafaelvalle In my limited knowledge, the nltk approach would be easier to modify when training on a different language, though(Just adding language='language_name'
as an argument to word_tokenize()
). But this one is better, otherwise.
Right, one problem with NLTK is that it will break if arpabet is passed as input, e.g.:
text = "{L AO1 NG} {N EH1 R OW0} Rooms -- One thirty-six feet, Six Twenty Three feet,"
@rafaelvalle The point where the modification is made, is entered only when there is no ARPAbet remaining in the text
variable; I guess.
NLTK internally uses regexes only,if there's only a single line in input, I think. See here and here. But those regexes are language agnostic. And cover more general cases. So we can make your regex language-agnostic too.
Also, I think, it should have been
sequence += text_to_sequence(m.group(1), cleaner_names, dictionary, p_arpabet)
here.
Since we probably need to find pronunciation of the words in 1-st group, from the ARPAbet dictionary too. Shall I open a PR/Issue ?
@rafaelvalle
I've recently updated the repo to account for the cases discussed here, including arpabet input in text and text surrounded by punctuations. https://github.com/NVIDIA/mellotron/commit/b44dc8a172206722eeac4dd7e169e0537af7f7a3
@rafaelvalle I still don't think that the following code
word = _words_re.findall(_clean_text(m.group(1), cleaner_names))
if len(word):
sequence += _symbols_to_sequence(word[0][1])
sequence += _arpabet_to_sequence(m.group(2))
is searching for words in the first group, in the dictionary. Am I wrong ? What am I missing ?
Yes, thank your for observing that. I'll update the code soon. Code is updated.
Thank you so much @rafaelvalle . All your work is very beautiful :)
@Jeevesh8 Thank you for leading this discussion! The problem has been solved. Thus, I close this issue.
@Jeevesh8 thank you for your kind words. Please share your work if you use any of our work!
@sungjae-cho Thank you for opening the pr first ! :smile: Yes, I sure will @rafaelvalle . :+1:
The previous line uses
split
by splitting blanks, to get a list of words from a character string. With this approach, words concatenated with punctuation are considered a word, which results in the out-of-vocabulary problem for the CMU dictionary. I got the following sentence below with the previous implementation."{L AO1 NG} {N EH1 R OW0} {R UW1 M Z} -- {W AH1 N} thirty-six feet, {S IH1 K S} {T W EH1 N T IY0 TH R IY2} feet, {AH0 N D} {DH AH0} {EY1 T TH} eighteen,"
So, I suggest to use the NLTK tokenizer to split punctuation from words.