fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/

deal with a new word and add pronunciation #65

lalimili6 commented 5 years ago

Hi. How can the model learn new words it has never seen? How can it learn a new pronunciation, and how can it handle the cross-word pronunciation of two neighboring words? In HMM-based systems we can generate the phones of a sentence and pass them to the model; what about in this end-to-end model? I think the network is too closed for that. Is it possible to record a few new waves from a new speaker and pass them to the model to learn? Best regards

oytunturk commented 5 years ago

This implementation supports text input only. It should be straightforward to add phoneme input support if you can already generate pronunciations for your input text. If you have a pronunciation model that applies cross-word phonology, the Tacotron network should be able to learn it from the training data.
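For illustration, a minimal sketch of that front-end swap, assuming NLTK's bundled CMUdict; `to_phonemes` is a hypothetical helper, and wiring its output into the repo's symbol table is left out:

```python
import nltk
from nltk.corpus import cmudict

nltk.download('cmudict', quiet=True)
PRON = cmudict.dict()  # word -> list of ARPAbet pronunciations

def to_phonemes(text):
    """Swap each word for its first CMUdict pronunciation; words missing
    from the dictionary fall back to their graphemes, which mirrors the
    mixed phoneme/grapheme training examples mentioned later in this thread."""
    out = []
    for word in text.lower().split():
        prons = PRON.get(word.strip('.,!?'))
        out.append(' '.join(prons[0]) if prons else word)
    return ' '.join(out)

print(to_phonemes('a sequence of words'))
# -> 'AH0 S IY1 K W AH0 N S AH1 V W ER1 D Z' (or similar; first variants only)
```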

Passing reference waveforms as input requires expanding the Tacotron network to accept additional input. It might be easier to pass a speaker embedding vector along with the text or phoneme input. You may want to look into the various open-source speaker verification implementations to extract speaker embeddings. It is possible to extend the Tacotron network with a similar speaker-embedding encoder; that's actually what they do in Tacotron Global Style Tokens.
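As a rough sketch of the embedding route (PyTorch, with illustrative names and dimensions rather than this repo's actual modules): broadcast a per-speaker vector across time and concatenate it onto the encoder outputs before attention.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, encoder, num_speakers, spk_dim=64):
        super().__init__()
        self.encoder = encoder                        # existing Tacotron encoder
        self.spk_table = nn.Embedding(num_speakers, spk_dim)

    def forward(self, text, speaker_id):
        enc_out = self.encoder(text)                  # [B, T, enc_dim]
        spk = self.spk_table(speaker_id)              # [B, spk_dim]
        spk = spk.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        return torch.cat([enc_out, spk], dim=-1)      # [B, T, enc_dim + spk_dim]
```

The embedding table could equally be replaced by vectors from a pretrained speaker verification model; the concatenation point stays the same.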

lalimili6 commented 5 years ago

Thanks. For training on phones instead of text, should we train on a sequence of individual phones, e.g. `s ē k w ə n s w o r d`, or on a word-level sequence, e.g. `sēkwəns word`?

acrosson commented 5 years ago

Hey @oytunturk, I'm interested in learning how to approach this as well. If we used CMU's dictionary to map words to phonemes, what approach would you take for the input to the Tacotron model?

Like @lalimili6's question: should we separate each phoneme with a space and use a separate symbol for pauses (or no pause symbol at all), or should we concatenate the phonemes together and treat them like words?

oytunturk commented 5 years ago

I think separating phonemes with spaces would be the correct approach. Concatenating phonemes is basically the same as providing words as input with a different set of graphemes. It might be useful to introduce a sentence-start symbol and long/short pause symbols, assuming you can get them from an ASR model or predict them somehow from the recordings. I think the Tacotron implementation already handles end of sentence by predicting a 'stop' symbol.
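To make the layout concrete, here is a minimal sketch of that token sequence; the `<s>`, `<sp>`, and `<stop>` marker names are hypothetical, not symbols this repo defines:

```python
def encode_utterance(word_prons, pauses):
    """word_prons: list of phoneme lists, one per word.
    pauses: set of word indices followed by a pause (e.g. from an aligner)."""
    tokens = ['<s>']                    # hypothetical sentence-start symbol
    for i, pron in enumerate(word_prons):
        tokens.extend(pron)             # one token per phoneme
        if i in pauses:
            tokens.append('<sp>')       # hypothetical short-pause symbol
    tokens.append('<stop>')             # end-of-utterance marker
    return tokens

print(encode_utterance([['S', 'IY1', 'K', 'W', 'AH0', 'N', 'S'],
                        ['W', 'ER1', 'D', 'Z']], pauses={0}))
# ['<s>', 'S', 'IY1', 'K', 'W', 'AH0', 'N', 'S', '<sp>', 'W', 'ER1', 'D', 'Z', '<stop>']
```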

Providing phonemes as input should in theory speed up Tacotron training, since the network has less to learn. However, mismatches between the actual recordings and phonemes from a dictionary may cause problems. Force-aligned phoneme sequences may work better if they are based on a reliable acoustic model. I'm not entirely sure which option would work best; I think it requires some experimentation.
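If you go the forced-alignment route, here is a hedged sketch of extracting the aligned phoneme sequence, assuming Montreal Forced Aligner output and the third-party `textgrid` package (`pip install textgrid`); the tier name and silence labels may differ in your setup:

```python
import textgrid

def aligned_phonemes(textgrid_path):
    """Read the phoneme labels from an MFA-style TextGrid, dropping silences."""
    tg = textgrid.TextGrid.fromFile(textgrid_path)
    phones = tg.getFirst('phones')  # MFA's default phone tier name
    return [iv.mark for iv in phones
            if iv.mark and iv.mark not in ('sil', 'sp')]
```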

fatchord commented 5 years ago

@lalimili6, @oytunturk, @acrosson In previous experiments with phonemes I found that it didn't work as well as I'd like (it sounded about the same, but not better as I was expecting). I'm not sure why, but I did notice that the CMU dictionary does not cover as many English words as I thought. When inspecting the training examples, most of them still had some graphemes in them.

Currently I am completely refactoring the Tacotron model and making a couple of key changes. I'll try phonemes again with the new model and see how it goes. Also, if any of you know of a bigger dictionary than CMU, please let me know.

lalimili6 commented 5 years ago

Thanks, all. @fatchord Did you check these resources: http://www.openslr.org/resources.php? They are for the Kaldi toolkit and speech recognition training, so they must include lexicons. TED-LIUM (1, 2, 3) should have a lexicon. LibriSpeech (http://www.openslr.org/11/) has a lexicon (http://www.openslr.org/resources/11/librispeech-lexicon.txt) and a G2P model that can be used to extend it. Kaldi also has scripts that can learn a lexicon from waves and transcripts: https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium/s5_r2/local/run_learn_lex_greedy.sh

@oytunturk My question about spaces between phonemes comes down to how Tacotron aligns text and waves. I don't have any evidence, but my sense is that since Tacotron is end-to-end, it would be better not to put spaces between the phonemes.

fatchord commented 5 years ago

@lalimili6 That's a great list of resources, thanks!

oytunturk commented 5 years ago

@lalimili6 Without spaces, Tacotron would treat each word as a separate entity and may not be able to take full advantage of the phonetic context. It would just behave as another word-level representation, maybe with some flexibility to identify words that are transcribed the same but pronounced differently. What I would try first is introducing spaces between phonemes, and maybe even word-end symbols to label word boundaries, pauses, sentence starts, and sentence ends. I think this is also related to how much training data is available: introducing phonemes may work better if you have, say, only several hours of data, which would make it very difficult for Tacotron to learn correct pronunciations straight from text.
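Concretely, the two representations under discussion look like this for "sequence words" (ARPAbet via CMUdict; the `<wb>` word-boundary symbol is hypothetical):

```python
# 1) Concatenated per word: behaves like an alternate word-level alphabet,
#    so the network cannot share phonetic context across words.
concatenated = ['SIY1KWAH0NS', 'WER1DZ']

# 2) Space-separated with word-boundary markers: each phoneme is its own
#    token, and '<wb>' keeps the word structure visible to the model.
separated = ['S', 'IY1', 'K', 'W', 'AH0', 'N', 'S', '<wb>',
             'W', 'ER1', 'D', 'Z', '<wb>']
```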