Closed. NormanTUD closed this issue 2 years ago.
Can you try a bit longer sentence with the word 'hello' in it, occurring multiple times as well as just once, and post the results? Thanks.
Very similar results when running `tts --out_path hello.mp3 --text "hello hello hello"`.
I see. I am away from my PC, so I can't test more right now. What about "hello, my name is Max" or something like that? Does a sentence of normal length perform okay? If all these tests fail, we might need to retrain, I guess.
"Hello my name is max" works perfectly fine.
Is there still something to do? I have no example here right now, but I've seen this behaviour even in some longer sentences.
Thanks
So, I believe it comes down to the dataset the model was trained on. For now, read this part of the docs about what makes a good TTS dataset. I think there are not many "small" or single-word sentences in the dataset for the model to learn from.
Ok, I might be wrong, but this is my experience so far. We do not teach the models to speak using the usual human techniques of learning the alphabet, then single words, then grammar, and finally longer sentences. All we do is teach the model to imitate us while conditioning on the text input. It is like how we can imitate a cat 'mewing': we do not know what the cat means, we just copy the different 'mews'. We condition the 'mewing' on different situations, like 'mew1' for 'hunger', 'mew2' for 'joy', etc. We can imitate the sounds correctly, but we will never know what part of a 'mew' actually means what unless the cat actually decides to teach us 🐸.
Add punctuation: "Hello." Tacotron2 models require stopwords to know when to stop synthesizing. So it is not really a bug, but rather a consequence of how the Tacotron2 architecture works.
> Add punctuation: "Hello." Tacotron2 models require stopwords to know when to stop synthesizing. So it is not really a bug, but rather a consequence of how the Tacotron2 architecture works.
@lexkoro Then why does the sentence "hello my name is max" work correctly without a stopword?
Because at each decoding step the model predicts a stop probability, and if that probability is over a certain threshold (I think it is 0.5 in the code), it stops decoding. For the given sentence it hits the threshold and stops at the correct position, possibly because the model has seen similar data in the training set. If you change or extend the sentence, it might just fail again. So adding a stopword tells the decoder where to stop.
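In pseudocode, that loop looks roughly like the sketch below. The decoder interface here is hypothetical, not Coqui's actual code; only the 0.5 threshold comes from the comment above:

```python
import torch

STOP_THRESHOLD = 0.5  # the value mentioned above; the real code may differ

def decode(step_fn, first_frame, max_steps=1000):
    """Frame-by-frame decoding gated by a predicted stop probability.

    `step_fn` stands in for one Tacotron2 decoder step: it takes the
    previous mel frame and returns (next_frame, stop_logit). This is a
    sketch of the idea, not Coqui's actual decoder interface.
    """
    frames = []
    frame = first_frame
    for _ in range(max_steps):
        frame, stop_logit = step_fn(frame)
        frames.append(frame)
        # The stop probability is re-predicted at every step. If the text
        # never pushes it past the threshold (e.g. a bare "hello"), the
        # loop only ends at max_steps, which yields long, garbled audio.
        if torch.sigmoid(stop_logit).item() > STOP_THRESHOLD:
            break
    return torch.stack(frames)

# Dummy step: the stop logit grows each step, crossing the threshold
# after five frames.
state = {"t": 0}
def dummy_step(frame):
    state["t"] += 1
    return torch.randn(80), torch.tensor(state["t"] - 4.0)

mel = decode(dummy_step, torch.zeros(80))
print(mel.shape)  # torch.Size([5, 80])
```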
@lexkoro Okay! Makes sense now. So maybe if we try 'hello.' with the stopword, it should work. I will wait for @NormanTUD to try it out.
Hi,
yes, "hello." works. Is there a reason not to automatically add "." to input sentences if `$input !~ /\.$/`?
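For illustration, here is that check in Python rather than Perl (the function name and the extra `!`/`?` endings are my own additions, not anything the tts CLI actually does):

```python
import re

# Treat ., !, ? as sentence-final punctuation; the Perl check above
# only looked for a trailing period.
SENTENCE_END = re.compile(r"[.!?]$")

def normalize_text(text: str) -> str:
    """Append a period when the input lacks terminal punctuation."""
    text = text.strip()
    if not SENTENCE_END.search(text):
        text += "."
    return text

print(normalize_text("hello"))                   # hello.
print(normalize_text("Hello, my name is Max."))  # unchanged
```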
You need to end with punctuation for most of the models, since they are trained on datasets in which the texts always end with punctuation.
Describe the bug
Sometimes I get really strange outputs. Like this one:
```
tts --out_path hello.mp3 --text "hello"
```
No idea what I'm doing wrong.
To Reproduce
`tts --out_path hello.mp3 --text "hello"`
Expected behavior
No response
Logs
No response
Environment
Additional context
No response