coqui-ai / TTS

๐Ÿธ๐Ÿ’ฌ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Strange output results from simple words #1529

Closed NormanTUD closed 2 years ago

NormanTUD commented 2 years ago

Describe the bug

Sometimes I get really strange outputs. Like this one:

```
tts --out_path hello.mp3 --text "hello"

tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
Using model: Tacotron2
Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
Model's reduction rate r is set to: 1
Vocoder Model: hifigan
Setting up Audio Processor...
 (same parameters as above, except do_trim_silence:False)
Generator Model: hifigan_generator
Discriminator Model: hifigan_discriminator
Removing weight norm...
Text: hello
Text splitted to sentences.
 ['hello']
Decoder stopped with max_decoder_steps 500
Processing time: 2.7834296226501465
Real-time factor: 0.4366435912025878
Saving output to hello.mp3
```

No idea what I'm doing wrong.

To Reproduce

tts --out_path hello.mp3 --text "hello"

Expected behavior

No response

Logs

No response

Environment

```json
{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.1",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.9.2",
        "version": "#1 SMP Debian 5.10.106-1 (2022-03-17)"
    }
}
```

Additional context

No response

NormanTUD commented 2 years ago

https://user-images.githubusercontent.com/34073778/164756264-0f05cb97-4084-4b3b-9283-da7106c43a72.mp4

p0p4k commented 2 years ago

Can you try a somewhat longer sentence with the word 'hello' occurring in it multiple times, as well as just once, and post the results? Thanks.

NormanTUD commented 2 years ago

Very similar results when doing tts --out_path hello.mp3 --text "hello hello hello":

https://user-images.githubusercontent.com/34073778/165048445-cadca7aa-b942-4b45-98ac-16c0f3c43c22.mp4

p0p4k commented 2 years ago

I see. I am away from my PC, so I can't test more things right now. What about "hello, my name is Max" or something like that? Does a normally long sentence perform okay? If all these tests fail, the model might need retraining, I guess.

NormanTUD commented 2 years ago

"Hello my name is max" works perfectly fine.

Is there still something to do? I have no example here right now, but I've seen this behaviour even in some longer sentences.

Thanks

p0p4k commented 2 years ago

So, I believe it comes down to the dataset the model was trained on. For now, read the docs section on what makes a good TTS dataset. I think there are not many "small" or "single-word" sentences in the dataset for the model to learn from.

p0p4k commented 2 years ago

Ok, I might be wrong, but this is my experience so far. We do not teach the models to speak using the usual human-like sequence of learning the alphabet, then single words, then grammar, and finally longer sentences. All we do is teach the model to imitate us while conditioning on the text input. It is like how we can imitate a cat 'mewing': we do not know what the cat means, we just copy the different 'mews'. We condition the 'mewing' on different situations, like 'mew1' for 'hunger', 'mew2' for 'joy', etc. We can imitate the sounds correctly, but we will never know what part of a mew actually means what unless the cat decides to teach us 😸.

lexkoro commented 2 years ago

Add punctuation "Hello."

Tacotron2 models require stopwords to know when to stop synthesizing.

So it is not really a bug, but rather the way the architecture of tacotron2 works.

p0p4k commented 2 years ago

> Add punctuation "Hello."
>
> Tacotron2 models require stopwords to know when to stop synthesizing.
>
> So it is not really a bug, but rather the way the architecture of tacotron2 works.

@lexkoro Then why does the sentence "hello my name is max" work correctly without a stopword?

lexkoro commented 2 years ago

Because at each decoding step it predicts a stop probability, and if that probability is over a certain threshold (I think it is 0.5 in the code), it stops decoding. For the given sentence it hits the threshold and stops at the correct position, possibly because the model has seen similar data in the training set. If you change or extend the sentence, it might just fail again. So adding a stopword tells the decoder where to stop.
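For readers unfamiliar with this, the mechanism lexkoro describes can be sketched as a toy decoding loop. This is a hypothetical illustration, not the actual coqui-ai/TTS code; `step_fn`, the 0.5 threshold, and the toy probabilities are assumptions taken from the comments and the log above ("Decoder stopped with max_decoder_steps 500"):

```python
def decode(step_fn, max_decoder_steps=500, stop_threshold=0.5):
    """step_fn() returns (frame, stop_probability) for one decoder step."""
    frames = []
    for _ in range(max_decoder_steps):
        frame, stop_prob = step_fn()
        frames.append(frame)
        # The decoder halts cleanly once the predicted stop
        # probability crosses the threshold.
        if stop_prob > stop_threshold:
            return frames, "stop token"
    # Without a clear stop signal (e.g. no final punctuation), the loop
    # runs the full budget and the audio ends with trailing babble.
    return frames, "max_decoder_steps"

# Toy run: the stop probability ramps up and crosses 0.5 at step 4.
probs = iter([0.1, 0.2, 0.4, 0.6])
frames, reason = decode(lambda: ("frame", next(probs)))
print(len(frames), reason)  # 4 stop token
```

In this sketch, "hello" without a period corresponds to a `step_fn` whose stop probability never crosses the threshold, so decoding only ends when `max_decoder_steps` is exhausted.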

p0p4k commented 2 years ago

@lexkoro Okay! Makes sense now. 😅 So, maybe if we try 'hello.' with the stop-word it should work. I will wait for @NormanTUD to try it out.

NormanTUD commented 2 years ago

Hi, yes, "hello." works. Is there a reason not to automatically add "." to input sentences if $input !~ /\.$/?
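The normalization NormanTUD suggests could look something like this in Python. This is a hypothetical pre-processing step done by the caller, not something the tts CLI itself does; the function name is made up, and widening the check to also accept "!" and "?" is an assumption:

```python
import re

def ensure_terminal_punctuation(text: str) -> str:
    # Append a period if the stripped text does not already end with
    # sentence-final punctuation -- the Python equivalent of the
    # Perl-style check suggested above, widened to "!" and "?".
    text = text.strip()
    if not re.search(r"[.!?]$", text):
        return text + "."
    return text

print(ensure_terminal_punctuation("hello"))         # hello.
print(ensure_terminal_punctuation("hello there!"))  # hello there!
```

Such a step could be run on the `--text` argument before it reaches the model, so the decoder always sees a stop cue.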

erogol commented 2 years ago

You need to end with punctuation for most of the models, since they are trained on datasets in which the texts always end with punctuation.