coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Chinese text-to-speech voice does not end correctly. #1591

Closed: snowyu closed this issue 2 years ago

snowyu commented 2 years ago

Describe the bug

Chinese text-to-speech output does not end correctly: there are multiple spurious tails at the end of the audio:

https://user-images.githubusercontent.com/327887/169764716-027dbff6-dc62-4d65-b5ce-38e45b85fb48.mov

To Reproduce

tts --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST  --text "你好"

Expected behavior

https://user-images.githubusercontent.com/327887/169765160-a0f18ffc-c39c-499f-bf44-440a7145729f.mov

Logs

tts --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST  --text "你好"
 > tts_models/zh-CN/baker/tacotron2-DDC-GST is already downloaded.
 > Using model: tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/riceball/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 2
 > Using Griffin-Lim as no vocoder model defined
 > Text: 你好
 > Text splitted to sentences.
['你好']
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
DEBUG:jieba:Loading model from cache /tmp/jieba.cache
Loading model cost 0.298 seconds.
DEBUG:jieba:Loading model cost 0.298 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 4.163171768188477
 > Real-time factor: 0.34543747926032536
 > Saving output to tts_output.wav

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce GTX 980 Ti"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#31-Ubuntu SMP Thu May 5 10:00:34 UTC 2022"
    }
}

Additional context

No response

WeberJulian commented 2 years ago

Hey, this is not a bug. It is related to the dataset and to how Tacotron works. If you want the model to be stable with short inputs, your training data should contain such examples.

Btw this works: tts --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST --text "你好。"

https://user-images.githubusercontent.com/17219561/169831972-ac1eb6e4-4854-4fff-ba4a-62abdec905af.mov
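A side note for the punctuation detail the next comments turn on: the half-width and full-width marks are distinct Unicode characters, which is presumably why only the full-width form ends the utterance cleanly with this model. A quick check in Python:

# Half-width and full-width sentence-ending punctuation are different code points.
for ch in ".。!！?？":
    print(f"{ch!r} -> U+{ord(ch):04X}")
# '.' -> U+002E   '。' -> U+3002
# '!' -> U+0021   '！' -> U+FF01
# '?' -> U+003F   '？' -> U+FF1F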

snowyu commented 2 years ago

@WeberJulian Thanks a lot, but why doesn't the half-width punctuation in "你好." work?

WeberJulian commented 2 years ago

No idea, sorry. I don't speak Chinese.

snowyu commented 2 years ago

@WeberJulian Then how do I add the ".!?" characters to the stop characters?

WeberJulian commented 2 years ago

Not sure what you mean, sorry

erogol commented 2 years ago

> @WeberJulian Then how do I add the ".!?" characters to the stop characters?

You can write your own text processor and preprocess the text before passing it to 🐸TTS. However, that requires a bit of coding.
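Following that suggestion, here is a minimal sketch of such a preprocessor. It is not part of 🐸TTS: the function names and the exact punctuation mapping are made up for illustration, and it simply shells out to the same tts command used above after normalizing half-width sentence-ending punctuation to the full-width forms that worked earlier in this thread.

import subprocess

# Half-width to full-width punctuation mapping (assumed set; extend as needed).
_HALF_TO_FULL = {".": "。", "!": "！", "?": "？", ",": "，"}

def normalize_zh_punctuation(text: str) -> str:
    # Hypothetical preprocessor: swap half-width punctuation for full-width
    # and make sure the text ends with a full-width full stop.
    for half, full in _HALF_TO_FULL.items():
        text = text.replace(half, full)
    if not text.endswith(("。", "！", "？")):
        text += "。"
    return text

def synthesize(text: str, out_path: str = "tts_output.wav") -> None:
    # Run the stock tts CLI on the preprocessed text.
    subprocess.run(
        [
            "tts",
            "--model_name", "tts_models/zh-CN/baker/tacotron2-DDC-GST",
            "--text", normalize_zh_punctuation(text),
            "--out_path", out_path,
        ],
        check=True,
    )

if __name__ == "__main__":
    synthesize("你好")  # synthesized as "你好。", which ends cleanly per the workaround above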