coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Generated WAV garbled after 40 seconds #2078

Closed Manamama closed 2 years ago

Manamama commented 2 years ago

Describe the bug

Generating longer phrases seems to garble the output after roughly the 30-second mark. The resulting files have random lengths: one was 60 seconds long, while the attached one lasts only 40 seconds.

My box:

Python 3.8.10
OS: Ubuntu 20.04.5 LTS x86_64 
Kernel: 5.4.0-125-generic 
CPU: Intel i5-8265U (8) @ 3.900GHz 
GPU: Intel UHD Graphics 620 
Memory: 5598MiB / 7843MiB 

To Reproduce

tts --text "Nowhere in those kerosene years could she find a soft-headed match. The wife crossed over an ocean, red-faced and cheerless. She traded the flat pad of a stethoscope for a dining hall spatula. Life is two choices, she thinks: you hatch a life, or you pass through one. Photographs of a child swaddled in layers arrived by post. Money didn’t, to her embarrassment." --out_path test.wav && mplayer test.wav

Attachments: test1.wav.tar.gz, test3.wav.tar.gz

Expected behavior

Non-garbled WAV file

Logs

/usr/lib/python3/dist-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: Nowhere in those kerosene years could she find a soft-headed match. The wife crossed over an ocean, red-faced and cheerless. She traded the flat pad of a stethoscope for a dining hall spatula. Life is two choices, she thinks: you hatch a life, or you pass through one. Photographs of a child swaddled in layers arrived by post. Money didn’t, to her embarrassment.
 > Text splitted to sentences.
['Nowhere in those kerosene years could she find a soft-headed match.', 'The wife crossed over an ocean, red-faced and cheerless.', 'She traded the flat pad of a stethoscope for a dining hall spatula.', 'Life is two choices, she thinks: you hatch a life, or you pass through one.', 'Photographs of a child swaddled in layers arrived by post.', 'Money didn’t, to her embarrassment.']
money didn’t, to her embarrassment.
 [!] Character '’' not found in the vocabulary. Discarding it.
   > Decoder stopped with `max_decoder_steps` 10000
 > Processing time: 113.27926182746887
 > Real-time factor: 0.7940720796039662
 > Saving output to test.wav

### Environment

```shell
{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.1+cu102",
        "TTS": "0.8.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.10",
        "version": "#141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022"
    }
}
```

### Additional context

The generated WAV files have random lengths.

thorstenMueller commented 2 years ago

Imho this isn't a bug. You could increase the max_decoder_steps in config.json, but this might lead to problems synthesizing shorter phrases. Or split your text and run synthesizing for multiple shorter phrases.
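For reference, a minimal sketch of that workaround, assuming the model was fetched by the `tts` CLI into the default Coqui cache (on Linux typically `~/.local/share/tts/<model-name>/config.json`) and that `max_decoder_steps` sits at the top level of that file — both the path and the key location should be verified against your local `config.json`:

```python
import json
from pathlib import Path

# Assumed default cache location for the model used in this issue;
# adjust if your models were downloaded elsewhere.
CONFIG_PATH = Path.home() / ".local/share/tts/tts_models--en--ljspeech--tacotron2-DDC/config.json"

def bump_max_decoder_steps(config_path, new_steps=20000):
    """Raise max_decoder_steps in a model config.json and write it back.

    Returns (old_value, new_value) so you can see what was changed.
    """
    cfg = json.loads(Path(config_path).read_text())
    old = cfg.get("max_decoder_steps")  # 10000 in the log above
    cfg["max_decoder_steps"] = new_steps
    Path(config_path).write_text(json.dumps(cfg, indent=4))
    return old, new_steps
```

As noted above, raising this limit lets the decoder run longer on long inputs, but it can also let attention failures on short phrases run on instead of being cut off.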

Manamama commented 2 years ago

Thanks.

  1. I could not find max_decoder_steps in the installed package's configs: it is neither in ~/.local/lib/python3.8/site-packages/TTS/server/conf.json nor under ~/.local/lib/python3.8/site-packages/TTS/tts. A quick look through the JSONs in the original cloned repository did not turn it up either, and it also seems to be missing from the tts command-line parameters.

-> Where is that option located? (Do bear with me here, as I am not a developer or programmer.)

  2. If "it's not a bug but a feature", shouldn't the input be split automatically (> Text splitted to sentences.) to accommodate for this?

  3. FYI, the garbled output also happens with other models, though not as often with other long sample texts; I have only tested 5 samples so far.
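The "split your text" suggestion can also be done entirely outside the tool. A hedged sketch: split on sentence boundaries with a naive regex (a stand-in for the model's own segmenter), run the documented `tts --text ... --out_path ...` CLI once per sentence, and join the parts with the stdlib wave module. The file and directory names here are made up for illustration:

```python
import re
import subprocess
import wave
from pathlib import Path

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def synthesize_in_chunks(text, out_path="combined.wav", workdir="tts_parts"):
    """Synthesize each sentence separately, then concatenate the WAVs.

    All parts come from the same model, so they share one sample format
    and can be joined by simply appending frames.
    """
    Path(workdir).mkdir(exist_ok=True)
    parts = []
    for i, sentence in enumerate(split_sentences(text)):
        part = Path(workdir) / f"part_{i:03d}.wav"
        subprocess.run(["tts", "--text", sentence, "--out_path", str(part)],
                       check=True)
        parts.append(part)
    with wave.open(out_path, "wb") as out:
        for i, part in enumerate(parts):
            with wave.open(str(part), "rb") as w:
                if i == 0:
                    out.setparams(w.getparams())
                out.writeframes(w.readframes(w.getnframes()))
```

Each invocation stays well under the decoder-step limit, at the cost of reloading the model per sentence; a long-running Python process using the package's synthesizer would avoid the reloads.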

Manamama commented 2 years ago

Another issue, though most probably due to limitations of the models themselves: the "Dearest Creature in Creation" poem is mispronounced, while e.g. Google's and Nuance's TTS engines handle it fine.

erogol commented 2 years ago

I'd suggest trying other English models; the default model can sometimes garble the output.

Comparing this open-source project with Google and Nuance is flattering, but they are hard to match :)

I'm moving this to Discussions as it is not a dev issue.