coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Custom model inference error [Unresolved] #3291

Closed · 78Alpha closed this issue 10 months ago

78Alpha commented 10 months ago

Describe the bug

Inference with a custom model fails for various reasons (unsupported language, inability to synthesize audio, unexpected paths, JSON errors).

To Reproduce

1. Finetune/train a model on the LJSpeech dataset.
2. Run `tts --text "Text for TTS" --model_path path/to/model --config_path path/to/config.json --out_path speech.wav --language en`
3. Errors: `Language None is not supported.` | `raise TypeError("Invalid file: {0!r}".format(self.name))`

Expected behavior

Produces a voice file with which to evaluate the model.

Logs

(coqui) alpha78@----------:/mnt/q/Utilities/CUDA/TTS/TTS/server$ tts --text "Text for TTS" --model_path ./tts_models/en/ljspeech/ --config_path ./tts_models/en/ljspeech/config.json --out_path speech.wav --language en
 > Using model: xtts
 > Text: Text for TTS
 > Text splitted to sentences.
['Text for TTS']
Traceback (most recent call last):
  File "/home/alpha78/anaconda3/envs/coqui/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/mnt/q/Utilities/CUDA/TTS/TTS/bin/synthesize.py", line 515, in main
    wav = synthesizer.tts(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/utils/synthesizer.py", line 374, in tts
    outputs = self.tts_model.synthesize(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 392, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 400, in inference_with_config
    "zh-cn" if language == "zh" else language in self.config.languages
AssertionError:  ❗ Language None is not supported. Supported languages are ['en', 'es', 'fr', 'de', 'it', 'pt', 'pl', 'tr', 'ru', 'nl', 'cs', 'ar', 'zh-cn', 'hu', 'ko', 'ja']
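The first failure can be illustrated with a minimal, self-contained sketch (not the actual Coqui code): when the CLI does not recognize the language flag, `language` stays `None` and fails the membership test against the supported-language list, producing the assertion above.

```python
# Simplified stand-in for the language guard in xtts.py (assumption based
# on the traceback, not a copy of the real implementation).
SUPPORTED_LANGUAGES = [
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja",
]

def check_language(language):
    # "zh" is mapped to "zh-cn" before the membership test.
    lang = "zh-cn" if language == "zh" else language
    assert lang in SUPPORTED_LANGUAGES, f"Language {language} is not supported."

check_language("en")       # a supported code passes silently
try:
    check_language(None)   # an unset language fails, as in the log
except AssertionError as e:
    print("AssertionError:", e)
```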

(coqui) alpha78@----------:/mnt/q/Utilities/CUDA/TTS/TTS/server$ tts --text "Text for TTS" --model_path ./tts_models/en/ljspeech/ --config_path ./tts_models/en/ljspeech/config.json --out_path speech.wav --language en
 > Using model: xtts
 > Text: Text for TTS
 > Text splitted to sentences.
['Text for TTS']
Traceback (most recent call last):
  File "/home/alpha78/anaconda3/envs/coqui/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/mnt/q/Utilities/CUDA/TTS/TTS/bin/synthesize.py", line 515, in main
    wav = synthesizer.tts(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/utils/synthesizer.py", line 374, in tts
    outputs = self.tts_model.synthesize(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 392, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 415, in inference_with_config
    return self.full_inference(text, ref_audio_path, language, **settings)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 476, in full_inference
    (gpt_cond_latent, speaker_embedding) = self.get_conditioning_latents(
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 351, in get_conditioning_latents
    audio = load_audio(file_path, load_sr)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 72, in load_audio
    audio, lsr = torchaudio.load(audiopath)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/utils.py", line 204, in load
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/soundfile.py", line 27, in load
    return soundfile_backend.load(uri, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/soundfile_backend.py", line 221, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/soundfile.py", line 1212, in _open
    raise TypeError("Invalid file: {0!r}".format(self.name))
TypeError: Invalid file: None
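The second failure has a different cause: XTTS conditions on a reference clip, and with no `--speaker_wav` supplied the path that eventually reaches soundfile is `None`, which soundfile rejects. A minimal sketch of this behaviour (an assumption drawn from the traceback, not the real loader):

```python
# Hypothetical stand-in for the audio-loading path: soundfile raises
# TypeError for anything that is not a path or file-like object.
def open_reference(audiopath):
    if not isinstance(audiopath, str):
        raise TypeError("Invalid file: {0!r}".format(audiopath))
    return audiopath

try:
    open_reference(None)   # no --speaker_wav means the path is None
except TypeError as e:
    print(type(e).__name__ + ":", e)
```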

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "TTS": "0.20.6",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.13",
        "version": "#1 SMP Thu Oct 5 21:02:42 UTC 2023"
    }
}

Additional context

The documentation pages show two different ways to run inference with the model; neither worked.

eginhard commented 10 months ago

Use --language_idx, not --language. I opened a PR to return a more useful error message.

78Alpha commented 10 months ago

Changed it to `--language_idx` with the same result. In addition, as a test (after the `--language_idx` run), I added a line in xtts.py to hard-code the language to "en", and it had the same result.

(coqui) alpha78@----------:/mnt/q/Utilities/CUDA/TTS/TTS/server$ tts --text "Text for TTS" --model_path ./tts_models/en/ljspeech/ --config_path ./tts_models/en/ljspeech/config.json --out_path speech.wav --language_idx en
 > Using model: xtts
 > Text: Text for TTS
 > Text splitted to sentences.
['Text for TTS']
Traceback (most recent call last):
  File "/home/alpha78/anaconda3/envs/coqui/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/mnt/q/Utilities/CUDA/TTS/TTS/bin/synthesize.py", line 515, in main
    wav = synthesizer.tts(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/utils/synthesizer.py", line 374, in tts
    outputs = self.tts_model.synthesize(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 392, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 415, in inference_with_config
    return self.full_inference(text, ref_audio_path, language, **settings)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 476, in full_inference
    (gpt_cond_latent, speaker_embedding) = self.get_conditioning_latents(
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 351, in get_conditioning_latents
    audio = load_audio(file_path, load_sr)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 72, in load_audio
    audio, lsr = torchaudio.load(audiopath)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/utils.py", line 204, in load
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/soundfile.py", line 27, in load
    return soundfile_backend.load(uri, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torchaudio/_backend/soundfile_backend.py", line 221, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/soundfile.py", line 1212, in _open
    raise TypeError("Invalid file: {0!r}".format(self.name))
TypeError: Invalid file: None

Adding a `--speaker_wav` leads to further trouble.

(coqui) alpha78@----------:/mnt/q/Utilities/CUDA/TTS/TTS/server$ tts --text "Text for TTS" --model_path ./tts_models/en/ljspeech/ --config_path ./tts_models/en/ljspeech/config.json --out_path speech.wav --language_idx en --speaker_wav ./PYRAv2Dataset_00001.wav
 > Using model: xtts
 > Text: Text for TTS
 > Text splitted to sentences.
['Text for TTS']
Traceback (most recent call last):
  File "/home/alpha78/anaconda3/envs/coqui/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/mnt/q/Utilities/CUDA/TTS/TTS/bin/synthesize.py", line 515, in main
    wav = synthesizer.tts(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/utils/synthesizer.py", line 374, in tts
    outputs = self.tts_model.synthesize(
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 392, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 415, in inference_with_config
    return self.full_inference(text, ref_audio_path, language, **settings)
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 484, in full_inference
    return self.inference(
  File "/home/alpha78/anaconda3/envs/coqui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/models/xtts.py", line 528, in inference
    text_tokens = torch.IntTensor(self.tokenizer.encode(sent, lang=language)).unsqueeze(0).to(self.device)
  File "/mnt/q/Utilities/CUDA/TTS/TTS/tts/layers/xtts/tokenizer.py", line 650, in encode
    return self.tokenizer.encode(txt).ids
AttributeError: 'NoneType' object has no attribute 'encode'
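A likely explanation for this last traceback (an assumption, not confirmed in this thread) is that the model directory is missing the tokenizer vocabulary: the inner BPE tokenizer stays `None` when no vocab.json is found next to the checkpoint, and calling `.encode()` on it then raises exactly this AttributeError. A quick hypothetical helper to check the directory contents:

```python
import os

def check_model_dir(model_dir):
    """Report which files are present in an XTTS checkpoint directory.

    The expected file list is an assumption inferred from the traceback,
    not official documentation.
    """
    expected = ["config.json", "vocab.json"]
    return {name: os.path.exists(os.path.join(model_dir, name))
            for name in expected}
```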

planetMatrix commented 9 months ago

@78Alpha Same exact error, after following the official Coqui fine-tuning video.

@eginhard I am already on the latest version. Please advise.