TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supports English, French, Korean, Chinese, and German; easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

German TTS with Thorsten dataset doesn't work #457

Closed mrezai closed 3 years ago

mrezai commented 3 years ago

Hi @monatis, the code below generates correct audio with the TensorFlowTTS copy from https://www.dropbox.com/s/ixsp0bxck2di4rs/TensorFlowtts.zip?dl=1 (taken from the Colab notebook), but not with the latest commit, f046e824f18b4c7b7db2b65705a6f98f09cf9b48. The output logs differ as well: with the code cloned from the repository, this warning appears: WARNING:tensorflow:Skipping loading of weights for layer encoder due to mismatch in shape ((149, 512) vs (156, 512)).



```python
import yaml
import numpy as np
import soundfile as sf
import tensorflow as tf
#import matplotlib.pyplot as plt

#import IPython.display as ipd

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# Load the Tacotron-2 text2mel model from the Thorsten checkpoint.
tacotron2_config = AutoConfig.from_pretrained('./examples/tacotron2/conf/tacotron2.v1.yaml')
tacotron2 = TFAutoModel.from_pretrained(
    config=tacotron2_config,
    pretrained_path="./thorsten-tacotron2.h5",
    name="tacotron2"
)

# Load the Multi-band MelGAN vocoder from the Thorsten checkpoint.
mb_melgan_config = AutoConfig.from_pretrained('./examples/multiband_melgan/conf/multiband_melgan.v1.yaml')
mb_melgan = TFAutoModel.from_pretrained(
    config=mb_melgan_config,
    pretrained_path="./thorsten-mbmelgan.h5",
    name="mb_melgan"
)

processor = AutoProcessor.from_pretrained(pretrained_path="./tensorflow_tts/processor/pretrained/thorsten_mapper.json")

def do_synthesis(input_text, text2mel_model, vocoder_model, text2mel_name, vocoder_name):
  input_ids = processor.text_to_sequence(input_text)

  # text2mel part
  if text2mel_name == "TACOTRON":
    _, mel_outputs, stop_token_prediction, alignment_history = text2mel_model.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32)
    )
  elif text2mel_name == "FASTSPEECH2":
    mel_before, mel_outputs, duration_outputs, _, _ = text2mel_model.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
  else:
    raise ValueError("Only TACOTRON and FASTSPEECH2 are supported for text2mel_name")

  # vocoder part
  if vocoder_name == "MB-MELGAN":
    # Tacotron-2 systematically generates noise at the end of the audio, so trim it.
    if text2mel_name == "TACOTRON":
      remove_end = 1024
    else:
      remove_end = 1
    audio = vocoder_model.inference(mel_outputs)[0, :-remove_end, 0]
  else:
    raise ValueError("Only MB-MELGAN is supported for vocoder_name")

  if text2mel_name == "TACOTRON":
    return mel_outputs.numpy(), alignment_history.numpy(), audio.numpy()
  else:
    return mel_outputs.numpy(), audio.numpy()

input_text = "Möchtest du das meiner Frau erklären? Nein? Ich auch nicht."

# Set up the attention window for Tacotron-2 (optional).
tacotron2.setup_window(win_front=5, win_back=5)

mels, alignment_history, audios = do_synthesis(input_text, tacotron2, mb_melgan, "TACOTRON", "MB-MELGAN")

#ipd.Audio(audios, rate=22050)
# Write the waveform as 16-bit PCM at 22050 Hz.
sf.write('./audio.wav', audios, 22050, "PCM_16")
```

monatis commented 3 years ago

Hi @mrezai, add `tacotron2_config.__dict__['vocab_size'] = 156` right after `tacotron2_config = AutoConfig.from_pretrained('./examples/tacotron2/conf/tacotron2.v1.yaml')`.

This is a known bug, and it will be fixed when my new training is complete.
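
The workaround above amounts to overwriting one attribute on the config object before the model is built. A minimal sketch of that step follows. The helper name `override_vocab_size` is hypothetical, and a `SimpleNamespace` stands in for the object returned by `AutoConfig.from_pretrained(...)` so that the snippet runs without TensorFlowTTS installed. The starting value 149 is taken from the warning in the report, on the assumption that it is the vocabulary size the current code builds.

```python
from types import SimpleNamespace


def override_vocab_size(config, vocab_size):
    """Overwrite vocab_size on a config object and return the same object.

    Equivalent to `config.__dict__['vocab_size'] = vocab_size`, the
    assignment suggested in the comment above.
    """
    vars(config)["vocab_size"] = vocab_size
    return config


# Stand-in for the real config object (an assumption for illustration only);
# 149 is the size reported for the freshly built model in the warning.
demo_config = SimpleNamespace(vocab_size=149)

# 156 is the value given in the comment above.
override_vocab_size(demo_config, 156)
print(demo_config.vocab_size)
```

In the reported script, the equivalent call would go between `AutoConfig.from_pretrained(...)` and `TFAutoModel.from_pretrained(...)`, so that the model is built with the overridden size before the weights are loaded.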

mrezai commented 3 years ago

Thanks, it works!