TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German; easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Question: How to actually use the trained model to generate speech? #622

Closed · tdymel closed this 3 years ago

tdymel commented 3 years ago

Hello! I have trained a model using the Tacotron2 example. After a while I stopped training and got a model checkpoint.

Now I would like to use this model to generate some speech from text. I'm aware that this may not produce good results yet, but before I commit further to this, I would like to have a proof of concept running.

I followed the notebooks and the examples in this repo, and this is what I came up with.

In test_create.py:
import yaml
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

import soundfile as sf

model_path = "examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-330.h5"
processor = AutoProcessor.from_pretrained(model_path)

input_text = "i love you so much."
input_ids = processor.text_to_sequence(input_text)

tacotron2 = TFAutoModel.from_pretrained(model_path)

# Tacotron2.inference takes (input_ids, input_lengths, speaker_ids)
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32)
)

sf.write('./audio_before.wav', decoder_output, 22050, "PCM_16")

It does not even load the model, so I assume I missed something major^^

Would appreciate any help!

Cheers!

dathudeptrai commented 3 years ago

@Geigerkind what is the bug?

tdymel commented 3 years ago

Hey @dathudeptrai! I don't think there is a bug; I think I am using your framework incorrectly. Can you provide an example of how to run a self-trained model and generate speech from text?

Cheers!

dathudeptrai commented 3 years ago

@Geigerkind https://github.com/TensorSpeech/TensorFlowTTS/tree/master/notebooks

tdymel commented 3 years ago

@dathudeptrai I followed that, but there is not a single example of how to generate speech from a self-trained model. At least I can't find one.

The code I posted uses the tacotron2 notebook as a template. But when I execute it, I can't load the model with processor = AutoProcessor.from_pretrained(model_path) or tacotron2 = TFAutoModel.from_pretrained(model_path); each seems to expect a different format.
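
For reference, the loading pattern that eventually worked (shown in full below) passes the YAML config and the character-mapper JSON explicitly rather than a bare .h5 path; a minimal sketch using the paths from this thread:

# A bare .h5 checkpoint path seems not to be enough for from_pretrained;
# pass the architecture config and the text mapper explicitly.
tac_config = AutoConfig.from_pretrained("examples/tacotron2/conf/tacotron2.v1.yaml")
tacotron2 = TFAutoModel.from_pretrained(model_path, config=tac_config, name="tacotron2")
processor = AutoProcessor.from_pretrained("tensorflow_tts/processor/pretrained/ljspeech_mapper.json")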

tdymel commented 3 years ago

Okay, I think I've gotten a little further now:

import yaml
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

import soundfile as sf

# Tacotron2: load the local checkpoint, passing the YAML config explicitly
model_path = "examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-330.h5"
model_config = "examples/tacotron2/conf/tacotron2.v1.yaml"
tac_config = AutoConfig.from_pretrained(model_config)
tacotron2 = TFAutoModel.from_pretrained(model_path, config=tac_config, name="tacotron2")

# Text processor: load the LJSpeech character mapper
model_processor = "tensorflow_tts/processor/pretrained/ljspeech_mapper.json"
processor = AutoProcessor.from_pretrained(model_processor)

input_text = "i love you so much."
input_ids = processor.text_to_sequence(input_text)

decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32)
)

sf.write('./audio_before.wav', decoder_output, 22050, "PCM_16")

Still getting this failure, though:

Traceback (most recent call last):
  File "/home/shino/hacking/tts/train_voice/TensorflowTTS/test_create.py", line 31, in <module>
    sf.write('./audio_before.wav', decoder_output, 22050, "PCM_16")
  File "/home/shino/.local/lib/python3.9/site-packages/soundfile.py", line 316, in write
    f.write(data)
  File "/home/shino/.local/lib/python3.9/site-packages/soundfile.py", line 992, in write
    written = self._array_io('write', data, len(data))
  File "/home/shino/.local/lib/python3.9/site-packages/soundfile.py", line 1306, in _array_io
    raise ValueError("Invalid shape: {0!r}".format(array.shape))
ValueError: Invalid shape: (1, 9, 80)
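
The shape in the error is the clue: Tacotron2 outputs mel spectrograms of shape (batch, mel_frames, n_mels), here (1, 9, 80), not audio samples, and soundfile.write only accepts 1-D mono or 2-D (frames, channels) arrays. A quick way to see the mismatch:

print(decoder_output.shape)  # (1, 9, 80) == (batch, mel_frames, n_mels)
# A vocoder is still needed to turn these 80-bin mel frames into samples.
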
tdymel commented 3 years ago

Okay, I figured out how it works:

import yaml
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

import soundfile as sf

# Tacotron2
model_path = "examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-330.h5"
model_config = "examples/tacotron2/conf/tacotron2.v1.yaml"
tac_config = AutoConfig.from_pretrained(model_config)
tacotron2 = TFAutoModel.from_pretrained(model_path, config=tac_config, name="tacotron2")

model_processor = "tensorflow_tts/processor/pretrained/ljspeech_mapper.json"
processor = AutoProcessor.from_pretrained(model_processor)

# Vocoder: Multi-band MelGAN (turns mel spectrograms into waveforms)
model_config_mel = "examples/multiband_melgan/conf/multiband_melgan.v1.yaml"
model_path_mel = "examples/multiband_melgan/exp/train.multiband_melgan.v1/checkpoints/generator-330.h5"
mb_config = AutoConfig.from_pretrained(model_config_mel)
mb_melgan = TFAutoModel.from_pretrained(model_path_mel, config=mb_config)

input_text = "i love you so much."
input_ids = processor.text_to_sequence(input_text)

decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32)
)

# drop the batch dim: (1, frames, 80) -> (frames, 80)
mel_outputs = tf.reshape(mel_outputs, [-1, 80]).numpy()
# melgan inference: re-add the batch dim, mel -> waveform (batch, samples, 1)
audio = mb_melgan.inference(mel_outputs[None, ...])

# save to file
sf.write("./audio_test.wav", audio[0, :, 0], 22050)

It seems you first train a Tacotron2 model and extract durations etc. from it. After that you can train a second-stage model on the generated files: the extracted durations are what FastSpeech needs, while multiband_melgan is the vocoder that turns mel spectrograms into audio. With a text-to-mel model and a vocoder together you can generate a wav file.
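
Wrapping the two stages into a helper makes the pipeline explicit. A minimal sketch reusing the tacotron2, mb_melgan, and processor objects loaded above (text_to_wav is just an illustrative name, not part of the library):

def text_to_wav(text, out_path, sample_rate=22050):
    """text -> mel spectrogram (Tacotron2) -> waveform (MB-MelGAN) -> wav file."""
    input_ids = processor.text_to_sequence(text)
    # stage 1: text to mel
    _, mel_outputs, _, _ = tacotron2.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32),
    )
    # stage 2: vocoder, mel -> waveform of shape (batch, samples, 1)
    mel = tf.reshape(mel_outputs, [-1, 80])[None, ...]
    audio = mb_melgan.inference(mel)
    sf.write(out_path, audio[0, :, 0], sample_rate)

text_to_wav("i love you so much.", "./audio_test.wav")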

In my test with only 330 training steps I got nothing usable, but maybe I will get better results as training progresses.