Closed tdymel closed 3 years ago
@Geigerkind what is the bug?
Hey @dathudeptrai! I don't think there is a bug; I think I am running your framework incorrectly. Can you provide an example of how to load a self-trained model and generate speech from text?
Cheers!
@dathudeptrai I followed that, but there is not a single example of how to generate speech from a self-trained model. At least I can't find one.
The code I posted used the tacotron2 notebook as a template.
But with my own checkpoint, I can't load the processor like this
processor = AutoProcessor.from_pretrained(model_path)
or the model like this
tacotron2 = TFAutoModel.from_pretrained(model_path)
as it seems to expect another format.
Okay, I think I got a little further now:
import yaml
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor
import soundfile as sf
model_path = "examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-330.h5"
model_config = "examples/tacotron2/conf/tacotron2.v1.yaml"
tac_config = AutoConfig.from_pretrained(model_config)
tacotron2 = TFAutoModel.from_pretrained(model_path, config=tac_config, name="tacotron2")
model_processor = "tensorflow_tts/processor/pretrained/ljspeech_mapper.json"
processor = AutoProcessor.from_pretrained(model_processor)
input_text = "i love you so much."
input_ids = processor.text_to_sequence(input_text)
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    tf.convert_to_tensor([len(input_ids)], tf.int32),
    tf.convert_to_tensor([0], dtype=tf.int32)
)
sf.write('./audio_before.wav', decoder_output, 22050, "PCM_16")
Still getting this failure though:
Traceback (most recent call last):
File "/home/shino/hacking/tts/train_voice/TensorflowTTS/test_create.py", line 31, in <module>
sf.write('./audio_before.wav', decoder_output, 22050, "PCM_16")
File "/home/shino/.local/lib/python3.9/site-packages/soundfile.py", line 316, in write
f.write(data)
File "/home/shino/.local/lib/python3.9/site-packages/soundfile.py", line 992, in write
written = self._array_io('write', data, len(data))
File "/home/shino/.local/lib/python3.9/site-packages/soundfile.py", line 1306, in _array_io
raise ValueError("Invalid shape: {0!r}".format(array.shape))
ValueError: Invalid shape: (1, 9, 80)
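The error comes from passing a 3-D array to soundfile, which only writes 1-D (mono) or 2-D (frames, channels) data. A minimal sketch of that rule follows, using plain shape tuples so it runs without the model. Reading (1, 9, 80) as (batch, frames, mel bins) is an assumption based on the shape in the traceback, and `check_soundfile_shape` and the example vocoder shape are hypothetical, for illustration only.

```python
def check_soundfile_shape(shape):
    """Classify an array shape by the rule soundfile's writer applies."""
    if len(shape) == 1:
        return "mono: (frames,)"
    if len(shape) == 2:
        return "multi-channel: (frames, channels)"
    return "invalid: not 1-D or 2-D"


# Shape from the traceback; (batch, frames, mel bins) is an assumed reading.
mel_shape = (1, 9, 80)
print(check_soundfile_shape(mel_shape))

# Hypothetical vocoder output of shape (batch, samples, 1). Indexing it with
# [0, :, 0] drops the batch and channel axes and leaves (samples,).
vocoder_shape = (1, 2304, 1)
waveform_shape = (vocoder_shape[1],)
print(check_soundfile_shape(waveform_shape))
```

So the Tacotron2 output is most likely a mel spectrogram rather than a waveform, and it still needs a vocoder before it can be written to a wav file.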
Okay, I figured out how it works:
import yaml
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor
import soundfile as sf
# Tacotron2
model_path = "examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-330.h5"
model_config = "examples/tacotron2/conf/tacotron2.v1.yaml"
tac_config = AutoConfig.from_pretrained(model_config)
tacotron2 = TFAutoModel.from_pretrained(model_path, config=tac_config, name="tacotron2")
model_processor = "tensorflow_tts/processor/pretrained/ljspeech_mapper.json"
processor = AutoProcessor.from_pretrained(model_processor)
# Vocoder (Multi-band MelGAN): converts the mel spectrogram to a waveform
model_config_mel = "examples/multiband_melgan/conf/multiband_melgan.v1.yaml"
model_path_mel = "examples/multiband_melgan/exp/train.multiband_melgan.v1/checkpoints/generator-330.h5"
mb_config = AutoConfig.from_pretrained(model_config_mel)
mb_melgan = TFAutoModel.from_pretrained(model_path_mel, config=mb_config)
input_text = "i love you so much."
input_ids = processor.text_to_sequence(input_text)
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    tf.convert_to_tensor([len(input_ids)], tf.int32),
    tf.convert_to_tensor([0], dtype=tf.int32)
)
mel_outputs = tf.reshape(mel_outputs, [-1, 80]).numpy()
# melgan inference
audio = mb_melgan.inference(mel_outputs[None, ...])
# save to file
sf.write("./audio_test.wav", audio[0, :, 0], 22050)
As far as I can tell, you first train a Tacotron2 model, from which you can also extract durations etc. (those are used for training fastspeech). After that you train a vocoder ("generator") such as multiband_melgan, which turns the mel spectrogram into audio. With these two models you can generate a wav file.
In my test with 330 steps I got nothing, but maybe I will get better results as I progress.
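One possible reason for getting "nothing" is that the model stops after only a few mel frames, as the (1, 9, 80) shape in the earlier traceback suggests. A rough sketch for estimating how much audio a given number of frames yields is below. The sample rate of 22050 is taken from the `sf.write` call above; the hop size of 256 is an assumption and should be checked against the preprocessing config actually used. `mel_frames_to_seconds` is a hypothetical helper.

```python
def mel_frames_to_seconds(num_frames, hop_size=256, sample_rate=22050):
    """Approximate audio duration produced by a mel spectrogram.

    hop_size=256 is an assumed value; use the one from your own config.
    """
    return num_frames * hop_size / sample_rate


# 9 frames, as in the earlier traceback shape (1, 9, 80)
print(round(mel_frames_to_seconds(9), 3))  # 0.104
```

At roughly a tenth of a second, such a clip would be too short to contain audible speech, which would fit an undertrained checkpoint.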
Hello! I have trained a model using the Tacotron2 example. After a while I stopped the training and got a model at a checkpoint.
Now I would like to use this model to generate some speech from text. I'm aware that this may not produce good results, but before I commit further to this, I would like to have a proof of concept running.
I followed the notebooks and the examples in this repo and this is what I came up with.
It does not even load the model, so I assume I missed something major^^
I would appreciate any help!
Cheers!