@samuel-lunii can you share the script you used for inference?
Yes, here it is :
import yaml
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor
import IPython.display as ipd
import soundfile as sf
import time
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--make_models', type=bool, default = False)
parser.add_argument('--input_text', type=str, default = "I really love you my friend.")
parser.add_argument('--output_name', type = str, default = "tacotron2_synth")
parser.add_argument('--processor', type = str, default = "./processor/pretrained/synpaflex_mapper.json")
parser.add_argument('--tacotron_config', type = str, default = "./examples/tacotron2/conf/tacotron2.synpaflex.v1.yaml")
parser.add_argument('--tacotron_weigths', type = str, default = "./examples/tacotron2/checkpoints/synpaflex/model-52000.h5")
parser.add_argument('--mbmelgan_config', type = str, default = "./examples/multiband_melgan/conf/multiband_melgan.v1.yaml")
parser.add_argument('--mbmelgan_weigths', type = str, default = "./examples/multiband_melgan/checkpoints/synpaflex/generator-320000.h5")
args = parser.parse_args()
input_text = args.input_text
processor = AutoProcessor.from_pretrained(args.processor)
config = AutoConfig.from_pretrained(args.tacotron_config)
if args.make_models:
    # Configure Tacotron
    fake_input_text = "i love you so much."  # (unused in this script)
    input_ids = processor.text_to_sequence(input_text)
    input_ids = input_ids[0:-1]
    tacotron2 = TFAutoModel.from_pretrained(
        config=config,
        pretrained_path=None,
        is_build=True,
        name="tacotron2"
    )
    tacotron2.setup_window(win_front=6, win_back=6)
    tacotron2.setup_maximum_iterations(3000)
    # Save to Pb
    input_lengths = tf.convert_to_tensor([len(input_ids)], tf.int32)
    input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)
    speaker_ids = tf.convert_to_tensor([0], dtype=tf.int32)
    decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
        input_ids=input_ids,
        input_lengths=input_lengths,
        speaker_ids=speaker_ids,
    )
    tacotron2.load_weights(args.tacotron_weigths)
    # save model into pb and do inference. Note that signatures should be a tf.function with input_signatures.
    tf.saved_model.save(tacotron2, "./model_tacotron", signatures=tacotron2.inference)

    # Configure MB-MelGAN
    melgan_config = AutoConfig.from_pretrained(args.mbmelgan_config)
    mb_melgan = TFAutoModel.from_pretrained(
        config=melgan_config,
        pretrained_path=None,
        is_build=False,  # don't build model if you want to save it to pb. (TF related bug)
        name="mb_melgan"
    )
    fake_mels = tf.random.uniform(shape=[4, 256, 80], dtype=tf.float32)
    audios = mb_melgan.inference(fake_mels)
    mb_melgan.load_weights(args.mbmelgan_weigths)
    # Save to Pb
    tf.saved_model.save(mb_melgan, "./model_mbmelgan", signatures=mb_melgan.inference)
#Synthesis
output_name = args.output_name
tacotron2 = tf.saved_model.load("./model_tacotron")
input_ids = processor.text_to_sequence(input_text)
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
tf.convert_to_tensor([len(input_ids)], tf.int32),
tf.convert_to_tensor([0], dtype=tf.int32)
)
mel_outputs = tf.reshape(mel_outputs, [-1, 80]).numpy()
mb_melgan = tf.saved_model.load("./model_mbmelgan")
audios = mb_melgan.inference(mel_outputs[None, ...])
sf.write("./outputs/" + output_name + ".wav", audios[0, :, 0], 22050)
@samuel-lunii can you pull the newest code and run it again? There was a bug in TFAutoModel.from_pretrained yesterday :D
Ah ok :) I pulled the newest code and ran this script :
import numpy as np
import soundfile as sf
import yaml
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor
from tensorflow_tts.inference import AutoConfig
# initialize tacotron2 model.
tac2_config = AutoConfig.from_pretrained("./examples/tacotron2/conf/tacotron2.synpaflex.v1.yaml")
tacotron2 = TFAutoModel.from_pretrained("./examples/tacotron2/checkpoints/synpaflex/model-52000.h5",
config=tac2_config, name="tacotron2")
# initialize mb_melgan model
mb_config = AutoConfig.from_pretrained("./examples/multiband_melgan/conf/multiband_melgan.v1.yaml")
mb_melgan = TFAutoModel.from_pretrained("./examples/multiband_melgan/checkpoints/synpaflex/generator-320000.h5",
config=mb_config)
# inference
processor = AutoProcessor.from_pretrained("./tensorflow_tts/processor/pretrained/synpaflex_mapper.json")
ids = processor.text_to_sequence("Voici le texte que j'avais envie d'écrire pour tester le système de synthèse.")
# tacotron2 inference
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
input_ids=tf.expand_dims(tf.convert_to_tensor(ids, dtype=tf.int32), 0),
input_lengths=tf.expand_dims(tf.convert_to_tensor(len(ids), dtype=tf.int32), 0),
speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)
mel_outputs = tf.reshape(mel_outputs, [-1, 80]).numpy()
# melgan inference
audio = mb_melgan.inference(mel_outputs[None, ...])
# save to file
sf.write("./audio_test_fr.wav", audio[0, :, 0], 22050)
Still getting random mel outputs...
@samuel-lunii can you check whether the problem is in the text2mel model or the melgan model?
I think it comes from the text2mel model, because I get different results each time I print(mel_outputs) and plot(mel_outputs) :
first time :
[[-1.1685759 -1.2047745 -1.3080378 ... -1.166595 -1.1850848
-1.1887721 ]
[-0.9884827 -1.0476902 -1.1208555 ... -1.1260802 -1.1324888
-1.1231279 ]
[-0.9577022 -1.0271982 -1.0529574 ... -1.132981 -1.1573727
-1.148672 ]
...
[-0.39674753 -0.66073245 -0.78027946 ... -0.6853353 -0.67043
-0.6731563 ]
[-0.44452816 -0.6084268 -0.6490945 ... -0.5704105 -0.5551065
-0.5707133 ]
[-0.14123309 -0.24117121 -0.237127 ... -0.21044426 -0.20983121
-0.23627539]]
Second time :
[[-1.2705129 -1.3017569 -1.4279091 ... -1.1590765 -1.1668264
-1.1523571 ]
[-1.1365231 -1.1602129 -1.246326 ... -1.0838338 -1.055539
-1.0570184 ]
[-1.0327929 -1.0753589 -1.1244559 ... -1.0430999 -1.0398345
-1.0334009 ]
...
[-0.21869813 -0.4889331 -0.5723229 ... -0.3848772 -0.44142717
-0.4273518 ]
[-0.20579815 -0.43203712 -0.48528364 ... -0.2880336 -0.33902803
-0.35105032]
[-0.10610583 -0.24984945 -0.28069073 ... -0.12415372 -0.18096048
-0.19688737]]
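For reference, a minimal way to quantify this run-to-run difference (a sketch, assuming the tacotron2 and processor objects loaded in the script above; the example sentence is arbitrary):

import numpy as np
import tensorflow as tf

# Run the same text2mel inference twice on identical inputs.
ids = processor.text_to_sequence("Bonjour tout le monde.")
inputs = dict(
    input_ids=tf.expand_dims(tf.convert_to_tensor(ids, dtype=tf.int32), 0),
    input_lengths=tf.convert_to_tensor([len(ids)], tf.int32),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)
_, mel_a, _, _ = tacotron2.inference(**inputs)
_, mel_b, _, _ = tacotron2.inference(**inputs)

# The decoder may stop at a different frame on each run, so compare on the overlap.
n = min(mel_a.shape[1], mel_b.shape[1])
diff = np.max(np.abs(mel_a.numpy()[:, :n] - mel_b.numpy()[:, :n]))
print(f"frames: {mel_a.shape[1]} vs {mel_b.shape[1]}, max abs diff on overlap: {diff:.4f}")

A deterministic model should give the same frame count and a difference of essentially zero.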
Note that even the duration is different...
@samuel-lunii can you try disabling dropout on the prenet (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/tacotron2.py#L477) ?
mmmmh, the randomness is removed (exact same results each time I print(mel_outputs)) but the result is not speech anymore :
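(For reference, a way to try this without editing tacotron2.py, as a hedged sketch: it assumes the prenet dropout is a tf.keras.layers.Dropout called with training=True, as the linked line suggests, so forcing training=False alone is not enough; zeroing the rate of every Dropout sublayer after load_weights has the same effect.)

import tensorflow as tf

def zero_all_dropout(model):
    # Set the rate of every Dropout sublayer to 0, which disables it at call time.
    for sub in model.submodules:
        if isinstance(sub, tf.keras.layers.Dropout):
            sub.rate = 0.0

zero_all_dropout(tacotron2)  # apply after load_weights, before inference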
@samuel-lunii try using the original Model and Config classes as in https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/tacotron2/train_tacotron2.py#L458-L460, then load_weights as normal :D.
Here is what I did :
# The two classes below need to be imported directly for this approach
# (as in train_tacotron2.py; import paths as used elsewhere in the repo):
from tensorflow_tts.configs import Tacotron2Config
from tensorflow_tts.models import TFTacotron2

# initialize tacotron2 model.
tac2_config = Tacotron2Config(dataset = "synpaflex", n_conv_encoder = 5, n_speakers = 1,
encoder_conv_activation = 'relu', reduction_factor = 1, prenet_activation="relu")
tacotron2 = TFTacotron2(config=tac2_config, name="tacotron2")
tacotron2._build()
tacotron2.load_weights("./examples/tacotron2/checkpoints/synpaflex/model-74000.h5")
Still getting random results when dropout is active ! ><
@samuel-lunii teacher forcing is still ok ?
@samuel-lunii
the output is different even if the sentence is kept the exact same
That's a feature of Tacotron 2, one input, infinite possibilities which makes it ideal for emotional voices.
The audio signals sound like a french version of the WaveNet examples where no text has been provided during training
Can you print the input_ids and see what is passed into the model?
@samuel-lunii let's check if your preprocessed input_ids are correct or not :))).
@ZDisket
That's a feature of Tacotron 2, one input, infinite possibilities which makes it ideal for emotional voices.
Ah ok ! good to know :)
Can you print the input_ids and see what is passed into the model?
So here is the output of print(ids) in the above script :
[60, 53, 47, 41, 47, 12, 50, 43, 12, 58, 43, 62, 58, 43, 12, 55, 59, 43, 12, 48, 3, 39, 60, 39, 47, 57, 12, 43, 52, 60, 47, 43, 12, 42, 3, 65, 41, 56, 47, 56, 43, 12, 54, 53, 59, 56, 12, 58, 43, 57, 58, 43, 56, 12, 50, 43, 12, 57, 63, 57, 58, 66, 51, 43, 12, 42, 43, 12, 57, 63, 52, 58, 46, 66, 57, 43, 8, 83]
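For reference, a quick round-trip check (a sketch, reusing the processor loaded above; symbol_to_id is the same table the processor itself uses in _symbols_to_sequence):

# Invert the processor's symbol table and decode the ids back to symbols.
id_to_symbol = {v: k for k, v in processor.symbol_to_id.items()}
print("".join(id_to_symbol[i] for i in ids))
# The last id should decode to the "eos" symbol appended by text_to_sequence().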
@dathudeptrai
let's check if your preprocessed input_ids are correct or not :))).
Here is the synpaflex.py processor I used for preprocessing :
import os
import re
import numpy as np
import soundfile as sf
from dataclasses import dataclass
from tensorflow_tts.processor import BaseProcessor
from tensorflow_tts.utils import cleaners
# PROCESSOR_FILE_NAME was not imported in the snippet as pasted; in the repo it
# lives in tensorflow_tts.utils.utils and is used by save_pretrained below.
from tensorflow_tts.utils.utils import PROCESSOR_FILE_NAME

_pad = "pad"
_eos = "eos"
_punctuation = "!/\'(),-.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzéèàùâêîôûçäëïöüÿœæ"

# Regex for curly-brace (ARPAbet) spans, used by text_to_sequence below; it was
# missing from the snippet as pasted (same regex as the LJSpeech processor):
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")

# Export all symbols:
SYNPAFLEX_SYMBOLS = (
    [_pad] + list(_punctuation) + list(_letters) + [_eos]
)
@dataclass
class SynpaflexProcessor(BaseProcessor):
    """SynPaFlex processor."""

    cleaner_names: str = "basic_cleaners"
    positions = {
        "wave_file": 0,
        "text": 1,
    }
    train_f_name: str = "synpaflex.txt"

    def create_items(self):
        if self.data_dir:
            with open(
                os.path.join(self.data_dir, self.train_f_name), encoding="utf-8"
            ) as f:
                self.items = [self.split_line(self.data_dir, line, "|") for line in f]

    def split_line(self, data_dir, line, split):
        parts = line.strip().split(split)
        wave_file = parts[self.positions["wave_file"]]
        text = parts[self.positions["text"]]
        wav_path = os.path.join(data_dir, "wavs", f"{wave_file}.wav")
        speaker_name = "synpaflex"
        return text, wav_path, speaker_name

    def setup_eos_token(self):
        return _eos

    def get_one_sample(self, item):
        text, wav_path, speaker_name = item

        # normalize audio signal to be [-1, 1], soundfile already norm.
        audio, rate = sf.read(wav_path)
        audio = audio.astype(np.float32)

        # convert text to ids
        text_ids = np.asarray(self.text_to_sequence(text), np.int32)

        sample = {
            "raw_text": text,
            "text_ids": text_ids,
            "audio": audio,
            "utt_id": os.path.split(wav_path)[-1].split(".")[0],
            "speaker_name": speaker_name,
            "rate": rate,
        }
        return sample

    def text_to_sequence(self, text):
        sequence = []
        # Check for curly braces and treat their contents as ARPAbet:
        while len(text):
            m = _curly_re.match(text)
            if not m:
                sequence += self._symbols_to_sequence(
                    self._clean_text(text, [self.cleaner_names])
                )
                break
            sequence += self._symbols_to_sequence(
                self._clean_text(m.group(1), [self.cleaner_names])
            )
            sequence += self._arpabet_to_sequence(m.group(2))
            text = m.group(3)

        # add eos tokens
        sequence += [self.eos_id]
        return sequence

    def _clean_text(self, text, cleaner_names):
        for name in cleaner_names:
            cleaner = getattr(cleaners, name)
            if not cleaner:
                raise Exception("Unknown cleaner: %s" % name)
            text = cleaner(text)
        return text

    def _symbols_to_sequence(self, symbols):
        return [self.symbol_to_id[s] for s in symbols if self._should_keep_symbol(s)]

    def _arpabet_to_sequence(self, text):
        return self._symbols_to_sequence(["@" + s for s in text.split()])

    def _should_keep_symbol(self, s):
        return s in self.symbol_to_id and s != "_" and s != "~"

    def save_pretrained(self, saved_path):
        os.makedirs(saved_path, exist_ok=True)
        self._save_mapper(os.path.join(saved_path, PROCESSOR_FILE_NAME), {})
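(For completeness, a quick way to exercise this processor through its saved mapper, reusing the mapper path from the inference scripts above. It also shows a pitfall worth keeping in mind: basic_cleaners only lowercases and collapses whitespace, so any character not in SYNPAFLEX_SYMBOLS, such as a digit, is silently dropped by _should_keep_symbol.)

from tensorflow_tts.inference import AutoProcessor

processor = AutoProcessor.from_pretrained("./tensorflow_tts/processor/pretrained/synpaflex_mapper.json")

# Digits are not in the symbol set, so they simply disappear from the sequence:
print(processor.text_to_sequence("il a 3 chats."))
print(processor.text_to_sequence("il a trois chats."))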
I took a XXX-ids.npy file from the preprocessed data and compared it to the result of the text_to_sequence() method above, with the same sentence as input, and it gave exactly the same result. I'm not sure that's what you meant though ?
@dathudeptrai
@samuel-lunii teacher forcing is still ok ?
I tried this :
text = "-- Ohé ! Pierre ralentit l'allure de ses chevaux, se retourna, et à la lueur du réverbère voisin, aperçut un individu, embossé dans un ample manteau, à collet relevé, et coiffé d'un casque [Note : tel est le nom que les Canadiens-Français donnent à leur coiffure d'hiver. Fin de la note.] en fourrure."
ids = processor.text_to_sequence(text)
# tacotron2 teacher forcing
input_data = np.load("./dump_synpaflex/valid/norm-feats/chevalier_filledupirate_010_009-norm-feats.npy")
decoder_outputs, forced_mel_outputs, stop_token_prediction, alignment_history = tacotron2(
input_ids = tf.expand_dims(tf.convert_to_tensor(np.array(ids), dtype=tf.int32), 0),
input_lengths = tf.expand_dims(tf.convert_to_tensor(len(ids), dtype=tf.int32), 0),
speaker_ids = tf.convert_to_tensor([0], dtype=tf.int32),
mel_gts = tf.expand_dims(tf.convert_to_tensor(input_data, dtype=tf.float32), 0),
mel_lengths = tf.convert_to_tensor([len(input_data)], dtype=tf.int32),  # per-example lengths, shape [batch]
)
And it also gave random speech.
@samuel-lunii can you give me ur alignment figure in training progress (evaluation) and in the real inference with and without teacher forcing in the same input ?
@dathudeptrai For now, here are the alignment figures with (top) and without (bottom) teacher forcing during inference :
And here is an alignment figure obtained during eval :
@samuel-lunii the alignment during evaluation seems valid :))). Maybe you can try using teacher forcing to extract durations and train FastSpeech2; if FS2 still has this problem, there must be a bug in the preprocessing steps :(.
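(For reference, a minimal sketch of the usual way to turn a teacher-forced alignment into per-symbol durations for FastSpeech2. The alignment orientation is an assumption; adjust the transpose if your shapes differ.)

import numpy as np

# alignment_history comes from the teacher-forced tacotron2(...) call above;
# after squeezing the batch axis it is assumed to be [frames, symbols] or [symbols, frames].
align = np.squeeze(alignment_history.numpy() if hasattr(alignment_history, "numpy") else alignment_history)
if align.shape[0] == len(ids):  # transpose if it is [symbols, frames]
    align = align.T
best_symbol = np.argmax(align, axis=1)                    # symbol attended by each mel frame
durations = np.bincount(best_symbol, minlength=len(ids))  # frames spent on each symbol
print(durations, "sum:", durations.sum(), "mel frames:", align.shape[0])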
@dathudeptrai So ! I struggle to extract durations; I will create a new issue about this.
Regarding the preprocessed data, I made a script to compare re-synthesized normalized mel spectrograms (norm-feats) to the corresponding processed text (ids) :
#[...] usual imports [...]
import sounddevice as sd
from tensorflow_tts.processor import synpaflex as synpaflex_processor
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--input_mel', type=str, default="./dump_synpaflex/train/norm-feats/hugo_miserables_02_30_068-norm-feats.npy")
args = parser.parse_args()
mb_config = AutoConfig.from_pretrained("./examples/multiband_melgan/conf/multiband_melgan.v1.yaml")
mb_melgan = TFAutoModel.from_pretrained(melgan_pretrained_path, config=mb_config)
# get ids from ids folder
input_ids_path = args.input_mel.replace("norm-feats","ids")
input_ids = np.load(input_ids_path)
# convert ids to text
processor = synpaflex_processor.SynpaflexProcessor
processor._load_mapper(processor, "./tensorflow_tts/processor/pretrained/synpaflex_mapper.json")
text = processor._sequence_to_symbols(processor, input_ids)
print("".join(text))
# get mel spectrogram from norm-feats folder
mel_outputs = np.load(args.input_mel)
audio = mb_melgan.inference(mel_outputs[None, ...])
# listen to audio
sd.play(audio[0, :, 0], 22050)
status = sd.wait()
Success : the audio output corresponds exactly to the result of print("".join(text)) :
comme le capitaine prononçait ces mots, un éclair illumina les ondes de l'atlantique, puis une détonation se fit entendre et deux boulets ramés balayèrent le pont de l'alcyon.eos
I guess this means there is no error in preprocessing, right ?
For info, here is a (wrong) result of the tacotron TTS using the exact same text as input.
Did you find a solution for the issue @samuel-lunii ?
@ihshareef Not yet, still investigating ! I will let everyone know when I find the solution.
Hi, did you solve this problem? @samuel-lunii I also used my custom data (about 40h) to train the tacotron2 model. The training process seems fine, but in the inference stage, various checkpoints (2k-58k) always output random speech for my test data.
Your solution would be greatly appreciated!
@samuel-lunii hi. I tested the performance using my latest checkpoint (76k), and the synthesized speech sounds good. Maybe you can train your model further and try again.
Same result on my dataset! I trained FastSpeech2 and Tacotron2 on the same data; Tacotron2 performed better on the validation set. I saved the mel_output numpy arrays at each eval step, and when I reloaded them they were good. But when I turn to inference, Tacotron2 randomly gives me chaotic audio (though it seems to contain words from my dataset). FastSpeech2 doesn't have this problem.
@dathudeptrai
I managed to extract durations and to train fastspeech2, and the audio output is also nonsense speech.
Have you got any idea of what I could be doing wrong in the preprocessing step ? The only difference I can see between SynPaFlex and LJSpeech is that the SynPaFlex sample rate is 44.1 kHz, so there is a resampling operation during preprocessing, because I set the sample rate to 22.05 kHz in my synpaflex_preprocess.yaml file. However, I think this should not be a problem.
@ttslr I trained the model up to 76k steps too, so I do not think further training would solve this problem.
@samuel-lunii can you share the duration values of a few samples here ? Just want to make sure the durations are valid :D
@dathudeptrai I think I might have found the source of the problem. I made a script that displays the ids and the boundaries calculated from the durations, and here is the result :
The waveform is displayed from the preprocessed wavs. We can see that the trimming is not performed properly, because there is a long silence at the beginning and at the end of the sentence. There is background noise on some samples of the dataset, so maybe the problem comes from the fact that I did not tune trim_threshold_in_db correctly in my synpaflex_processor.yaml file ?
Here are the duration values of the above sample :
[ 3 6 7 6 7 7 9 10 11 10 7 7 8 6 7 5 5 9 10 14 13 3 6 6
5 7 5 7 7 6 7 8 6 6 3 6 6 6 6 3 3 2 3 5 5 6 4]
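(A quick consistency check worth running on such durations, as a sketch assuming the dump layout used in the other scripts here; the durations folder name is an assumption.)

import numpy as np

utt = "hugo_miserables_02_30_068"  # illustrative utterance id from above
durs = np.load(f"./dump_synpaflex/train/durations/{utt}-durations.npy")  # folder name assumed
mel = np.load(f"./dump_synpaflex/train/norm-feats/{utt}-norm-feats.npy")
ids = np.load(f"./dump_synpaflex/train/ids/{utt}-ids.npy")

# len(durs) should equal len(ids), and the durations should sum to the number
# of mel frames (up to the reduction factor / end-of-sequence handling).
print(len(durs), len(ids), durs.sum(), mel.shape[0])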
Yes, it could be the source of your problem. You should tune trim_threshold_in_db in the config. I remember I added a note about this issue in the config file.
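(For reference, a small sketch for eyeballing that threshold on a noisy file before re-running preprocessing; it assumes the trimming is done with librosa.effects.trim, and the path and parameter values are illustrative.)

import librosa
import soundfile as sf

audio, sr = sf.read("path/to/a_noisy_sample.wav")  # illustrative path
for top_db in (20, 30, 40, 60):
    trimmed, _ = librosa.effects.trim(
        audio, top_db=top_db, frame_length=2048, hop_length=512
    )
    print(f"top_db={top_db}: {len(audio) / sr:.2f}s -> {len(trimmed) / sr:.2f}s")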
Hi, I also noticed this in the first round, and I trimmed all speech samples with trim_threshold_in_db=20, but the synthesized speech was still random when the training step was below 80k.
Note that the synthesized speech sounds stable after 80k in my experiments.
@ttslr Hi, seems you are an expert in this field :D. I saw you have a lot of papers about TTS :D
Thank you! I'm just an ordinary TTS researcher. :) 😄
@dathudeptrai @ttslr
An update : I trained Tacotron2 on the correctly trimmed dataset up to 86k steps, and I still get babbling speech, even with sentences copied from the training set as input.
My guess is that it may come from the multi-GPU training : the maximum sentence length in my dataset is 20 s, at a sample rate of 22.05 kHz, which does not allow me to train on a single GPU without OOM.
LJSpeech's maximum duration is about 10 s, so I was able to train it on a single GPU. I will try to reduce the maximum length of my dataset to 10 s in order to train on a single GPU, and I will let you know ! :)
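(For reference, a minimal sketch of that filtering step, assuming the synpaflex.txt | wavs/ layout from the processor above; the paths and the output file name are illustrative.)

import os
import soundfile as sf

data_dir = "./datasets/synpaflex"  # illustrative path
max_sec = 10.0
kept = []
with open(os.path.join(data_dir, "synpaflex.txt"), encoding="utf-8") as f:
    for line in f:
        wav_id = line.strip().split("|")[0]
        info = sf.info(os.path.join(data_dir, "wavs", f"{wav_id}.wav"))
        if info.frames / info.samplerate <= max_sec:
            kept.append(line)

with open(os.path.join(data_dir, "synpaflex_max10s.txt"), "w", encoding="utf-8") as f:
    f.writelines(kept)
print(f"kept {len(kept)} utterances of at most {max_sec}s")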
@dathudeptrai @ZDisket @ttslr
Thank you all for your help ! :) So the problem came from multi-GPU training. Single-GPU training gave much better results ! I still get a few pronunciation mistakes at 92k, but that might be because I only used 20 hours of speech.
Any idea of what happens during multi-GPU training ? Is the content of each sentence shuffled by accident across the GPUs, which would make the model learn incorrect speech ?
Good to know that it works now :D. Can you share the tensorboard of the multi-GPU training and of the single-GPU training ? It should not be a problem :(. I have tried multi-GPU many times and it is still OK; I think the problem is the global_batch_size. I think global_batch_size should be in the range 32-64 for training Tacotron2.
Multi-GPU :
Single-GPU :
Mmmh... I changed batch_size to 8 in the conf file because I used 4 GPUs. This should give global_batch_size = 32, according to this calculation in base_trainer.py.
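(The arithmetic in question, as a tiny sketch: batch_size in the conf file is per replica, so with data-parallel training the trainer multiplies it by the number of GPUs.)

# Effective (global) batch size with data-parallel training:
per_replica_batch_size = 8  # "batch_size" in the conf file
num_gpus = 4
global_batch_size = per_replica_batch_size * num_gpus
print(global_batch_size)  # 32, at the low end of the suggested 32-64 range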
@samuel-lunii hmm, maybe you should use batch_size = 16 or 32 on each GPU. But looking at the loss curves, both single-GPU and multi-GPU still look valid to me :D
But looking at the loss curves, both single-GPU and multi-GPU still look valid to me :D
I know ! Strange isn't it ? :)
@samuel-lunii hmm, maybe you should use batch_size = 16 or 32 on each GPU.
Yes, maybe... I think I already have good enough results with 1 GPU, so I will close the issue for now. I will let you know if I switch back to multi-GPU training.
@samuel-lunii please consider contributing a French model if you can :D
Of course ! I already sent you an e-mail about that.
Hi ! I have trained tacotron2 for 52k steps on the SynPaFlex french dataset. I deleted sentences longer than 20 seconds from the dataset and ended up with around 30 hours of single speaker data.
I made a custom synpaflex.py processor in ./tensorflow_tts/processor/ with these symbols (adapted to French, without ARPAbet). I used basic_cleaners for text cleaning.
In #182 the issue was similar, but the problem came from using tacotron2.v1.yaml as the configuration file. I am using my own tacotron2.synpaflex.v1.yaml for both training and inference.
During synthesis, the mel outputs are completely random : the output is different even if the sentence is kept the exact same. The audio signals sound like a French version of the WaveNet examples where no text has been provided during training, in the "Knowing What to Say" section of this page.
Here are my tensorboard results :
I must be doing something wrong somehow, as I have been able to train on LJSpeech successfully... Any idea ?