TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Tacotron2 produces random mel outputs during inference (french dataset) #581

Closed samuel-lunii closed 3 years ago

samuel-lunii commented 3 years ago

Hi! I have trained tacotron2 for 52k steps on the SynPaFlex French dataset. I deleted sentences longer than 20 seconds from the dataset and ended up with around 30 hours of single-speaker data.

I made a custom synpaflex.py processor in ./tensorflow_tts/processor/ with these symbols (adapted to French, without ARPAbet):

_pad = "pad"
_eos = "eos"
_punctuation = "!/\'(),-.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzéèàùâêîôûçäëïöüÿœæ"

# Export all symbols:
SYNPAFLEX_SYMBOLS = (
    [_pad] + list(_punctuation) + list(_letters) + [_eos]
)
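
For reference, each symbol's id is just its position in this list (a quick sketch, assuming the mapper assigns ids by enumeration, pad first and eos last):

symbol_to_id = {s: i for i, s in enumerate(SYNPAFLEX_SYMBOLS)}

print(symbol_to_id["pad"])  # 0
print(symbol_to_id[" "])    # 12, the last punctuation symbol
print(symbol_to_id["eos"])  # 83, the last id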

I used basic_cleaners for text cleaning.

In #182 the issue was similar, but the problem came from using tacotron2.v1.yaml as the configuration file. I am using my own tacotron2.synpaflex.v1.yaml for both training and inference.

During synthesis, the mel outputs are completely random: the output is different even if the sentence is kept exactly the same. The audio signals sound like a French version of the WaveNet examples where no text has been provided during training, in the "Knowing What to Say" section of this page.

Here are my tensorboard results: [TensorBoard screenshots]

I must be doing something wrong somewhere, as I have been able to train on LJSpeech successfully... Any ideas?

dathudeptrai commented 3 years ago

@samuel-lunii can you share the script you used for inference?

samuel-lunii commented 3 years ago

Yes, here it is :

import yaml
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

import IPython.display as ipd
import soundfile as sf
import time

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--make_models', action='store_true')  # boolean flag: pass --make_models to rebuild and save the models
parser.add_argument('--input_text', type=str, default = "I really love you my friend.")
parser.add_argument('--output_name', type = str, default = "tacotron2_synth")
parser.add_argument('--processor', type = str, default = "./processor/pretrained/synpaflex_mapper.json")
parser.add_argument('--tacotron_config', type = str, default = "./examples/tacotron2/conf/tacotron2.synpaflex.v1.yaml")
parser.add_argument('--tacotron_weights', type = str, default = "./examples/tacotron2/checkpoints/synpaflex/model-52000.h5")
parser.add_argument('--mbmelgan_config', type = str, default = "./examples/multiband_melgan/conf/multiband_melgan.v1.yaml")
parser.add_argument('--mbmelgan_weights', type = str, default = "./examples/multiband_melgan/checkpoints/synpaflex/generator-320000.h5")

args = parser.parse_args() 

input_text = args.input_text

processor = AutoProcessor.from_pretrained(args.processor)
config = AutoConfig.from_pretrained(args.tacotron_config)

if args.make_models:

    # Build Tacotron 2 and run a dummy inference once so its variables exist before loading weights
    input_ids = processor.text_to_sequence(input_text)
    input_ids = input_ids[0:-1]

    tacotron2 = TFAutoModel.from_pretrained(
        config=config, 
        pretrained_path=None,
        is_build=True,
        name="tacotron2"
    )
    tacotron2.setup_window(win_front=6, win_back=6)
    tacotron2.setup_maximum_iterations(3000)

    # Save to Pb
    input_lengths = tf.convert_to_tensor([len(input_ids)], tf.int32)
    input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)
    speaker_ids = tf.convert_to_tensor([0], dtype=tf.int32)

    decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
            input_ids=input_ids,
            input_lengths=input_lengths,
            speaker_ids=speaker_ids,
    )
    tacotron2.load_weights(args.tacotron_weights)
    # save model into pb and do inference. Note that signatures should be a tf.function with input_signatures.
    tf.saved_model.save(tacotron2, "./model_tacotron", signatures=tacotron2.inference)

    # Configure MB-MelGAN
    melgan_config = AutoConfig.from_pretrained(args.mbmelgan_config)
    mb_melgan = TFAutoModel.from_pretrained(
        config=melgan_config, 
        pretrained_path=None, 
        is_build=False, # don't build model if you want to save it to pb. (TF related bug)
        name="mb_melgan"
    )
    fake_mels = tf.random.uniform(shape=[4, 256, 80], dtype=tf.float32)
    audios = mb_melgan.inference(fake_mels)

    mb_melgan.load_weights(args.mbmelgan_weights)
    # Save to Pb
    tf.saved_model.save(mb_melgan, "./model_mbmelgan", signatures=mb_melgan.inference)

#Synthesis
output_name = args.output_name
tacotron2 = tf.saved_model.load("./model_tacotron")
input_ids = processor.text_to_sequence(input_text)

decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32)
)

mel_outputs = tf.reshape(mel_outputs, [-1, 80]).numpy()

mb_melgan = tf.saved_model.load("./model_mbmelgan")
audios = mb_melgan.inference(mel_outputs[None, ...])
sf.write("./outputs/" + output_name + ".wav", audios[0, :, 0], 22050)
dathudeptrai commented 3 years ago

@samuel-lunii can you pull the newest code and run it again? There was a bug in TFAutoModel.from_pretrained yesterday :D

samuel-lunii commented 3 years ago

Ah ok :) I pulled the newest code and ran this script :

import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor
from tensorflow_tts.inference import AutoConfig

# initialize tacotron2 model.
tac2_config = AutoConfig.from_pretrained("./examples/tacotron2/conf/tacotron2.synpaflex.v1.yaml")
tacotron2 = TFAutoModel.from_pretrained("./examples/tacotron2/checkpoints/synpaflex/model-52000.h5",
    config=tac2_config, name="tacotron2")

# initialize mb_melgan model
mb_config = AutoConfig.from_pretrained("./examples/multiband_melgan/conf/multiband_melgan.v1.yaml")
mb_melgan = TFAutoModel.from_pretrained("./examples/multiband_melgan/checkpoints/synpaflex/generator-320000.h5",
   config=mb_config)

# inference
processor = AutoProcessor.from_pretrained("./tensorflow_tts/processor/pretrained/synpaflex_mapper.json")
ids = processor.text_to_sequence("Voici le texte que j'avais envie d'écrire pour tester le système de synthèse.")
# tacotron2 inference
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(ids, dtype=tf.int32), 0),
    input_lengths=tf.expand_dims(tf.convert_to_tensor(len(ids), dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)
mel_outputs = tf.reshape(mel_outputs, [-1, 80]).numpy()
# melgan inference
audio = mb_melgan.inference(mel_outputs[None, ...])

# save to file
sf.write("./audio_test_fr.wav", audio[0, :, 0], 22050)

Still getting random mel outputs...

dathudeptrai commented 3 years ago

@samuel-lunii can you check whether the problem comes from the text2mel model or the MelGAN model?

samuel-lunii commented 3 years ago

I think it comes from the text2mel model, because I get different results each time I print(mel_outputs) and plot(mel_outputs):

First time:

[[-1.1685759  -1.2047745  -1.3080378  ... -1.166595   -1.1850848
  -1.1887721 ]
 [-0.9884827  -1.0476902  -1.1208555  ... -1.1260802  -1.1324888
  -1.1231279 ]
 [-0.9577022  -1.0271982  -1.0529574  ... -1.132981   -1.1573727
  -1.148672  ]
 ...
 [-0.39674753 -0.66073245 -0.78027946 ... -0.6853353  -0.67043
  -0.6731563 ]
 [-0.44452816 -0.6084268  -0.6490945  ... -0.5704105  -0.5551065
  -0.5707133 ]
 [-0.14123309 -0.24117121 -0.237127   ... -0.21044426 -0.20983121
  -0.23627539]]

[mel-spectrogram plot]

Second time:

[[-1.2705129  -1.3017569  -1.4279091  ... -1.1590765  -1.1668264
  -1.1523571 ]
 [-1.1365231  -1.1602129  -1.246326   ... -1.0838338  -1.055539
  -1.0570184 ]
 [-1.0327929  -1.0753589  -1.1244559  ... -1.0430999  -1.0398345
  -1.0334009 ]
 ...
 [-0.21869813 -0.4889331  -0.5723229  ... -0.3848772  -0.44142717
  -0.4273518 ]
 [-0.20579815 -0.43203712 -0.48528364 ... -0.2880336  -0.33902803
  -0.35105032]
 [-0.10610583 -0.24984945 -0.28069073 ... -0.12415372 -0.18096048
  -0.19688737]]

[mel-spectrogram plot]

Note that even the duration is different...

dathudeptrai commented 3 years ago

@samuel-lunii can you try disabling dropout in the prenet (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/tacotron2.py#L477)?
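
For context: in the standard Tacotron 2 recipe the decoder prenet keeps dropout active even at inference time, which is exactly why repeated runs give different mels. A minimal Keras-style sketch of that behaviour (just an illustration, not the repository's exact code):

import tensorflow as tf

class AlwaysDropoutPrenet(tf.keras.layers.Layer):
    """Two dense layers whose dropout stays on even at inference time."""

    def __init__(self, units=256, rate=0.5):
        super().__init__()
        self.dense_1 = tf.keras.layers.Dense(units, activation="relu")
        self.dense_2 = tf.keras.layers.Dense(units, activation="relu")
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x):
        # training=True forces dropout on, so every forward pass is stochastic
        x = self.dropout(self.dense_1(x), training=True)
        x = self.dropout(self.dense_2(x), training=True)
        return x

Note that a model trained with always-on prenet dropout often degrades when it is switched off, so turning it off is only meant as a diagnostic.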

samuel-lunii commented 3 years ago

Mmmmh, the randomness is removed (exact same results each time I print(mel_outputs)) but the result is not speech anymore: [mel-spectrogram plot]

dathudeptrai commented 3 years ago

@samuel-lunii try using the original Model and Config classes as in https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/tacotron2/train_tacotron2.py#L458-L460, then load_weights as normal :D.

samuel-lunii commented 3 years ago

Here is what I did:

# imports needed for this snippet (mirroring the training script)
from tensorflow_tts.configs import Tacotron2Config
from tensorflow_tts.models import TFTacotron2

# initialize tacotron2 model.
tac2_config = Tacotron2Config(dataset="synpaflex", n_conv_encoder=5, n_speakers=1,
    encoder_conv_activation="relu", reduction_factor=1, prenet_activation="relu")
tacotron2 = TFTacotron2(config=tac2_config, name="tacotron2")
tacotron2._build()
tacotron2.load_weights("./examples/tacotron2/checkpoints/synpaflex/model-74000.h5")

Still getting random results when dropout is active ! ><

dathudeptrai commented 3 years ago

@samuel-lunii is teacher forcing still OK?

ZDisket commented 3 years ago

@samuel-lunii

the output is different even if the sentence is kept exactly the same

That's a feature of Tacotron 2: one input, infinite possibilities, which makes it ideal for emotional voices.

The audio signals sound like a French version of the WaveNet examples where no text has been provided during training

Can you print the input_ids and see what is passed into the model?

dathudeptrai commented 3 years ago

@samuel-lunii let's check whether your preprocessed input_ids are correct or not :))).

samuel-lunii commented 3 years ago

@ZDisket

That's a feature of Tacotron 2: one input, infinite possibilities, which makes it ideal for emotional voices.

Ah ok ! good to know :)

Can you print the input_ids and see what is passed into the model?

So here is the output of print(ids) in the above script :

[60, 53, 47, 41, 47, 12, 50, 43, 12, 58, 43, 62, 58, 43, 12, 55, 59, 43, 12, 48, 3, 39, 60, 39, 47, 57, 12, 43, 52, 60, 47, 43, 12, 42, 3, 65, 41, 56, 47, 56, 43, 12, 54, 53, 59, 56, 12, 58, 43, 57, 58, 43, 56, 12, 50, 43, 12, 57, 63, 57, 58, 66, 51, 43, 12, 42, 43, 12, 57, 63, 52, 58, 46, 66, 57, 43, 8, 83]
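
As a quick sanity check (sketch only, reusing the SYNPAFLEX_SYMBOLS list from my processor), decoding these ids by hand gives back the cleaned text — note that basic_cleaners lower-cases everything, so the first id (60) is 'v' rather than 'V':

from tensorflow_tts.processor import synpaflex as synpaflex_processor

ids = [60, 53, 47, 41, 47, 12, 50, 43, 12, 58, 43, 62, 58, 43, 12, 55, 59, 43]  # first few ids from above
id_to_symbol = dict(enumerate(synpaflex_processor.SYNPAFLEX_SYMBOLS))
print("".join(id_to_symbol[i] for i in ids))  # -> "voici le texte que"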

@dathudeptrai

let's check whether your preprocessed input_ids are correct or not :))).

Here is the synpaflex.py processor I used for preprocessing :

import os
import re

import numpy as np
import soundfile as sf
from dataclasses import dataclass
from tensorflow_tts.processor import BaseProcessor
from tensorflow_tts.utils import cleaners

# regex used by text_to_sequence below to find {ARPAbet} segments
# (same pattern as the bundled LJSpeech processor)
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")

_pad = "pad"
_eos = "eos"
_punctuation = "!/\'(),-.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzéèàùâêîôûçäëïöüÿœæ"

# Export all symbols:
SYNPAFLEX_SYMBOLS = (
    [_pad] + list(_punctuation) + list(_letters) + [_eos]
)

@dataclass
class SynpaflexProcessor(BaseProcessor):
    """SynPaFlex processor."""

    cleaner_names: str = "basic_cleaners"
    positions = {
        "wave_file": 0,
        "text": 1,
    }
    train_f_name: str = "synpaflex.txt"

    def create_items(self):
        if self.data_dir:
            with open(
                os.path.join(self.data_dir, self.train_f_name), encoding="utf-8"
            ) as f:
                self.items = [self.split_line(self.data_dir, line, "|") for line in f]

    def split_line(self, data_dir, line, split):
        parts = line.strip().split(split)
        wave_file = parts[self.positions["wave_file"]]
        text = parts[self.positions["text"]]
        wav_path = os.path.join(data_dir, "wavs", f"{wave_file}.wav")
        speaker_name = "synpaflex"
        return text, wav_path, speaker_name

    def setup_eos_token(self):
        return _eos

    def get_one_sample(self, item):
        text, wav_path, speaker_name = item

        # normalize audio signal to be [-1, 1], soundfile already norm.
        audio, rate = sf.read(wav_path)
        audio = audio.astype(np.float32)

        # convert text to ids
        text_ids = np.asarray(self.text_to_sequence(text), np.int32)

        sample = {
            "raw_text": text,
            "text_ids": text_ids,
            "audio": audio,
            "utt_id": os.path.split(wav_path)[-1].split(".")[0],
            "speaker_name": speaker_name,
            "rate": rate,
        }

        return sample

    def text_to_sequence(self, text):
        sequence = []
        # Check for curly braces and treat their contents as ARPAbet:
        while len(text):
            m = _curly_re.match(text)
            if not m:
                sequence += self._symbols_to_sequence(
                    self._clean_text(text, [self.cleaner_names])
                )
                break
            sequence += self._symbols_to_sequence(
                self._clean_text(m.group(1), [self.cleaner_names])
            )
            sequence += self._arpabet_to_sequence(m.group(2))
            text = m.group(3)

        # add eos tokens
        sequence += [self.eos_id]
        return sequence

    def _clean_text(self, text, cleaner_names):
        for name in cleaner_names:
            cleaner = getattr(cleaners, name)
            if not cleaner:
                raise Exception("Unknown cleaner: %s" % name)
            text = cleaner(text)
        return text

    def _symbols_to_sequence(self, symbols):
        return [self.symbol_to_id[s] for s in symbols if self._should_keep_symbol(s)]

    def _arpabet_to_sequence(self, text):
        return self._symbols_to_sequence(["@" + s for s in text.split()])

    def _should_keep_symbol(self, s):
        return s in self.symbol_to_id and s != "_" and s != "~"

    def save_pretrained(self, saved_path):
        os.makedirs(saved_path, exist_ok=True)
        self._save_mapper(os.path.join(saved_path, PROCESSOR_FILE_NAME), {})

I took an XXX-ids.npy file from the preprocessed data and compared it to the result of the text_to_sequence() method above, with the same sentence as input, and it gave the exact same result. I'm not sure that's what you meant though?
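
In case it helps, this is roughly the comparison I ran (the path and the raw text are placeholders):

import numpy as np
from tensorflow_tts.inference import AutoProcessor

processor = AutoProcessor.from_pretrained("./tensorflow_tts/processor/pretrained/synpaflex_mapper.json")

dumped_ids = np.load("./dump_synpaflex/train/ids/XXX-ids.npy")  # ids written during preprocessing
fresh_ids = np.asarray(processor.text_to_sequence("<the corresponding raw text>"), np.int32)
print(np.array_equal(dumped_ids, fresh_ids))  # True for the samples I checked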

samuel-lunii commented 3 years ago

@dathudeptrai

@samuel-lunii is teacher forcing still OK?

I tried this :

text = "-- Ohé ! Pierre ralentit l'allure de ses chevaux, se retourna, et à la lueur du réverbère voisin, aperçut un individu, embossé dans un ample manteau, à collet relevé, et coiffé d'un casque [Note : tel est le nom que les Canadiens-Français donnent à leur coiffure d'hiver. Fin de la note.] en fourrure."
ids = processor.text_to_sequence(text)

# tacotron2 teacher forcing
input_data = np.load("./dump_synpaflex/valid/norm-feats/chevalier_filledupirate_010_009-norm-feats.npy")
decoder_outputs, forced_mel_outputs, stop_token_prediction, alignment_history = tacotron2(
    input_ids = tf.expand_dims(tf.convert_to_tensor(np.array(ids), dtype=tf.int32), 0), 
    input_lengths = tf.expand_dims(tf.convert_to_tensor(len(ids), dtype=tf.int32), 0), 
    speaker_ids = tf.convert_to_tensor([0], dtype=tf.int32), 
    mel_gts = tf.expand_dims(tf.convert_to_tensor(input_data, dtype=tf.float32), 0), 
    mel_lengths = tf.convert_to_tensor(len(input_data), dtype=tf.int32),
    )

And it also gave random speech.

dathudeptrai commented 3 years ago

@samuel-lunii can you give me your alignment figures from training (evaluation) and from real inference, with and without teacher forcing, on the same input?

samuel-lunii commented 3 years ago

@dathudeptrai For now, here are the alignment figures with (top) and without (bottom) teacher forcing during inference :

[alignment plots]

samuel-lunii commented 3 years ago

And here is an alignment figure obtained during eval: [alignment plot]

dathudeptrai commented 3 years ago

@samuel-lunii the alignment during evaluation seems valid :))). Maybe you can try using teacher forcing to extract durations and train FS2; if FS2 still has this problem, there must be a bug in the preprocessing steps :(.
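
Roughly, duration extraction means: run Tacotron 2 with teacher forcing, take the alignment matrix (decoder frames × input symbols), and count how many frames attend to each symbol. A sketch of just that counting step (illustration only, not the repo's extraction script; the alignment_history returned by the model may need squeezing/transposing to 2-D first):

import numpy as np

def durations_from_alignment(alignment):
    # alignment: [n_decoder_frames, n_input_symbols] attention weights
    n_symbols = alignment.shape[1]
    best_symbol = np.argmax(alignment, axis=1)            # symbol attended by each frame
    durations = np.bincount(best_symbol, minlength=n_symbols)
    return durations                                      # sums to n_decoder_frames

These per-symbol frame counts are what FastSpeech2 trains its duration predictor on.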

samuel-lunii commented 3 years ago

@dathudeptrai So! I'm struggling to extract durations; I will create a new issue about this.

Regarding the preprocessed data, I made a script to compare re-synthesized normalized mel spectrograms (norm-feats) to the corresponding processed text (ids):

#[...] usual imports [...]
import sounddevice as sd
from tensorflow_tts.processor import synpaflex as synpaflex_processor
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_mel', type=str, default="./dump_synpaflex/train/norm-feats/hugo_miserables_02_30_068-norm-feats.npy")
args = parser.parse_args() 

mb_config = AutoConfig.from_pretrained("./examples/multiband_melgan/conf/multiband_melgan.v1.yaml")
melgan_pretrained_path = "./examples/multiband_melgan/checkpoints/synpaflex/generator-320000.h5"  # same checkpoint as above
mb_melgan = TFAutoModel.from_pretrained(melgan_pretrained_path, config=mb_config)

# get ids from ids folder
input_ids_path = args.input_mel.replace("norm-feats","ids")
input_ids = np.load(input_ids_path)

# convert ids to text
processor = synpaflex_processor.SynpaflexProcessor
processor._load_mapper(processor, "./tensorflow_tts/processor/pretrained/synpaflex_mapper.json")
text = processor._sequence_to_symbols(processor, input_ids)
print("".join(text))

# get mel spectrogram from norm-feats folder
mel_outputs = np.load(args.input_mel)
audio = mb_melgan.inference(mel_outputs[None, ...])

# listen to audio
sd.play(audio[0, :, 0], 22050)
status = sd.wait()

Success : the audio output corresponds exactly to the result of print("".join(text)) :

comme le capitaine prononçait ces mots, un éclair illumina les ondes de l'atlantique, puis une détonation se fit entendre et deux boulets ramés balayèrent le pont de l'alcyon.eos

I guess this means there is no error in preprocessing, right ?

For info, here is a (wrong) result of the tacotron TTS using the exact same text as input.

ihshareef commented 3 years ago

Did you find a solution for the issue @samuel-lunii ?

samuel-lunii commented 3 years ago

@ihshareef Not yet, still investigating ! I will let everyone know when I find the solution.

ttslr commented 3 years ago

@ihshareef Not yet, still investigating ! I will let everyone know when I find the solution.

Hi, did you solve this problem? @samuel-lunii I also used my custom data (about 40 h) to train the tacotron2 model. The training process seems fine, but at the inference stage various checkpoints (2k-58k) always output random speech for my test data.

Your solution would be greatly appreciated!

ttslr commented 3 years ago

@samuel-lunii hi. I tested the performance using my latest checkpoint (76k) and the synthesized speech sounds good. Maybe you can train your model further and try again.

Liofang commented 3 years ago

Same result on my dataset! I trained FastSpeech2 and Tacotron2 on the same data; Tacotron2 performs better on the validation set. I saved the mel_output arrays at each eval step, and reloading them gives good results. But when I switch to inference, Tacotron2 randomly gives me chaotic output (though it does seem to contain words from my dataset). FastSpeech2 doesn't have this problem.

samuel-lunii commented 3 years ago

@dathudeptrai I managed to extract durations and to train fastspeech2, and the audio output is also nonsense speech. Have you got any idea of what I could be doing wrong in the preprocessing step ? The only difference I can see between SynPaFlex and LJSpeech is that SynPaFlex sample rate is 44.1 kHz, so there is a resampling operation during preprocessing, because I set the sample rate to 22.05 kHz in my synpaflex_preprocess.yaml file. However, I think this should not be a problem.

@ttslr I trained the model up to 76k steps too, so I do not think further training would solve this problem.

dathudeptrai commented 3 years ago

@samuel-lunii can you share the duration values of some samples here? Just want to make sure the durations are valid :D

samuel-lunii commented 3 years ago

@dathudeptrai I think I might have found the source of the problem. I made a script that displays ids and boundaries calculated from the durations, and here is the result :

[plot: ids and duration boundaries over the waveform]

The waveform is displayed from the preprocessed wavs. We can see that the trimming is not performed properly, because there is a long silence at the beginning and at the end of the sentence. There is background noise on some samples of the dataset, so maybe the problem comes from the fact that I did not tune trim_threshold_in_db correctly in my synpaflex_processor.yaml file?
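
To pick a better threshold I am experimenting with something like the sketch below — if I read the preprocessing correctly, trim_threshold_in_db corresponds to librosa's top_db, so this just shows how much audio survives trimming at different thresholds (the path is an example):

import librosa
import soundfile as sf

audio, sr = sf.read("./SynPaFlex/wavs/some_noisy_sample.wav")  # example path
for top_db in (60, 40, 30, 20):
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    print(top_db, round(len(audio) / sr, 2), round(len(trimmed) / sr, 2))  # duration before / after trimming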

samuel-lunii commented 3 years ago

Here are the duration values of the above sample :

[ 3  6  7  6  7  7  9 10 11 10  7  7  8  6  7  5  5  9 10 14 13  3  6  6
  5  7  5  7  7  6  7  8  6  6  3  6  6  6  6  3  3  2  3  5  5  6  4]
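
For what it's worth, the sanity check I use on these is that the durations of an utterance must sum to its number of mel frames, since FastSpeech2's length regulator expands each encoder output by its duration. Rough sketch, file names are illustrative:

import numpy as np

dur = np.load("./dump_synpaflex/train/durations/some_utt-durations.npy")   # illustrative path
mel = np.load("./dump_synpaflex/train/norm-feats/some_utt-norm-feats.npy")
assert dur.sum() == mel.shape[0], (dur.sum(), mel.shape[0])
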
dathudeptrai commented 3 years ago

@dathudeptrai I think I might have found the source of the problem. I made a script that displays ids and boundaries calculated from the durations, and here is the result :

[plot: ids and duration boundaries over the waveform]

The waveform is displayed from the preprocessed wavs. We can see that the trimming is not performed properly, because there is a long silence at the beginning and at the end of the sentence. There is background noise on some samples of the dataset, so maybe the problem comes from the fact that I did not tune trim_threshold_in_db correctly in my synpaflex_processor.yaml file?

Yes, it could be the source of your problem. You should tune trim_threshold_in_db in the config. I remember I added a note about this issue in the config file.

ttslr commented 3 years ago

@dathudeptrai I think I might have found the source of the problem. I made a script that displays ids and boundaries calculated from the durations, and here is the result :

[plot: ids and duration boundaries over the waveform]

The waveform is displayed from the preprocessed wavs. We can see that the trimming is not performed properly, because there is a long silence at the beginning and at the end of the sentence. There is background noise on some samples of the dataset, so maybe the problem comes from the fact that I did not tune trim_threshold_in_db correctly in my synpaflex_processor.yaml file?

Hi, I also noticed this in the first round, and I trimmed all speech samples with trim_threshold_in_db=20, but the synthesized speech is still random when the training step is < 80k.

Note that the synthesized speech sounds stable after 80k in my experiments.

dathudeptrai commented 3 years ago

@ttslr Hi, seems you are an expert in this field :D. I saw you have a lot of papers about TTS :D

ttslr commented 3 years ago

@ttslr Hi, seems you are an expert in this field :D. I saw you have a lot of papers about TTS :D

Thank you! I'm just an ordinary TTS researcher. :) 😄

samuel-lunii commented 3 years ago

@dathudeptrai @ttslr

An update: I trained Tacotron2 on the correctly trimmed dataset up to 86k steps, and I still get babbling speech, even with sentences copied from the training set as input.

My guess is that it may come from the multi-GPU training: the maximum sentence length in my dataset is 20 s, at a sample rate of 22.05 kHz, which does not allow me to train on a single GPU without OOM.

LJSpeech's maximum duration is about 10 s, so I was able to train it on a single GPU. I will try to reduce the maximum length of my dataset to 10 s in order to train on a single GPU, and I will let you know! :)

samuel-lunii commented 3 years ago

@dathudeptrai @ZDisket @ttslr

Thank you all for your help! :) So the problem came from multi-GPU training. Single-GPU training gave much better results! I still get a few pronunciation mistakes at 92k, but it might be because I only used 20 hours of speech.

Any idea of what happens during multi-GPU training ? Is the content of each sentence shuffled by accident across the GPUs, which would make the model learn incorrect speech ?

dathudeptrai commented 3 years ago

@dathudeptrai @ZDisket @ttslr

Thank you all for your help! :) So the problem came from multi-GPU training. Single-GPU training gave much better results! I still get a few pronunciation mistakes at 92k, but it might be because I only used 20 hours of speech.

Any idea of what happens during multi-GPU training ? Is the content of each sentence shuffled by accident across the GPUs, which would make the model learn incorrect speech ?

Good to know that it works now :D. Can you share the TensorBoard of the multi-GPU training and the single-GPU training? It should not be a problem :(. I have tried multi-GPU many times and it still works; I think the problem is the global_batch_size. I think global_batch_size should be in the range 32-64 for Tacotron2 training.

samuel-lunii commented 3 years ago

Multi-GPU: [TensorBoard screenshot]

Single-GPU: [TensorBoard screenshot]

Mmmh... I changed batch_size to 8 in the conf file because I used 4 GPUs. This should give global_batch_size = 32, according to this calculation in base_trainer.py
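
The calculation in question boils down to the per-GPU batch size times the number of replicas; a one-liner sketch, not the trainer's exact code:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()            # 4 GPUs in my case
global_batch_size = 8 * strategy.num_replicas_in_sync  # batch_size per GPU * replicas = 32
print(global_batch_size)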

dathudeptrai commented 3 years ago

@samuel-lunii hmm, maybe you should use batch_size = 16 or 32 per GPU. But looking at the loss curves, both single-GPU and multi-GPU still look valid to me :D

samuel-lunii commented 3 years ago

But looking at the loss curves, both single-GPU and multi-GPU still look valid to me :D

I know ! Strange isn't it ? :)

@samuel-lunii hmm, maybe you should use batch_size = 16 or 32 per GPU.

Yes, maybe... I think I already have good enough results with 1 GPU, so I will close the issue for now. I will let you know if I switch back to multi-GPU training.

dathudeptrai commented 3 years ago

@samuel-lunii please consider contributing the French model if you can :D

samuel-lunii commented 3 years ago

@samuel-lunii please consider contributing the French model if you can :D

Of course ! I already sent you an e-mail about that.