TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Tacotron2: Everything become nan at 53k steps #125

Closed (tekinek closed this issue 3 years ago)

tekinek commented 4 years ago

Hi, I am not that experienced in TTS, so I faced many problems before getting the code running with my non-English dataset, which has about 10k sentences (~26 h of audio). However, I still have some issues and questions.

  1. When the training process reaches 53.5k steps, the model seems to lose "everything": the train and eval losses and the model predictions all become nan (but training continues without reporting an exception). tensorboard1

So I stopped training and resumed from 50k; I will wait until 53.5k and see if it happens again. By the way, do my figures look fine? It looks like the model is overfitting; should I wait for a "surprise"?

  2. My language is somewhat under-resourced and there is no phoneme dictionary (at least I couldn't find one) to train a G2P and an MFA model. However, unlike English, a character roughly represents a phone, except that some vowels sound longer or shorter depending on the meaning of the host word. So a character-based model seems fine for me. This Tacotron2 has been trained just for duration extraction.

    Which step seems best for duration extraction so far?

  3. How can I improve the quality of duration extraction? extract_duration.py extracts durations from model predictions, but they are supposed to be used with ground-truth mels. Although the sum of the tacotron2-extracted durations is forced to match the length of the ground-truth mels by alignment = alignment[:real_char_length, :real_mel_length], this is just based on the assumption that predicted mels and their ground-truth counterparts are roughly one-to-one (from index 0).

    So, when the goal of training a tacotron2 is only to extract good durations, is it a good idea to use the whole dataset for training and make a severely over-fitted model (maybe up to 200k steps or more in my case)?

  4. Any idea on MFA model training for a language with no phone dictionary available? Has anyone tried making a fake phone dictionary to force MFA to align characters instead of phonemes (see the sketch below)? .... hello h e l l o nice n i c e ....
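
Something like the following could build such a character-level "fake" dictionary; this is only a hypothetical sketch (word_list.txt and dict.txt are assumed file names), not a tested MFA recipe:

# Build a character-level pronunciation "dictionary": each word maps to its own letters.
with open("word_list.txt", encoding="utf-8") as fin, open("dict.txt", "w", encoding="utf-8") as fout:
    for word in (line.strip() for line in fin):
        if word:
            fout.write("{}\t{}\n".format(word, " ".join(word)))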

Thanks.

dathudeptrai commented 4 years ago

hi @tekinek, I never got nan when training Tacotron-2, but I can give you some suggestions :)):

  1. Can you disable the guided attention loss when resuming training at 50k steps? You can do this by simply multiplying loss_att by 0.0.
  2. Around 60k->80k is ok for duration extraction; your 50k steps is also enough :v.
  3. When extracting durations based on tacotron-2, we use teacher forcing, which means the previous mel is ground truth, so that is ok.
  4. Let me think :))).

Also, pulling the newest code and running it with the newest tensorflow version may help you solve the nan problem. I guess disabling the guided attention loss is the solution for the nan problem, but let's try :v (a rough sketch is below). BTW, can you share your alignment figures?
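
A minimal sketch of the loss_att trick, assuming the trainer sums a guided-attention term into the total loss; the names below are illustrative, not the actual trainer code:

def combined_loss(mel_loss_before, mel_loss_after, stop_token_loss, loss_att, use_guided_attention=False):
    # When resuming at 50k, pass use_guided_attention=False: this multiplies loss_att by 0.0,
    # which is the trick suggested above.
    loss_att = loss_att * (1.0 if use_guided_attention else 0.0)
    return mel_loss_before + mel_loss_after + stop_token_loss + loss_att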

tekinek commented 4 years ago

@dathudeptrai thanks for your quick reply. I will try loss_att * 0.0 if my current run gets nan again. Now it is at 51k.

Here are some predicted alignments at 50k steps. Do they look fine? The stopnet seems to have a long way to go, right? :)

7_alignment 14_alignment 2_alignment 12_alignment 11_alignment 1_alignment

dathudeptrai commented 4 years ago

@tekinek hmm, it's not as good as ljspeech and the other datasets I tried before; the alignment is not strong, but I hope it's still enough to get durations for fastspeech2 training with the window masking trick. There is something wrong in your preprocessing: did you add a stop symbol at the end of charactor_ids? Did you lowercase all your text, and did you change the english cleaner to your target-language cleaner?

tekinek commented 4 years ago

@dathudeptrai

did you add stop symbols in the end of charactor_ids ?

It seems I haven't done that explicitly. Every sentence in the dataset ends with one of ".?!". I've written a cleaner and a processor based on cleaner.py and ljspeech.py; here is the processor, ugspeech.py:

import re
import os
import numpy as np
import soundfile as sf

from tensorflow_tts.utils import ugspeech_cleaners

valid_symbols = [
]

_pad = "_"
_eos = "~"
_punctuation = "!'(),.:;?«» "
_special = "-"
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_arpabet = ["@" + s for s in valid_symbols]

symbols = (
    [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet + [_eos]
)

_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")

class UGSpeechProcessor(object):
    def __init__(self, root_path, cleaner_names):
        self.root_path = root_path
        self.cleaner_names = cleaner_names
        items = []
        self.speaker_name = "ugspeech"
        if root_path is not None:
            with open(os.path.join(root_path, "metadata.csv"), encoding="utf-8") as ttf:
                for line in ttf:
                    parts = line.strip().split("|")
                    wav_path = os.path.join(root_path, "wavs", "%s.wav" % parts[0])
                    text = parts[2]
                    if len(self.text_to_sequence(text)) > 200: 
                        continue
                    print(text)
                    items.append([text, wav_path, self.speaker_name])

            self.items = items

    def get_one_sample(self, idx):
        text, wav_file, speaker_name = self.items[idx]
        audio, rate = sf.read(wav_file)
        audio = audio.astype(np.float32)
        text_ids = np.asarray(self.text_to_sequence(text), np.int32)
        sample = {
            "raw_text": text,
            "text_ids": text_ids,
            "audio": audio,
            "utt_id": self.items[idx][1].split("/")[-1].split(".")[0],
            "speaker_name": speaker_name,
            "rate": rate,
        }

        return sample

    def text_to_sequence(self, text):
        global _symbol_to_id

        sequence = []
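        # Text wrapped in {curly braces} is treated as ARPAbet symbols; everything else is cleaned and mapped character by character.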
        while len(text):
            m = _curly_re.match(text)
            if not m:
                sequence += _symbols_to_sequence(
                    _clean_text(text, [self.cleaner_names])
                )
                break
            sequence += _symbols_to_sequence(
                _clean_text(m.group(1), [self.cleaner_names])
            )
            sequence += _arpabet_to_sequence(m.group(2))
            text = m.group(3)
        return sequence

def _clean_text(text, cleaner_names):
    for name in cleaner_names:
        cleaner = getattr(ugspeech_cleaners, name)
        if not cleaner:
            raise Exception("Unknown cleaner: %s" % name)
        text = cleaner(text)
    return text

def _symbols_to_sequence(symbols):
    return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]

def _arpabet_to_sequence(text):
    return _symbols_to_sequence(["@" + s for s in text.split()])

def _should_keep_symbol(s):
    return s in _symbol_to_id and s != "_" and s != "~"

Or do you mean I should append the id of _eos to text_ids somewhere before get_one_sample returns?

did you lower all ur text and did you change english cleaner to ur target language cleaner ?

No, because in the transcripts used in the dataset, the lowercase and uppercase forms of the same letter represent different characters (my language's alphabet has more than 26 letters).

did you change english cleaner to ur target language cleaner;

Yes, I did.

FYI: I have formatted my dataset into LJSpeech style including folder structure and metadata.csv.

tekinek commented 4 years ago

@dathudeptrai restarting from 50k seems to have solved the "nan" problem.

tensorboard2

tekinek commented 4 years ago

@dathudeptrai where is _eos actually used in preprocessing with ljspeech.py? Is it supposed to be appended to every sentence in text_to_sequence, whether the normalization is phone- or character-based? It doesn't seem to be the case there.

dathudeptrai commented 4 years ago

@tekinek it is in the generator function in tacotron_dataset.py

tekinek commented 4 years ago

Hi @dathudeptrai

Following your suggestion, I went back to the dataset and preprocessing. Yes, there are some issues: long silences between words, and bad min/max frequency settings for the mel spectrograms.

I realized that long, inconsistent silences between words are quite common in my dataset. Almost every utterance also has long leading and trailing silence, but those should have been handled by trim_silence = True before. This time, I shortened every silence > 500 ms by 50%. (By the way, I wrote a script for that; I will share it soon.)
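
Something along these lines (a rough sketch using librosa, not the author's script; the top_db threshold and the 500 ms cutoff are assumptions):

import numpy as np
import librosa
import soundfile as sf

def shorten_silences(in_path, out_path, top_db=40, max_silence_ms=500):
    audio, sr = librosa.load(in_path, sr=None)
    intervals = librosa.effects.split(audio, top_db=top_db)  # non-silent (start, end) sample indices
    max_gap = int(sr * max_silence_ms / 1000)
    pieces, prev_end = [], 0
    for start, end in intervals:
        gap = audio[prev_end:start]
        if len(gap) > max_gap:
            gap = gap[: len(gap) // 2]  # keep only half of a long silence
        pieces.append(gap)
        pieces.append(audio[start:end])
        prev_end = end
    pieces.append(audio[prev_end:])  # keep whatever follows the last non-silent chunk
    sf.write(out_path, np.concatenate(pieces), sr)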

My initial mel-spectrogram min/max frequencies were 60-7600 Hz, but I found that 0-8000 Hz is much better, judged by listening to: ground-truth waveform -> mel -> Griffin-Lim -> waveform.
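
For reference, that round trip can be done with librosa; this is only a sketch, and the STFT/mel parameters below are assumptions that should match the preprocessing config:

import librosa
import soundfile as sf

wav, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80, fmin=0, fmax=8000
)
# Invert the mel spectrogram with Griffin-Lim and listen to the result.
recon = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, fmin=0, fmax=8000
)
sf.write("sample_griffin_lim.wav", recon, sr)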

Now I can see the alignment becoming stronger, but the model still fails to stop at the right location in most cases. What might be the other reasons? Thanks!

(blue one is a fresh run on newly cleaned data) tensorboard3

tensorboard4

6_alignment 7_alignment-2 8_alignment 9_alignment 10_alignment-2 11_alignment-2 3_alignment 4_alignment 5_alignment 14_alignment-2 15_alignment

dathudeptrai commented 4 years ago

@tekinek it seems ok; the alignment is strong enough to extract durations for fastspeech. For the stop token, I think the reason is that you don't add the stop_token to the end of the sentence. And you might need to train to around 100k steps to be able to run inference without teacher forcing :D.

tekinek commented 4 years ago

@dathudeptrai thanks for your quick response.

"you don't add the stop_token to the end of sentence"

How should I interpret this sentence? Should I manually append the stop token to each sentence in my dataset before preprocessing? I see this happening as the default behavior in tacotron_dataset.py (but not at inference time?)

dathudeptrai commented 4 years ago

@tekinek at inference time you should add the eos token, just as tacotron_dataset.py does :d
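
For example, something along these lines (a sketch that reuses the symbol table from the ugspeech.py snippet above; the import path and the cleaner name are placeholders, not the actual package layout):

import numpy as np
from tensorflow_tts.processor.ugspeech import UGSpeechProcessor, symbols, _eos  # hypothetical module path

processor = UGSpeechProcessor(root_path=None, cleaner_names="ugspeech_cleaners")  # placeholder cleaner name
eos_id = symbols.index(_eos)  # "~" is the last entry in the symbol table above
text_ids = processor.text_to_sequence("some sentence.")
text_ids = np.array(text_ids + [eos_id], dtype=np.int32)[np.newaxis, :]  # append eos and add a batch dimension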

tekinek commented 4 years ago

@dathudeptrai I got it, thanks.

tekinek commented 4 years ago

@dathudeptrai Sorry, wait a minute. The figures above are taken from the predictions folder generated by generate_and_save_intermediate_result at a certain training step, so the corresponding sentences should already have _eos appended.

dathudeptrai commented 4 years ago

@tekinek the yellow line you see is padding, everything is fine :)))

tekinek commented 4 years ago

Hi @dathudeptrai, I extracted durations using the 50k tacotron2 without error and started a fastspeech2 training session. While tac2 has been training for almost 5 days just to approach 70k steps, fs2 passed 120k within a day and produces better sound (maybe tac2 is not ready yet).

Here is learning curve and some mels from fs2:

fs2_eval fs2_train

fs2_2 fs2_3 fs2_4

How do these figures look to your eyes? What is wrong with the energy and f0 losses? One observed problem: fs2 fails to synthesize short, single-word sentences; the Griffin-Lim'd sound is not understandable at all (tac2 is fine in such cases). Longer sentences are fine, though both tac2 and fs2 have more noise compared to the Mozilla TTS version of tac2:

fs_short_word

Thanks!

dathudeptrai commented 4 years ago

@tekinek the mels from fastspeech2 look very good, I think. You need to train MB-MelGAN to get better audio; Griffin-Lim is always noisy.

tekinek commented 4 years ago

Hi @dathudeptrai, when I try to train a multi-band melgan model, I get an error that says "Paddings must be non-negative: 0 -6400". It happens during evaluation. Is anything wrong with the eval data?

[train]: 0%| | 0/4000000 [00:00<?, ?it/s]
2020-08-02 12:54:31.041261: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 2465 of 9238
2020-08-02 12:54:41.040099: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 5080 of 9238
2020-08-02 12:54:51.040689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 7644 of 9238
2020-08-02 12:54:57.273513: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
2020-08-02 12:55:11.379315: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
[eval]: 6it [00:27, 4.55s/it] | 5000/4000000 [14:12<182:44:23, 6.07it/s]
Traceback (most recent call last):
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1986, in execution_mode
    yield
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 655, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2363, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -6400
  [[{{node cond_4/else/_38/Pad}}]]
  [[MultiDeviceIteratorGetNextFromShard]]
  [[RemoteCall]] [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/multiband_melgan/train_multiband_melgan.py", line 492, in <module>
    main()
  File "examples/multiband_melgan/train_multiband_melgan.py", line 484, in main
    resume=args.resume,
  File "./tensorflow_tts/trainers/base_trainer.py", line 587, in fit
    self.run()
  File "./tensorflow_tts/trainers/base_trainer.py", line 101, in run
    self._train_epoch()
  File "./tensorflow_tts/trainers/base_trainer.py", line 127, in _train_epoch
    self._check_eval_interval()
  File "./tensorflow_tts/trainers/base_trainer.py", line 164, in _check_eval_interval
    self._eval_epoch()
  File "./tensorflow_tts/trainers/base_trainer.py", line 422, in _eval_epoch
    tqdm(self.eval_data_loader, desc="[eval]"), 1
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 296, in __next__
    return self.get_next()
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 316, in get_next
    self._iterators[i].get_next_as_list_static_shapes(new_name))
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 1112, in get_next_as_list_static_shapes
    return self._iterator.get_next()
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 581, in get_next
    result.append(self._device_iterators[i].get_next())
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 741, in get_next
    return self._next_internal()
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 661, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1989, in execution_mode
    executor_new.wait()
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 67, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Paddings must be non-negative: 0 -6400
  [[{{node cond_4/else/_38/Pad}}]]
  [[MultiDeviceIteratorGetNextFromShard]]
  [[RemoteCall]]
[train]: 0%|▏ | 5000/4000000 [14:40<195:28:46, 5.68it/s]

dathudeptrai commented 4 years ago

@tekinek are you using the newest code? If not, please try the newest code so I can debug it more easily.

tekinek commented 4 years ago

@dathudeptrai Yes, it was an older code base. But updating to the newest introduced a new error. It seems your recent update to multiband_melgan.v1.yaml is not fully compatible with train_multiband_melgan.py, where the older name "generator_params" still appears and causes a problem when remove_short_samples is enabled.

Traceback (most recent call last):
  File "examples/multiband_melgan/train_multiband_melgan.py", line 492, in <module>
    main()
  File "examples/multiband_melgan/train_multiband_melgan.py", line 366, in main
    ] + 2 * config["generator_params"].get("aux_context_window", 0)
KeyError: 'generator_params'

dathudeptrai commented 4 years ago

@tekinek replace generator_params with multiband_generator_params

tekinek commented 4 years ago

@dathudeptrai now it seems fine; the first validation phase passed without error. For your reference, there are some other naming mismatches between train_multiband_melgan.py and multiband_melgan.v1.yaml:

https://github.com/TensorSpeech/TensorFlowTTS/blob/ddd3caf20bb0032852f04296f9192be6e80f3caf/examples/multiband_melgan/train_multiband_melgan.py#L430-L440

tekinek commented 4 years ago

Hi @dathudeptrai, my multi-band melgan training seems to be in trouble. Even at ~1M steps, it only generates a strong continuous BEEP sound. adversarial_loss, dis_loss and fake_loss are almost static. What might be wrong with it? Thanks!

melgan_tb1

melgan_ref1

dathudeptrai commented 4 years ago

@tekinek if I were you, I would have stopped training after 10k steps :)). Your problem is related to the stft loss: some samples make the stft loss very high (I don't know why; I clip the loss so the nan loss won't happen anymore, but clipping can't prevent very high loss :D). I think I can solve this problem. Pull the newest code and try replacing (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py#L153-L155) with:

sub_sc_loss = tf.where(sub_sc_loss >= 2.0, 0.0, sub_sc_loss)
sub_mag_loss = tf.where(sub_mag_loss >= 2.0, 0.0, sub_mag_loss)
full_sc_loss = tf.where(full_sc_loss >= 2.0, 0.0, full_sc_loss)
full_mag_loss = tf.where(full_mag_loss >= 2.0, 0.0, full_mag_loss)
gen_loss = 0.5 * (sub_sc_loss + sub_mag_loss) + 0.5 * (
     full_sc_loss + full_mag_loss
)

That means any sample in the batch whose loss element is >= 2.0 will be replaced by 0.0, so we don't backprop through those samples :D. Train again for around 30k steps and report the tensorboard here :D. @ZDisket, this may be related to your issue :)).

In addition, the problem may be related to PQMF, since we can't be sure the output of PQMF is in the range [-1, 1] (you can see the audio output ranges from -2 to 4). We can fix this by applying a tanh function at https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/mb_melgan.py#L156.

Applying the 2 things above should help you prevent this problem :)).

tekinek commented 4 years ago

I see, thanks @dathudeptrai. My dataset and I seem to be good troublemakers. I will report an update here.

By the way, the original samples in my dataset are inconsistent in amplitude: the peak is around -6 dB in half of the samples and -18 dB in the rest. I did a processing pass so that all samples peak at roughly -6 dB. Maybe this is relevant.
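
That kind of peak normalization can look roughly like this (a minimal sketch; the -6 dBFS target and the file paths are assumptions):

import numpy as np
import soundfile as sf

def normalize_peak(in_path, out_path, target_db=-6.0):
    audio, sr = sf.read(in_path)
    peak = np.max(np.abs(audio))
    if peak > 0:
        target_amp = 10 ** (target_db / 20.0)  # -6 dBFS is about 0.5 in linear amplitude
        audio = audio * (target_amp / peak)
    sf.write(out_path, audio, sr)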

tekinek commented 4 years ago

@dathudeptrai

sub_sc_loss = tf.where(sub_sc_loss >= 2.0, 0.0, sub_sc_loss)
sub_mag_loss = tf.where(sub_mag_loss >= 2.0, 0.0, sub_mag_loss)
full_sc_loss = tf.where(full_sc_loss >= 2.0, 0.0, full_sc_loss)
full_mag_loss = tf.where(full_mag_loss >= 2.0, 0.0, full_mag_loss)
gen_loss = 0.5 * (sub_sc_loss + sub_mag_loss) + 0.5 * (
     full_sc_loss + full_mag_loss
)

It seems that this code has filtered out everything?

melgan_err2

dathudeptrai commented 4 years ago

@tekinek try applying only tanh :)), and tune the 2.0 value :)); it's just an example :)). I think 5.0 -> 10.0 is a valid value :D. At the beginning of training the loss may be so high that it filters out everything :D

tekinek commented 4 years ago

@dathudeptrai tuning the value 2.0 to 5.0 or 10.0, and applying tanh to the synthesis output (return tf.nn.tanh(tf.nn.conv1d(x, self.synthesis_filter, stride=1, padding="VALID"))) couldn't solve the problem :(

melgan_err3_5 0

dathudeptrai commented 4 years ago

@tekinek it's already solved, man :))). The discriminator will start training after 200k steps :)) (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml#L97). Your eval loss seems very good!

tekinek commented 4 years ago

@dathudeptrai good to know :) thanks! Let's see what happens after resuming at 200k.

manmay-nakhashi commented 4 years ago

@dathudeptrai I am getting a beep sound at 75k steps with generator-only training. Is that okay? Does generator-only training give some kind of usable output, or do we have to wait until the discriminator starts?

dathudeptrai commented 4 years ago

@manmay-nakhashi it's not ok; around 10k steps it should be audible. This is a known issue of MB-MelGAN. Even after I added clipping for the stft loss and applied a tanh function to the synthesis, it still doesn't solve this problem completely. You should tune the upper bound value of the stft loss to prevent this problem. Can you share your tensorboard?

manmay-nakhashi commented 4 years ago

@dathudeptrai I'll send you a tensorboard image shortly.

manmay-nakhashi commented 4 years ago

@dathudeptrai

Before clipping: mb_melgan

After clipping at 5 and applying tanh, it fixes the issue I guess :)) mb_melgan_after_clipping_5_tanh

dathudeptrai commented 4 years ago

@tekinek what is your upper bound value? :))) @manmay-nakhashi 5.0 is a magic number haha :)) I guess 4.0 is the best number :v.

manmay-nakhashi commented 4 years ago

@dathudeptrai haha i'll try it with 4.0 :P

manmay-nakhashi commented 4 years ago

@dathudeptrai after starting the discriminator it happened again once, but after that it settles down:

[WARNING] (Step: 205600) train_adversarial_loss = 1.0104.
[WARNING] (Step: 205600) train_subband_spectral_convergence_loss = 0.9997.
[WARNING] (Step: 205600) train_subband_log_magnitude_loss = 1.1088.
[WARNING] (Step: 205600) train_fullband_spectral_convergence_loss = 1.0251.
[WARNING] (Step: 205600) train_fullband_log_magnitude_loss = 1.3121.
[WARNING] (Step: 205600) train_gen_loss = 4.7488.
[WARNING] (Step: 205600) train_real_loss = 0.0664.
[WARNING] (Step: 205600) train_fake_loss = 0.1495.
[WARNING] (Step: 205600) train_dis_loss = 0.2159.
[train]: 5%|███████▍ | 205800/4000000 [47:51<511:38:08, 2.06it/s]
[WARNING] (Step: 205800) train_adversarial_loss = 267.4664.
[WARNING] (Step: 205800) train_subband_spectral_convergence_loss = 1.0560.
[WARNING] (Step: 205800) train_subband_log_magnitude_loss = 1.1541.
[WARNING] (Step: 205800) train_fullband_spectral_convergence_loss = 1.0531.
[WARNING] (Step: 205800) train_fullband_log_magnitude_loss = 1.3672.
[WARNING] (Step: 205800) train_gen_loss = 670.9814.
[WARNING] (Step: 205800) train_real_loss = 16.8144.
[WARNING] (Step: 205800) train_fake_loss = 1557.5889.
[WARNING] (Step: 205800) train_dis_loss = 1574.4030.

I was looking into the discriminator loss, and it doesn't have a real vs. fake loss in the master branch. Is it needed?

        if self.steps >= self.config["discriminator_train_start_steps"]:
            p_hat = self._discriminator(y_hat)
            p = self._discriminator(tf.expand_dims(audios, 2))
            adv_loss = 0.0
            for i in range(len(p_hat)):
                adv_loss += calculate_3d_loss(
                    tf.ones_like(p_hat[i][-1]), p_hat[i][-1], loss_fn=self.mse_loss
                )
            adv_loss /= i + 1
            gen_loss += self.config["lambda_adv"] * adv_loss

            dict_metrics_losses.update({"adversarial_loss": adv_loss},)

            # Is the real and fake loss calculation needed in the discriminator?
            # discriminator
            p = self.discriminator(tf.expand_dims(y, 2))
            p_hat = self.discriminator(y_hat)
            real_loss = 0.0
            fake_loss = 0.0
            for i in range(len(p)):
                real_loss += self.mse_loss(p[i][-1], tf.ones_like(p[i][-1], tf.float32))
                fake_loss += self.mse_loss(
                    p_hat[i][-1], tf.zeros_like(p_hat[i][-1], tf.float32)
                )
            real_loss /= i + 1
            fake_loss /= i + 1
            dis_loss = real_loss + fake_loss

dathudeptrai commented 4 years ago

@manmay-nakhashi so for now, is everything still ok? I think we should apply a sigmoid function for the discriminator :))). Can you try applying sigmoid to the last convolution here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L411-L416), then retrain and report the training progress here?
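
The idea is roughly the following (an illustrative Keras snippet, not the exact melgan.py layers; the filter count and kernel size are placeholders):

import tensorflow as tf

def last_discriminator_block(x, filters=1, kernel_size=3):
    # Final 1-D convolution followed by a sigmoid so the real/fake targets stay in [0, 1].
    x = tf.keras.layers.Conv1D(filters, kernel_size, padding="same")(x)
    return tf.nn.sigmoid(x)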

manmay-nakhashi commented 4 years ago

@dathudeptrai the generator trained properly up to 200k steps; once I start the discriminator it becomes unstable after 5k steps. I'll make that change and post the tensorboard here.

dathudeptrai commented 4 years ago

@manmay-nakhashi the real/fake loss is computed here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py#L179-L198)

manmay-nakhashi commented 4 years ago

@dathudeptrai it's been 20k steps and the training is mimicking the English graph pattern, so I am hoping it'll converge better after some time. I'll post the tensorboard after 50k training steps.

tekinek commented 4 years ago

hi @dathudeptrai

Oops :( My mb-melgan training still seems problematic. I used a 10.0 clip for the stft losses and tanh on the synthesis output. Should I try 4.0, and is resuming from 200k fine?

melgan_err1_0807

melgan_err2_0807

dathudeptrai commented 4 years ago

@tekinek what are your discriminator parameters?

manmay-nakhashi commented 4 years ago

@dathudeptrai I have tried the sigmoid function, but as the discriminator starts it begins adding a beep to the waveform. Then I replaced it with swish and it started working for me, but there is an edge effect in the audio ("straight spikes"). I think it can be handled with padding or filtering (or maybe it'll go away as the model converges).

tekinek commented 4 years ago

@dathudeptrai I haven't touched the defaults.

dathudeptrai commented 4 years ago

@tekinek there is no problem with the stft loss in your tensorboard. The problem is the discriminator :D. Check your current code against this line (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L379-L380).

manmay-nakhashi commented 4 years ago

@dathudeptrai have you encountered edge effects in initial discriminator training ?

tekinek commented 4 years ago

@dathudeptrai

It is like this:

      discriminator += [
                    GroupConv1D(
                        filters=out_chs,
                        kernel_size=downsample_scale * 10 + 1,
                        strides=downsample_scale,
                        padding="same",
                        use_bias=use_bias,
                        groups=in_chs // 4,
                        kernel_initializer=get_initializer(initializer_seed),
                    )
                ]

A quick debug shows that all downsample_scale values are 4.

dathudeptrai commented 4 years ago

@tekinek what is the number of parameters in your discriminator? All downsample_scales being 4 is correct.

tekinek commented 4 years ago

@dathudeptrai The discriminator has 3,981,507 parameters.

tekinek commented 4 years ago

FYI: this PyTorch implementation of MB-MelGAN worked before with the same dataset.

https://github.com/TensorSpeech/TensorFlowTTS/blob/7d9e497592e90c0d2da8664e92ae0ae2d5e2b174/tensorflow_tts/models/melgan.py#L379-L380