hi @tekinek, i never get nan when training Tacotron-2, but i can give you some suggestions :)):
Also, pulling the newest code and running it with the newest tensorflow version may help you solve the nan problem. I guess disabling the guided attention loss is the solution for the nan problem, but let's try :v. BTW, can you share ur alignment figure?
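Disabling it basically means zeroing the guided-attention term where the trainer sums its losses; a minimal hedged sketch (the other loss names are assumptions about train_tacotron2.py's variables, not its actual code):

def combine_losses(mel_loss_before, mel_loss_after, stop_token_loss, loss_att):
    # Keep computing loss_att for logging, but multiply by 0.0 so it adds
    # nothing to the gradients; all argument names here are assumptions.
    return mel_loss_before + mel_loss_after + stop_token_loss + loss_att * 0.0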
@dathudeptrai thanks for your quick reply. I will try loss_att * 0.0 if my current run gets nan again. Now it is at 51k.
Here are some predicted alignments at 50k steps. Do they look fine? stopnet seems to have a long way to go, right? :)
@tekinek hmm, it's not as good as ljspeech and other datasets i tried before; the alignment is not strong, but i hope it's still enough to get durations for fastspeech2 training with the window masking trick. There is something wrong in ur preprocessing: did you add a stop symbol at the end of charactor_ids? did you lowercase all ur text, and did you change the english cleaner to ur target language cleaner?
@dathudeptrai
did you add a stop symbol at the end of charactor_ids?
It seems I haven't done that explicitly. Every sentence in the dataset ends with one of ".?!". I've written a cleaner and a processor based on cleaner.py and ljspeech.py; here is the processor, ugspeech.py:
import os
import re

import numpy as np
import soundfile as sf

from tensorflow_tts.utils import ugspeech_cleaners

valid_symbols = []

_pad = "_"
_eos = "~"
_punctuation = "!'(),.:;?«» "
_special = "-"
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# Prepend "@" to ARPAbet symbols to ensure uniqueness (empty here, since this
# processor is character-based).
_arpabet = ["@" + s for s in valid_symbols]

# Export all symbols; _eos is deliberately last, so its id is len(symbols) - 1.
symbols = (
    [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet + [_eos]
)

_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}

# Matches text enclosed in curly braces, treated as ARPAbet:
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")


class UGSpeechProcessor(object):
    """UGSpeech processor, following the LJSpeech metadata.csv layout."""

    def __init__(self, root_path, cleaner_names):
        self.root_path = root_path
        self.cleaner_names = cleaner_names
        items = []
        self.speaker_name = "ugspeech"
        if root_path is not None:
            with open(os.path.join(root_path, "metadata.csv"), encoding="utf-8") as ttf:
                for line in ttf:
                    parts = line.strip().split("|")
                    wav_path = os.path.join(root_path, "wavs", "%s.wav" % parts[0])
                    text = parts[2]
                    # Skip overly long utterances.
                    if len(self.text_to_sequence(text)) > 200:
                        continue
                    print(text)
                    items.append([text, wav_path, self.speaker_name])
        self.items = items

    def get_one_sample(self, idx):
        text, wav_file, speaker_name = self.items[idx]
        audio, rate = sf.read(wav_file)
        audio = audio.astype(np.float32)
        text_ids = np.asarray(self.text_to_sequence(text), np.int32)
        sample = {
            "raw_text": text,
            "text_ids": text_ids,
            "audio": audio,
            "utt_id": self.items[idx][1].split("/")[-1].split(".")[0],
            "speaker_name": speaker_name,
            "rate": rate,
        }
        return sample

    def text_to_sequence(self, text):
        global _symbol_to_id

        sequence = []
        # Check for curly braces and treat their contents as ARPAbet:
        while len(text):
            m = _curly_re.match(text)
            if not m:
                sequence += _symbols_to_sequence(
                    _clean_text(text, [self.cleaner_names])
                )
                break
            sequence += _symbols_to_sequence(
                _clean_text(m.group(1), [self.cleaner_names])
            )
            sequence += _arpabet_to_sequence(m.group(2))
            text = m.group(3)
        return sequence


def _clean_text(text, cleaner_names):
    for name in cleaner_names:
        cleaner = getattr(ugspeech_cleaners, name)
        if not cleaner:
            raise Exception("Unknown cleaner: %s" % name)
        text = cleaner(text)
    return text


def _symbols_to_sequence(symbols):
    return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]


def _arpabet_to_sequence(text):
    return _symbols_to_sequence(["@" + s for s in text.split()])


def _should_keep_symbol(s):
    return s in _symbol_to_id and s != "_" and s != "~"
Or do you mean I should append the id of _eos to text_ids somewhere before get_one_sample returns?
did you lowercase all ur text, and did you change the english cleaner to ur target language cleaner?

No, because in the transcript used in the dataset, the lowercase and uppercase forms of the same letter represent different characters (my language's alphabet has more than 26 letters).
did you change the english cleaner to ur target language cleaner?
Yes, I did.
FYI: I have formatted my dataset into LJSpeech style, including the folder structure and metadata.csv.
@dathudeptrai restarting from 50k seems to have solved the "nan" problem
@dathudeptrai where is _eos actually used in preprocessing with ljspeech.py? Is it supposed to be appended to every sentence in text_to_sequence, whether the normalization is phone- or character-based? That doesn't seem to be the case there.
@tekinek it is in the generator function in tacotron_dataset.py
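Roughly, that step does something like this (a hedged sketch; names may differ slightly from the actual tacotron_dataset.py):

import numpy as np

def append_eos(text_ids, symbols):
    # _eos is the last entry in the symbols list, so its id is len(symbols) - 1.
    return np.concatenate([text_ids, [len(symbols) - 1]], -1).astype(np.int32)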
Hi @dathudeptrai
Following your suggestion, I went back to the dataset and preprocessing. Yes, there are some issues: long silences between words, and bad min/max frequency settings for the mel-spectrogram.
I realize that inconsistent and long silences between words are fairly common in my dataset. Sure, almost every utterance has long leading and trailing silence, but those should have been handled by trim_silence = True before. This time, I shortened every silence > 500ms by 50%. (By the way, I wrote a script for that; I will share it soon.)
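In the meantime, the idea is roughly this (a sketch assuming librosa's energy-based splitting; top_db and the keep ratio are the knobs to tune):

import librosa
import numpy as np
import soundfile as sf

def shorten_silences(in_path, out_path, top_db=40, max_sil_sec=0.5):
    # Keep speech intervals intact; halve any silent gap longer than 500 ms.
    audio, sr = librosa.load(in_path, sr=None)
    intervals = librosa.effects.split(audio, top_db=top_db)  # non-silent spans
    pieces = []
    prev_end = 0
    for start, end in intervals:
        gap = audio[prev_end:start]
        if len(gap) > max_sil_sec * sr:
            gap = gap[: len(gap) // 2]  # 50% shortening
        pieces.append(gap)
        pieces.append(audio[start:end])
        prev_end = end
    pieces.append(audio[prev_end:])  # trailing audio
    sf.write(out_path, np.concatenate(pieces), sr)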
My initial settings for the mel-spectrogram min/max frequencies were 60-7600 Hz, but I found that 0-8000 is much better by doing: ground-truth waveform -> mel -> griffin_lim -> waveform -> listening.
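That listening check is easy to script (a sketch with librosa; the STFT parameters here are assumptions and must match your preprocessing config):

import librosa
import soundfile as sf

audio, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=80,
    fmin=0, fmax=8000,  # the min/max mel frequencies under test
)
recovered = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, fmin=0, fmax=8000
)
sf.write("roundtrip.wav", recovered, sr)  # listen and compare to the original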
Now I can see the alignment becoming stronger, but the model still fails to stop at the right location in most cases. What might be the other reasons? thanks!
(blue one is a fresh run on newly cleaned data)
@tekinek it seems ok; the alignment is strong enough to extract durations for fastspeech. As for the stop token, I think the reason is that you don't add the stop_token to the end of the sentence. And you might need to train it to 100k to be able to inference without teacher forcing :D.
@dathudeptrai thanks for your quick response.
"you don't add the stop_token to the end of sentence"
How should I interpret this sentence? Should I manually append stoptoken "" to each sentence in my dataset before prepossessing? I see this happening as default behavior in tacotron_dataset.py (not in inference time?)
@tekinek at inference time you should add the eos token as tacotron2_dataset does :d
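For example, something like this (a hedged sketch; the cleaner name is a placeholder, and UGSpeechProcessor / symbols come from the ugspeech.py above):

import numpy as np

# Build ids with the processor, then append the eos id manually, mirroring
# what the training-time generator does.
processor = UGSpeechProcessor(root_path=None, cleaner_names="basic_cleaners")
ids = processor.text_to_sequence("Some test sentence.")
ids = np.asarray(ids + [len(symbols) - 1], np.int32)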
@dathudeptrai I got it, thanks.
@dathudeptrai Sorry, wait a minute. The above figures are taken from the predictions folder generated by generate_and_save_intermediate_result at a certain training step, so the corresponding sentences should already have _eos appended.
@tekinek the yellow line you see is padding; everything is fine :)))
Hi @dathudeptrai, I extracted durations using the 50k tacotron2 without error and started a fastspeech2 training session. While tac2 has been training for almost 5 days just to approach 70k, fs2 passed 120k within a day and produces better sound (maybe tac2 is just not ready yet).
Here are the learning curve and some mels from fs2:
How do these figures look to you? What is wrong with the energy and f0 losses? One observed problem: fs2 fails to synthesize short single-word sentences; the griffin_lim-ed sound is not understandable at all (tac2 is fine in such cases). Longer sentences are fine, though both tac2 and fs2 have more noise compared to the Mozilla TTS version of tac2:
Thanks!
@tekinek the mels from fastspeech2 look very good i think. You need to train mb-melgan to get better audio. GL is always noisy.
Hi @dathudeptrai, when I try to train a multi-band melgan model, I get an error that says "Paddings must be non-negative: 0 -6400". It happens during evaluation. Anything wrong with the eval data?
[train]: 0%| | 0/4000000 [00:00<?, ?it/s]
2020-08-02 12:54:31.041261: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 2465 of 9238
2020-08-02 12:54:41.040099: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 5080 of 9238
2020-08-02 12:54:51.040689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 7644 of 9238
2020-08-02 12:54:57.273513: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
2020-08-02 12:55:11.379315: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
[eval]: 6it [00:27, 4.55s/it] | 5000/4000000 [14:12<182:44:23, 6.07it/s]
Traceback (most recent call last):
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1986, in execution_mode
    yield
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 655, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2363, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/karil/.conda/envs/tf2.2/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/multiband_melgan/train_multiband_melgan.py", line 492, in
@tekinek are u using the newest code? If not, try the newest code; then i can easily debug
@dathudeptrai Yes, it was an older code base, but updating to the newest introduced a new error. It seems your recent update to multiband_melgan.v1.yaml is not fully compatible with train_multiband_melgan.py, where the older name "generator_params" still appears and causes a problem when remove_short_samples is enabled.
Traceback (most recent call last):
  File "examples/multiband_melgan/train_multiband_melgan.py", line 492, in <module>
    main()
  File "examples/multiband_melgan/train_multiband_melgan.py", line 366, in main
    ] + 2 * config["generator_params"].get("aux_context_window", 0)
KeyError: 'generator_params'
@tekinek replace generator_params with multiband_generator_params
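That is, the failing lookup around line 366 of train_multiband_melgan.py should read the renamed key (an assumed one-line fix):

# Read the generator settings under the renamed yaml key.
aux_context_window = config["multiband_generator_params"].get("aux_context_window", 0)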
@dathudeptrai now it seems fine; the first validation phase passed without error. For your reference, there are some other naming mismatches between train_multiband_melgan.py and multiband_melgan.v1.yaml
Hi @dathudeptrai, my multiband melgan training seems to be in trouble. Even at ~1m steps, it only generates a strong continuous BEEP sound. adversarial_loss, dis_loss and fake_loss are almost static. What might be wrong with it? thanks!
@tekinek if i were you, i would stop training after 10k steps :)). Your problem is related to the stft loss: there are some samples that make the stft loss very high (i do not know why; i clip the loss so the nan loss won't happen anymore, but clipping can't prevent very high loss :D). I think i can solve this problem; pull the newest code and try replacing (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py#L153-L155) with:
sub_sc_loss = tf.where(sub_sc_loss >= 2.0, 0.0, sub_sc_loss)
sub_mag_loss = tf.where(sub_mag_loss >= 2.0, 0.0, sub_mag_loss)
full_sc_loss = tf.where(full_sc_loss >= 2.0, 0.0, full_sc_loss)
full_mag_loss = tf.where(full_mag_loss >= 2.0, 0.0, full_mag_loss)
gen_loss = 0.5 * (sub_sc_loss + sub_mag_loss) + 0.5 * (
    full_sc_loss + full_mag_loss
)
that means any sample in the batch with a loss element >= 2.0 will be replaced with 0.0, so we don't backprop those samples :D. Train again for around 30k steps and report the tensorboard here :D. @ZDisket, it may be related to ur issue :)).
In addition, the problem may be related to the pqmf, since we can't be sure the output of the pqmf is in the range [-1, 1] (you can see the audio output ranges from -2 to 4). We can fix this by applying a tanh function at https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/mb_melgan.py#L156.
Applying the 2 things above should help you prevent this problem :)).
I see, thanks @dathudeptrai. Me and my dataset seem to be good trouble makers. I will report updates here.
By the way, the original samples in my dataset are inconsistent in amplitude: the peak is around -6db in half of the samples and -18db in the rest. I did a processing pass so that all samples peak at roughly -6db. Maybe this is relevant.
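(For reference, that kind of peak normalization can be sketched as follows; the function name and target are illustrative:)

import numpy as np

def normalize_peak(audio, target_db=-6.0):
    # Scale so the absolute peak sits at target_db dBFS (-6 dB ~ 0.5 linear).
    peak = np.max(np.abs(audio))
    if peak == 0.0:
        return audio
    return audio * (10.0 ** (target_db / 20.0) / peak)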
@dathudeptrai
sub_sc_loss = tf.where(sub_sc_loss >= 2.0, 0.0, sub_sc_loss)
sub_mag_loss = tf.where(sub_mag_loss >= 2.0, 0.0, sub_mag_loss)
full_sc_loss = tf.where(full_sc_loss >= 2.0, 0.0, full_sc_loss)
full_mag_loss = tf.where(full_mag_loss >= 2.0, 0.0, full_mag_loss)
gen_loss = 0.5 * (sub_sc_loss + sub_mag_loss) + 0.5 * (
    full_sc_loss + full_mag_loss
)
It seems that this code has filtered out everything?
@tekinek try applying only the tanh :)), and tune the 2.0 value :)); it's just an example :)). i think 5.0 -> 10.0 is a valid value :D. At the beginning of training the loss may be so high that it filters everything :D
@dathudeptrai tuning the value 2.0 to 5.0 or 10.0 and applying tanh to the synthesis output (return tf.nn.tanh(tf.nn.conv1d(x, self.synthesis_filter, stride=1, padding="VALID"))) couldn't solve the problem :(
@tekinek it's already solved man :))). The discriminator will start training after 200k steps :)) (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/multiband_melgan/conf/multiband_melgan.v1.yaml#L97). Ur eval loss seems very good!
@dathudeptrai good to know that :) thanks! let's see what happens after resuming at 200k.
@dathudeptrai i am getting a beep sound at 75k with generator-only training. is it okay? does generator-only training give some kind of output, or do we have to wait till the discriminator starts?
@manmay-nakhashi it's not ok; around 10k steps it should be audible. This is a known issue of mb-melgan. Even though i added clipping for the stft loss and applied a tanh function to the synthesis, it still doesn't solve this problem completely. You should tune the upper bound value of the stft loss to prevent this problem. Can you share ur tensorboard?
@dathudeptrai i'll send you tensorboard image shortly
@dathudeptrai
before clipping
after clipping at 5 and applying tanh, it fixes the issue i guess :))
@tekinek what is ur upper bound value :))). @manmay-nakhashi 5.0 is a magic number haha :)) i guess 4.0 is the best number :v.
@dathudeptrai haha i'll try it with 4.0 :P
@dathudeptrai after starting the discriminator it happened again one time, but after that it settles down:

[WARNING] (Step: 205600) train_adversarial_loss = 1.0104.
[WARNING] (Step: 205600) train_subband_spectral_convergence_loss = 0.9997.
[WARNING] (Step: 205600) train_subband_log_magnitude_loss = 1.1088.
[WARNING] (Step: 205600) train_fullband_spectral_convergence_loss = 1.0251.
[WARNING] (Step: 205600) train_fullband_log_magnitude_loss = 1.3121.
[WARNING] (Step: 205600) train_gen_loss = 4.7488.
[WARNING] (Step: 205600) train_real_loss = 0.0664.
[WARNING] (Step: 205600) train_fake_loss = 0.1495.
[WARNING] (Step: 205600) train_dis_loss = 0.2159.
[train]: 5%|███████▍ | 205800/4000000 [47:51<511:38:08, 2.06it/s]
[WARNING] (Step: 205800) train_adversarial_loss = 267.4664.
[WARNING] (Step: 205800) train_subband_spectral_convergence_loss = 1.0560.
[WARNING] (Step: 205800) train_subband_log_magnitude_loss = 1.1541.
[WARNING] (Step: 205800) train_fullband_spectral_convergence_loss = 1.0531.
[WARNING] (Step: 205800) train_fullband_log_magnitude_loss = 1.3672.
[WARNING] (Step: 205800) train_gen_loss = 670.9814.
[WARNING] (Step: 205800) train_real_loss = 16.8144.
[WARNING] (Step: 205800) train_fake_loss = 1557.5889.
[WARNING] (Step: 205800) train_dis_loss = 1574.4030.
i was looking into the discriminator loss, and it doesn't have the real vs fake loss in the master branch. is it needed?
if self.steps >= self.config["discriminator_train_start_steps"]:
    p_hat = self._discriminator(y_hat)
    p = self._discriminator(tf.expand_dims(audios, 2))
    adv_loss = 0.0
    for i in range(len(p_hat)):
        adv_loss += calculate_3d_loss(
            tf.ones_like(p_hat[i][-1]), p_hat[i][-1], loss_fn=self.mse_loss
        )
    adv_loss /= i + 1
    gen_loss += self.config["lambda_adv"] * adv_loss
    dict_metrics_losses.update({"adversarial_loss": adv_loss},)

# is the real and fake loss calculation needed in the discriminator??
# discriminator
p = self.discriminator(tf.expand_dims(y, 2))
p_hat = self.discriminator(y_hat)
real_loss = 0.0
fake_loss = 0.0
for i in range(len(p)):
    real_loss += self.mse_loss(p[i][-1], tf.ones_like(p[i][-1], tf.float32))
    fake_loss += self.mse_loss(
        p_hat[i][-1], tf.zeros_like(p_hat[i][-1], tf.float32)
    )
real_loss /= i + 1
fake_loss /= i + 1
dis_loss = real_loss + fake_loss
@manmay-nakhashi so for now, everything is still ok? I think we should apply a sigmoid function for the discriminator :))). Can you try applying a sigmoid to the last convolution? here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L411-L416). Then retrain and report the training progress here?
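Something like this (a hedged sketch, not the exact melgan.py code; the kernel size and initializer should follow ur config):

import tensorflow as tf

# Give the discriminator's final convolution a sigmoid so its outputs are
# squashed into (0, 1) before the MSE real/fake losses.
final_conv = tf.keras.layers.Conv1D(
    filters=1,
    kernel_size=3,  # placeholder; melgan.py derives this from the config
    padding="same",
    activation="sigmoid",
)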
@dathudeptrai the generator trained properly till 200k steps; once i start the discriminator, it becomes unstable after 5k steps. i'll make that change and post the tensorboard over here
@manmay-nakhashi the real/fake loss is computed here (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py#L179-L198)
@dathudeptrai it's been 20k steps and the training is mimicking the english graph pattern, so i am hoping it'll converge better after some time. i'll post the tensorboard after 50k training steps
hi @dathudeptrai
oops, here :( My mb-melgan training still seems problematic. Mine was a 10.0 clip for the stft losses and tanh on the synthesis output. Should I try 4.0, and is resuming from 200k fine?
@tekinek what is ur discriminator parameter?
@dathudeptrai i have tried the sigmoid function, but as the discriminator starts, it starts adding a beep to the waveform. then i replaced it with swish and it started working for me, but there is an edge effect in the audio ("straight spikes"). i think it can be handled with padding or filtering (or maybe it'll go away as the model converges)
@dathudeptrai I haven't touched the defaults.
@tekinek there is no problem with the stft loss in ur tensorboard. The problem is the discriminator :D. Check ur current code against this line (https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/melgan.py#L379-L380).
@dathudeptrai have you encountered edge effects in initial discriminator training ?
@dathudeptrai
It is like this:
discriminator += [
    GroupConv1D(
        filters=out_chs,
        kernel_size=downsample_scale * 10 + 1,
        strides=downsample_scale,
        padding="same",
        use_bias=use_bias,
        groups=in_chs // 4,
        kernel_initializer=get_initializer(initializer_seed),
    )
]
A quick debug shows that all the downsample_scale values are 4.
@tekinek what is the number of parameters in ur discriminator? All downsample_scales being 4 is correct.
@dathudeptrai The discriminator has 3,981,507 parameters.
FYI: this pytorch implementation of mb-melgan worked before with the same dataset.
Hi, I am not that experienced in TTS, so I faced many problems before getting the code running with my non-English dataset, which has about 10k sentences (~26h of audio). However, there are still some issues and questions.
So I stopped training and resumed from 50k; I will wait until 53.5k and see if it happens again. By the way, do my figures look fine? It looks like the model is overfitting; should I wait for a "surprise"?
My language is somewhat under-resourced, and there is no phoneme dictionary (at least I couldn't find one) to train a G2P and an MFA model. However, unlike in English, a character roughly represents a phone, except that some vowels sound longer or shorter according to the meaning of the host word. So a character-based model seems fine for me. This tacotron2 has been trained just for duration extraction.
Which step seems best for duration extraction so far?
How can I improve the quality of duration extraction? extract_duration.py extracts durations from model predictions, but they are supposed to be used with ground-truth mels. Although the sum of the tacotron2-extracted durations is forced to match the length of the ground-truth mels by

alignment = alignment[:real_char_length, :real_mel_length]

this is based on the assumption that predicted mels and their ground-truth counterparts are roughly aligned one-to-one (from index 0). So, when the goal of training a tacotron2 is only to extract good durations, is it a good idea to use the whole dataset for training and make a severely over-fitted model (maybe up to 200k steps or more in my case)?
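(Conceptually, the extraction works like this sketch, which mirrors the idea rather than extract_duration.py's exact code:)

import numpy as np

def alignment_to_durations(alignment, real_char_length, real_mel_length):
    # alignment: (chars, mel_frames) attention matrix from tacotron2.
    alignment = alignment[:real_char_length, :real_mel_length]
    winners = np.argmax(alignment, axis=0)  # most-attended char per frame
    return np.bincount(winners, minlength=real_char_length)  # sums to real_mel_length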
Any idea on MFA model training for a language with no phone dictionary available? Has anyone tried making a fake phone dictionary like this to force MFA to align characters instead of phonemes?

....
hello h e l l o
nice n i c e
....
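(Generating such a fake dictionary is trivial; a hypothetical helper, with the word list and path as placeholders:)

def build_char_dictionary(words, out_path="char_dict.txt"):
    # One line per word: the word followed by its letters, MFA-dictionary style.
    with open(out_path, "w", encoding="utf-8") as f:
        for word in sorted(set(words)):
            f.write("%s %s\n" % (word, " ".join(word)))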
Thanks.