Hello @begeekmyfriend, awesome results, thank you very much for sharing them!
Oh okay.. I didn't expect to find a 1-minute-long audio clip! I'm no Mandarin expert, but the overall results sound okay.
About the audio quality, as I usually say, inverting mel spectrograms directly can be challenging and will always carry that "robotic" noise. I am currently working on integrating Wavenet as a vocoder, which will give human-like speech quality, so stay tuned!
Being the perfectionist I am, however, I will try to figure out a way to improve the predicted mels even more; I won't settle for less than a perfect match with the real spectros :) (even though more training should improve quality a bit)
Awesome work, feel free to share any further results, trained models, or even merge branches to add Mandarin support if you like!
The audio was concatenated from 12 separate wav files (12 sentences) with FFmpeg...
The point I appreciate is that only a small dataset needs to be fed in to train an effective model with your project. You know it is usually tough for many people to produce a long corpus, and the sanity check is valuable as well. I am going to train for more steps for evaluation and will post my results then.
I would mark this result as a baseline for all the tacotron projects I have ever seen, and I believe your work will help other authors improve their own. Thanks again!
Yeah I noticed the concatenation :)
Thank you sir for your support, I'm glad this work suits your needs.
Hopefully we can achieve better results in the upcoming days!
Looking forward to your results!
Terminal_train_log.zip It seems there is vibration in the loss value during training at 100K+ steps. Is there a hyperparameter to adjust, such as the learning rate decay? ... By the way, the loss value dropped to 0.3 at 107K steps after the vibration.
Hello again @begeekmyfriend, thanks for reporting that.
I noticed these vibrations as well. After taking a second look at the learning rate decay function I noticed the horrible mistake I was making in the computation of this decay.. Anyway, learning rates were bigger than they were supposed to be, which explains those vibrations..
Crazy how a typo can make your life hell.. :) It's fixed now, and learning rate decay is set to start after 50k steps and reach its minimal value (1e-5) at 150k steps. (d28cfa9a77afc87902100bd5b2113fbb8541227e)
Here's what the new learning rate evolution looks like (lr vs training_steps):
Since in the T2 paper, decay parameters were not specified, I tried optimizing those params for our case, so they might need some extra tweaking.
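For anyone who wants to sanity-check the curve, here is a rough sketch of a clipped exponential decay of this kind (plain Python, not the repo's exact code; the constants mirror the decay hparams discussed in this thread and are assumptions):

# Sketch of a clipped exponential decay: flat until start_decay,
# then decaying toward final_lr (all values assumed, match your hparams.py).
def tacotron_learning_rate(step,
                           init_lr=1e-3,       # tacotron_initial_learning_rate
                           final_lr=1e-5,      # tacotron_final_learning_rate
                           start_decay=50000,  # tacotron_start_decay
                           decay_steps=25000,  # tacotron_decay_steps
                           decay_rate=0.33):   # tacotron_decay_rate
    if step < start_decay:
        return init_lr
    lr = init_lr * decay_rate ** ((step - start_decay) / decay_steps)
    return max(lr, final_lr)

# Flat at 1e-3 until 50k steps, then roughly 1e-5 around 150k steps.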
You will also notice a faster learning of attention and a faster loss drop.. From these two observations one could guess that model quality will improve, we'll find out once training is finished.
Note: I messed around with the preprocessing a bit. You may want to keep that in mind in case it affects your preprocessed data (I made changes to the indices in the feeder and metadata). I also removed lowercasing in the cleaners because I noticed it wasn't useful for the English cleaners. If you need that lowercasing for Mandarin, you can put it back in tacotron/utils/cleaners.py.
@begeekmyfriend Hi Leo, thanks for your work. Which dataset are you using, THCHS-30?
No, I am using my private dataset. THCHS30 is feasible as well.
Okay, you tell me how it goes then, I'll be running a few tests on my own too. Hopefully we find the optimal scheme.
On Mon, 16 Apr 2018, Leo Ma wrote: @Rayhane-mamah I am trying the Noam scheme, as Keith Ito's tacotron does. It is also mentioned in tensor2tensor (tensorflow/tensor2tensor#280, https://github.com/tensorflow/tensor2tensor/issues/280).
I have changed the hyperparameters for learning rate decay as follows, since there is vibration in the loss value after 10K steps.
tacotron_start_decay = 10000, #Step at which learning decay starts
tacotron_decay_steps = 10000, #starting point for learning rate decay (and determines the decay slope) (UNDER TEST)
tacotron_decay_rate = 0.33, #learning rate decay rate (UNDER TEST)
train.log.zip
I think the loss value decreases very slowly in your tacotron compared with Keith Ito's implementation. In train.log we can see the avg_loss decreasing smoothly, while in Terminal_train_log it often stalls for a long time.
@begeekmyfriend, the logs you provided are from a previous repo version, did you upgrade to the latest one? I am seeing loss "explosions" in the log; I hope you're not referring to those when you say the loss is vibrating..
Just to be sure we're talking about the same thing, here's what loss vibrations look like to me:
After a smoothing of 0.93:
The overall loss is still decreasing slowly. If by vibrations you're still referring to these explosions, you can set tacotron_scale_regularization = True in hparams.py; that should take care of it.
I also want to point out that T1 and T2 use different loss functions and we do not rescale our mel outputs the same way. (I normalize my mels to [-4, 4]).
You can change this option back to [0, 1] by setting in hparams.py:
symmetric_mels = False,
max_abs_value = 1.,
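To make the two scalings concrete, here is a rough sketch of the kind of normalization involved (plain numpy, not the repo's audio code; the -100 dB floor is an assumption, check audio.py for the exact constant):

import numpy as np

min_level_db = -100.   # assumed dB floor, not necessarily the repo's value
max_abs_value = 4.     # hparams.max_abs_value

def normalize(S_db, symmetric=True):
    # Map the dB spectrogram to [0, 1] first...
    x = np.clip((S_db - min_level_db) / -min_level_db, 0., 1.)
    if symmetric:              # symmetric_mels = True  -> [-4, 4]
        return (2. * x - 1.) * max_abs_value
    return x * max_abs_value   # symmetric_mels = False -> [0, max_abs_value]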
@begeekmyfriend Thanks for your contribution. :) I am using THCHS-30 as the speech corpus here; however, I found an "interesting" phenomenon:
Step 2500 [1.775 sec/step, loss=0.84299, avg_loss=0.84627]
Writing summary at step: 2500
Saving checkpoint to: logs-Tacotron/pretrained/model.ckpt-2500
Saving alignment, Mel-Spectrograms and griffin-lim inverted waveform..
Input at step 2500: tantwo huaone yitwo xianfour defive wangtwo shuone wentwo gaithree getwo youtwo yutwo faone shengone zaifour tangtwo shunfour zongone yongthree zhenone niantwo jianone suotwo yithree youfour beifour chengone weitwo yongthree zhenone getwo xinone~________________________________________________________________________________________
The tones of pinyin (written as 1, 2, 3, 4, 5) are automatically converted into one, two, three, four, five. :( The original input at step 2500 should be:
Input at step 2500: tan2 hua1 yi2 xian4...
How could I correct this fatal input? :(
@DavidAksnes It never happens to me...
Hello David,
Use basic_cleaners instead of english_cleaners in hparams.py.
For English, numbers are converted to normalized words.
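Concretely, that is a one-line change in hparams.py (a sketch; double-check the exact key name in your copy of the file):

cleaners = 'basic_cleaners',  # instead of 'english_cleaners'; skips the number-to-word expansion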
train.log.zip Above are the tensorboard plots and the training logs for your tacotron model and Keith Ito's. I have run your latest version with the learning rate settings below:
tacotron_start_decay = 50000, #Step at which learning decay starts
tacotron_decay_steps = 25000, #starting point for learning rate decay (and determines the decay slope) (UNDER TEST)
tacotron_decay_rate = 0.33, #learning rate decay rate (UNDER TEST)
tacotron_initial_learning_rate = 1e-3, #starting learning rate
tacotron_final_learning_rate = 1e-5, #minimal learning rate
I used the THCHS-30 dataset, about 34 hours long. We can see that the loss value is still very high. In my experience, a loss below 0.1 is an acceptable result, as Ito's tacotron achieved.
I use @begeekmyfriend's patch for Chinese. My machine is slow and has only reached 20K steps so far (running since yesterday); the loss stays at 0.6x, and the same thing happens with @Rayhane-mamah's original repo (0.5x up to 60k+).
@begeekmyfriend and @butterl, about the loss values, it would be preferable not to compare my work with keithito's directly. Let me explain why such a comparison isn't useful:
The Tacotron feature prediction network outputs mel spectrograms, which we can consider as matrices of shape [mel_frames, 80]. Take the example of [100, 80]: the model has to predict 8000 values, one [1, 80] vector at a time (or a [5, 80] matrix at a time if the reduction factor r=5).
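As a quick toy illustration (not repo code), the reduction factor only groups frames per decoder step; the total number of predicted values stays the same:

import numpy as np

mel_targets = np.zeros((100, 80))            # [mel_frames, num_mels]
r = 5                                        # outputs_per_step
per_step = mel_targets.reshape(-1, r * 80)   # [20, 400]: 5 frames per decoder step
print(mel_targets.size, per_step.shape)      # 8000 values either way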
Here, I took the liberty of plotting two random distributions of 8000 points, with different means and std values.
What I want to bring your attention to is that when preprocessing the data, I tend to create bigger distances between data points (in our case, the distance between brown and red). You might wonder why I do that; here's the reason.
If you want to scale your data to [0, 1] like in keithito's work please make these changes in hparams.py:
symmetric_mels = False,
max_abs_value = 1.,
The loss will drop to around 3e-3.. The thing is, the loss value itself is absolute and is not normalized by the data range or anything, so comparing losses from two differently distributed outputs is a little misleading..
I also want to point out that by widening the output range to [-4, 4], we were able to reduce blurriness in outputs probably due to the added penalization on L2 loss.
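Here is a toy numpy illustration of why the absolute values are not comparable (nothing from the repo): with the same relative error, the L2 loss grows with the square of the output range:

import numpy as np

rng = np.random.default_rng(0)
target_01 = rng.random((100, 80))                 # targets scaled to [0, 1]
noise = 0.05 * rng.standard_normal(target_01.shape)

target_44 = 8. * target_01 - 4.                   # same data rescaled to [-4, 4]
pred_01 = target_01 + noise                       # identical relative error
pred_44 = target_44 + 8. * noise

print(np.mean((pred_01 - target_01) ** 2))        # ~2.5e-3
print(np.mean((pred_44 - target_44) ** 2))        # ~0.16, i.e. 64x larger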
With this said, I also experienced better output quality with outputs_per_step=1 (i.e. no reduction factor, predicting outputs one frame at a time). It's actually predictable, since it's easier for the model to predict only the next mel frame instead of trying to predict 5 frames at a time.
Attention is also working:
I noticed, however, that the model is moving forward too fast, leading to a fast reading when synthesizing without teacher forcing. I'm currently trying to determine the cause and to expose it as a user choice (despite the reading speed, everything is audible and less robotic).
Please note that using r=1 will cause a training and synthesis slowdown of about x2.5~x3. and batch size will probably need to be dropped down to 32.
I wonder if dropping the learning rate this fast is beneficial.. I'll do some experiments on that and tell you how it goes.
Note: Mel spectros will always be blurry when trained with a L2 loss, but it isn't a problem for the wavenet. paper reference: "However, when trained on ground truth features and made to synthesize from predicted features, the result is worse than the opposite. This is due to the tendency of the predicted spectrograms to be oversmoothed and less detailed than the ground truth – a consequence of the squared error loss optimized by the feature prediction network. When trained on ground truth spectrograms, the network does not learn to generate high quality speech waveforms from oversmoothed features."
So if you happen to have a Wavenet model pretrained on ground truth labels, it will most likely not give human-like speech quality. Retrain the vocoder instead on the GTA synthesized outputs of the frame prediction network.
I see. Thank you for your explanation. I'm afraid I have to look closely into it to make sure I understand your meaning :-)
I know I'm not the best at giving lectures or explaining stuff to people..
If there is anything I need to develop or explain further please let me know.
Thanks Rayhane-mamah, I tried both the THCHS-30 and LJSpeech datasets; the loss is still at 0.5x even at 60k+ steps, and in begeekmyfriend's plot it is 4.x even at 10k+, while in your 46000-step plot the loss is 0.23x. Is there anything we're missing to reach that? Set outputs_per_step=1?
In this last plot I provided I'm not using a reduction factor. i.e: set outputs_per_step=1
Training will be 2~3 times slower however and you will probably need to drop batch size to 32. (Everything is in hparams.py)
There's a bug related to this... if you set '--hparams "outputs_per_step=1"' on the command line (but don't change hparams.py), there will be a mismatched shape error later on. I fixed it by changing tacotron.py from
stop_projection = StopProjection(is_training, scope='stop_token_projection')
to
stop_projection = StopProjection(is_training, hp.outputs_per_step, scope='stop_token_projection')
because the hparams from the command line will not be seen in the StopProjection constructor. This doesn't fix the actual problem, though; I've had other issues where I needed to change hparams.py because the command-line hparams have no effect.
@dsmiller, thank you for reporting that, I will look into it and fix it for next commit.
In the meantime, I updated the repo with the exact params I'm currently using for training. (0b26fa19ceaf9465e8fa62982730a2b42829a8dd)
And still decreasing: (Let's hope we don't overfit :) )
I will upload the english pretrained model as soon as it converges.
Especially for Chinese, there won't be over-fitting in my opinion. Although the set of Chinese characters is enormous, the combinations of Chinese Pinyin are finite. Moreover, unlike English, the pronunciation of Chinese Pinyin vowels is nearly unique. That means we can establish a deterministic relation between Chinese pronunciation and Latin characters. Once the machine learns the relation, it will almost never mispredict the pronunciation. I would like to thank Youguang Zhou, the father of Chinese Pinyin, whose approach has brought huge advantages to Chinese Mandarin TTS ;-P
eval-112000.zip Terminal_train_log.zip Here are the latest evaluation samples at 112K steps and the training log. I have to say the results are amazing!
@Rayhane-mamah I think the hyperparameter predict_linear is important and necessary for the quality of audio synthesis. Below is the reference from the Tacotron paper:
Figures 4(a) and 4(b) demonstrate the benefit of using the post-processing net. We trained a model without the post-processing net while keeping all the other components untouched (except that the decoder RNN predicts linear-scale spectrogram). With more contextual information, the prediction from the post-processing net contains better resolved harmonics (e.g. higher harmonics between bins 100 and 400) and high frequency formant structure, which reduces synthesis artifacts.
@begeekmyfriend, I thought so too, but only if you're not willing to use wavenet as a vocoder.
In fact wavenet can take care of the small noise in the predicted mels. If however you want to invert the mel outputs directly then using the post processing network to predict linear spectros is the way to go.
Also, if you do use the post processing net, please make sure to invert the linear spectros when evaluating; audio quality will be much better than with mel inversion.
The only downside is the big slowdown that comes with the post processing net.. for the moment I only took the same architecture as keithito but I saw some alternatives in some other google works, I might try them out in the future.
But yeah to get a clean quality with just tacotron and griffin lim you should set predict_linear to True and invert wav from the linear spectros.
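For reference, here is a minimal sketch of inverting a linear magnitude spectrogram with Griffin-Lim through librosa (the repo ships its own audio utilities; the hop/window values and the 1.5 sharpening power here are assumptions, not the repo's exact settings):

import numpy as np
import librosa

def linear_to_wav(linear_mag, hop_length=275, win_length=1100,
                  power=1.5, n_iter=60):
    # linear_mag: [1 + fft_size // 2, frames] magnitude spectrogram (not dB).
    # hop_length/win_length are assumed values for 22050 Hz; match your hparams.
    S = np.power(linear_mag, power)    # mild sharpening before inversion
    return librosa.griffinlim(S, n_iter=n_iter, hop_length=hop_length,
                              win_length=win_length)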
An exception is thrown when I try to synthesize with linear prediction:
Traceback (most recent call last):
File "synthesize.py", line 33, in <module>
main()
File "synthesize.py", line 27, in main
tacotron_synthesize(args)
File "/home/leoma/Tacotron-2/tacotron/synthesize.py", line 76, in tacotron_synthesize
run_eval(args, checkpoint_path, output_dir)
File "/home/leoma/Tacotron-2/tacotron/synthesize.py", line 14, in run_eval
synth.load(checkpoint_path)
File "/home/leoma/Tacotron-2/tacotron/synthesizer.py", line 24, in load
self.model.initialize(inputs, input_lengths)
File "/home/leoma/Tacotron-2/tacotron/models/tacotron.py", line 40, in initialize
raise ValueError('Model is set to use post processing to predict linear spectrograms in training but no linear targets given!')
ValueError: Model is set to use post processing to predict linear spectrograms in training but no linear targets given!
Any results on the problem of the model moving too fast when r == 1? I see the same results (better prosody and cleaner speech, but too fast) and can run some experiments if you have ideas.
I am not sure, but it seemed like further training reduced that problem. It seems like the attention wants to move forward fast.
You can give me your opinion on the pretrained model I provided in the latest issue (I will add the link to it later).
All in all, it seems okay.. I did, however, try removing the cumulative attention weights and feeding only the previous weights to the attention computation. That only slowed down the attention learning process (alignments take too long to be learned and don't reach a good quality). It also does not reduce the speech speed; it only creates failure cases where the model gets stuck on some characters.
If you suspect something else to be the reason please let me know, I'll check it out.
Thanks @Rayhane-mamah . Nice work. Based on your code, I could also get some good Chinese synthesized samples. I can share some of them here: https://goo.gl/DiVNNz
I tried to predict linear. The new samples are here: https://goo.gl/8XhcsF I think it's better than mel.
@begeekmyfriend sorry for this super late answer!
You probably already took down the condition by now :) but in case you didn't here's a quick fix (it will be added to the repo next commit). In tacotron.py:
if not gta and self._hparams.predict_linear==True and linear_targets is None and mel_targets is not None:
raise ValueError('Model is set to use post processing to predict linear spectrograms in training but no linear targets given!')
@candlewill Thank you so much for sharing your results. I don't know Chinese so I'm gonna trust you when you say samples are good :)
It would be great if you could also accompany these samples with their input sequences. Thanks again for sharing!
@candlewill In your samples, the voice of each word seems too short compared with mine...
linear-eval-189000.zip Terminal_train_log_linear.zip
@Rayhane-mamah I have synthesized the evaluation from linear spectrograms. The result does not seem as good as the one from mel spectrograms. I am looking forward to your WaveNet implementation :-) By the way, if we want to predict linear spectrograms, we need to add some code below this line, as follows:
if hparams.predict_linear:
self.linear_outputs = self.model.linear_outputs
Yes, I think so. @begeekmyfriend, do you have any idea why this phenomenon occurs?
I think it is the quality of the dataset, since the code we share is the same. My private 12-hour recording is from a professional male news anchor whose voice is quite clear and pleasant. Despite this, the evaluation reads a bit faster than normal. In my opinion there is still something to adjust in the Tacotron model.
@Rayhane-mamah I think in hparams.py, num_freq should be set to 1025 to match the shape of linear_outputs and linear_targets, and fft_size should be set to 2048 for the STFT in the audio utilities, according to Keith Ito's implementation.
@begeekmyfriend, sorry for the late answer!
If you want the model to be faithful to keithito's work, you should make the changes you listed. However, I picked those hparams to have the same audio preprocessing as Wavenet..
I also don't like how mels look when fft_size=2048.. Also, it would be better to use frame_shift_ms and set hop_size to None if you make these changes. :)
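If you do go the keithito route, the changes would look roughly like this in hparams.py (a sketch; key names can differ between repo versions):

num_freq = 1025,          # linear bins = fft_size // 2 + 1
fft_size = 2048,
hop_size = None,          # derive the hop from frame_shift_ms instead
frame_shift_ms = 12.5,    # e.g. keithito's default shift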
I trained the model with sample_rate=48000 on a Chinese Mandarin dataset and got good samples, but the speed of the speech is a bit fast. Have you run into this problem? @begeekmyfriend @candlewill
@JK1532 You could share some audio samples so we can hear how "the speed of speech is a bit fast".
@begeekmyfriend Hello, I'd like to ask how the pinyin annotation is done. What is the format? Is there an example?
@sayyoume The format is the same as the pinyin annotation in THCHS-30; you can use python-pinyin to generate it.
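A minimal sketch with pypinyin, assuming Style.TONE3 appends the tone digit to each syllable like the THCHS-30 labels (verify against your own transcripts):

from pypinyin import lazy_pinyin, Style

text = '春运抢票高峰频现'
# Expected (assumed) output: chun1 yun4 qiang3 piao4 gao1 feng1 pin2 xian4
print(' '.join(lazy_pinyin(text, style=Style.TONE3)))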
@candlewill I saw there was no blur in the high frequencies of your samples. Would you please tell me how long your dataset was, and how long each sentence fed for training was?
@begeekmyfriend My corpus has about 10,000 sentences, almost 12 hours.
@candlewill How many steps did you train to get such results? It is strange that there is blur in the high frequencies of my spectrograms. The one below is your sample. By the way, is the Chinese Mandarin transcript formatted as follows?
据北京青年报报道,春运抢票高峰频现一票难求,“曲线”回家受追捧 ju4 bei3 jing1 qing1 nian2 bao4 bao4 dao4 , chun1 yun4 qiang3 piao4 gao1 feng1 pin2 xian4 yi1 piao4 nan2 qiu2 , qu1 xian4 hui2 jia1 shou4 zhui1 peng3
@candlewill Would you like to share your hyperparameters with me? Like outputs_per_step, tacotron_initial_learning_rate, tacotron_start_decay, tacotron_decay_steps, tacotron_decay_rate? Thank you! Below are mine:
outputs_per_step = 2,
...
tacotron_start_decay = 50000, #Step at which learning decay starts
tacotron_decay_steps = 40000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.2, #learning rate decay rate (UNDER TEST)
tacotron_initial_learning_rate = 1e-3, #starting learning rate
@begeekmyfriend Here are my hyperparameters:
outputs_per_step = 1,
...
tacotron_start_decay = 50000,
tacotron_decay_steps = 50000,
tacotron_decay_rate = 0.4,
tacotron_initial_learning_rate = 1e-3,
@candlewill What was the sample rate of your dataset? Have you tried 16KHz?
Hi, @begeekmyfriend, I ran my experiments with the THCHS-30 corpus. Why are my results after 200,000 steps still far worse than your 30,000-step results? (The speech can be read out, but the word segmentation is not quite right and it isn't very fluent. Also, the voice is not consistent between runs: this time it's a male voice, next time it may be a female voice, and not even the same female voice.) Is it because of the THCHS-30 corpus?
@DaisyHH Yes. This project does not support multi-speaker data, and 16 kHz audio does not work either; you need 22050 Hz (at least for this project).
@v-yunbin If you have Audition, you can compare with @candlewill's samples. In the figure below, your sample's spectrogram is blurry in the high-frequency region; this problem troubles me as well.
eval-30000.zip Here are the evaluation results of my training on a 12-hour Chinese Mandarin corpus. The voice sounds natural but still somewhat rough. The modification is available on the mandarin branch of my own repo. Thanks a lot for this work!