Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Evaluation on Chinese mandarin #18

Closed begeekmyfriend closed 6 years ago

begeekmyfriend commented 6 years ago

step-30000-align step-30000-pred-mel-spectrogram step-30000-real-mel-spectrogram eval-30000.zip

Here are the evaluation results of my training on a 12-hour Chinese Mandarin corpus. The voice sounds natural but still somewhat rough. My modifications are available on the mandarin branch of my own repo. Thanks a lot for this work!

Rayhane-mamah commented 6 years ago

Hello @begeekmyfriend, awesome results, thank you very much for sharing them!

Oh okay.. I didn't expect to find a 1-minute-long audio! I'm no Mandarin expert, but the overall results sound okay.

About the audio quality: as I always say, inverting mel spectrograms directly is challenging and will always leave that "robotic" noise in the result. I am currently working on integrating WaveNet as a vocoder, which will give human-like speech quality, so stay tuned!
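
(For reference, a minimal sketch of what inverting mel spectrograms directly looks like, using librosa's Griffin-Lim; the function and parameters here are illustrative assumptions, not this repo's exact audio code.)

import librosa

def mel_to_wav(mel_power, sr=22050, n_fft=1024, hop_length=256):
    # Approximately invert the mel filterbank to a linear power spectrogram,
    # then estimate the missing phase with Griffin-Lim; that phase estimate
    # is what produces the "robotic" artifacts discussed above.
    linear = librosa.feature.inverse.mel_to_stft(mel_power, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)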

Being the perfectionist I am, I will keep trying to improve the predicted mels even further; I won't settle for less than a perfect match with the real spectrograms :) (even though more training alone should improve quality a bit).

Awesome work! Feel free to share any further results, trained models, or even merge the branches to add Mandarin support if you like!

begeekmyfriend commented 6 years ago

The audio was concatenated from 12 separate wav files (12 sentences) with FFmpeg...

What I appreciate is that only a small dataset needs to be fed to your project to train an effective model. You know it is usually tough for many people to produce a long corpus, and the sanity check is valuable as well. I am going to train for more steps and will post my results then.

I would mark this result as a baseline among all the Tacotron projects I have ever seen, and I believe your work will help other authors improve theirs. Thanks again!

Rayhane-mamah commented 6 years ago

Yeah, I noticed the concatenation :)

Thank you sir for your support, I'm glad this work suits your needs.

Hopefully we can achieve better results in the upcoming days!

Looking forward to your results!

begeekmyfriend commented 6 years ago

Terminal_train_log.zip It seems the loss value oscillates during training at 100K+ steps. Is there any hyperparameter to adjust, such as the learning rate decay? ... By the way, the loss value dropped to 0.3 at 107K steps, after the oscillation.

Rayhane-mamah commented 6 years ago

Hello again @begeekmyfriend, thanks for reporting that.

I noticed these oscillations as well. After taking a second look at the learning rate decay function, I noticed the horrible mistake I was making in its computation.. Anyway, the learning rates were bigger than they were supposed to be, which explains the oscillations..

Crazy how a typo can make your life hell.. :) It's fixed now, and learning rate decay is set to start after 50k steps and reach its minimal value (1e-5) at 150k steps. (d28cfa9a77afc87902100bd5b2113fbb8541227e)

Here's what the new learning rate evolution looks like (lr vs training_steps): plot

Since the decay parameters were not specified in the T2 paper, I tried optimizing them for our case, so they might need some extra tweaking.
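
(For reference, a minimal TF1-style sketch of such a schedule; the function shape is an assumption for illustration, not necessarily the repo's exact implementation:)

import tensorflow as tf

def tacotron_learning_rate(global_step, init_lr=1e-3, final_lr=1e-5,
                           start_decay=50000, decay_end=150000):
    # Flat at init_lr until start_decay, then exponential decay tuned to hit
    # final_lr exactly at decay_end, and clipped there afterwards.
    lr = tf.train.exponential_decay(init_lr,
                                    tf.maximum(global_step - start_decay, 0),
                                    decay_end - start_decay,
                                    final_lr / init_lr)
    return tf.maximum(lr, final_lr)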

You will also notice faster attention learning and a faster loss drop.. From these two observations one could guess that model quality will improve; we'll find out once training is finished.

Note: I messed around with the preprocessing a bit; you may want to keep that in mind in case it affects your preprocessed data (I made changes to the indices of the feeder and metadata). I also took lowercasing out of the cleaners because I noticed it wasn't useful for the English cleaners. If you need that lowercasing for Mandarin, you can put it back in tacotron/utils/cleaners.py.

butterl commented 6 years ago

@begeekmyfriend Hi Leo, thanks for your work. Which dataset are you using, THCHS-30?

begeekmyfriend commented 6 years ago

No, I am using my private dataset. THCHS-30 is feasible as well.

Rayhane-mamah commented 6 years ago

Okay, tell me how it goes then; I'll be running a few tests on my own too. Hopefully we find the optimal scheme.

On Mon, 16 Apr 2018, 03:04 Leo Ma wrote:

@Rayhane-mamah I am trying the Noam scheme, as Keith Ito's Tacotron does. It is also mentioned in tensor2tensor: tensorflow/tensor2tensor#280 https://github.com/tensorflow/tensor2tensor/issues/280

begeekmyfriend commented 6 years ago

I have changed the learning rate decay hyperparameters as follows, since the loss value oscillates after 10K steps.

tacotron_start_decay = 10000, #Step at which learning rate decay starts
tacotron_decay_steps = 10000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.33, #learning rate decay rate (UNDER TEST)

begeekmyfriend commented 6 years ago

train.log.zip I think the loss value decreases very slowly in your Tacotron compared with Keith Ito's implementation. In train.log we can see avg_loss decreasing smoothly, while in Terminal_train_log it often stalls for a long time.

Rayhane-mamah commented 6 years ago

@begeekmyfriend, you provided logs from a previous repo version; did you upgrade to the latest one? I am seeing loss "explosions" in the log; I hope you're not referring to those when you say the loss is oscillating..

Just to be sure we're talking about the same thing, here's what loss oscillations look like to me: screenshot from 2018-04-17 07-26-34

After a smoothing of 0.93: screenshot from 2018-04-17 07-26-44

The overall loss is still decreasing, slowly. If by oscillations you are in fact referring to these explosions, you can set tacotron_scale_regularization = True in hparams.py; that should take care of it.

I also want to point out that T1 and T2 use different loss functions, and we do not rescale our mel outputs the same way (I normalize my mels to [-4, 4]).

screenshot from 2018-04-17 07-33-22

You can change this option back to [0, 1] by setting in hparams.py:

symmetric_mels = False,
max_abs_value = 1.,

ghost commented 6 years ago

@begeekmyfriend Thanks for your contribution. :) I am using THCHS-30 as the speech corpus here; however, I've found an "interesting" phenomenon:

Step    2500 [1.775 sec/step, loss=0.84299, avg_loss=0.84627]
Writing summary at step: 2500
Saving checkpoint to: logs-Tacotron/pretrained/model.ckpt-2500
Saving alignment, Mel-Spectrograms and griffin-lim inverted waveform..
Input at step 2500: tantwo huaone yitwo xianfour defive wangtwo shuone wentwo gaithree getwo youtwo yutwo faone shengone zaifour tangtwo shunfour zongone yongthree zhenone niantwo jianone suotwo yithree youfour beifour chengone weitwo yongthree zhenone getwo xinone~________________________________________________________________________________________

The pinyin tone digits (1, 2, 3, 4, 5) are automatically converted into the words one, two, three, four, five. :( The original input at step 2500 should be:

Input at step 2500: tan2 hua1 yi2 xian4... How can I correct this fatal input? :(

begeekmyfriend commented 6 years ago

@DavidAksnes It never happens to me...

Rayhane-mamah commented 6 years ago

Hello David,

Use basic_cleaners instead of english_cleaners in hparams.py.

For English, numbers are converted to normalized words, which is what mangles your tone digits.
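
(A minimal sketch of the difference, assuming keithito-style cleaners; the exact pipeline in this repo may differ since lowercasing was removed earlier in this thread.)

# In hparams.py:
cleaners = 'basic_cleaners',

# Roughly, in tacotron/utils/cleaners.py:
#   basic_cleaners:   collapses whitespace only -- tone digits survive
#   english_cleaners: also expands numbers and abbreviations to words,
#                     which is what turns "tan2" into "tantwo"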


begeekmyfriend commented 6 years ago

tensorboard image train.log.zip Above are the TensorBoard plot and the training logs for your Tacotron model and Keith Ito's. I have run your latest version with the learning rate settings as follows:

tacotron_start_decay = 50000, #Step at which learning rate decay starts
tacotron_decay_steps = 25000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.33, #learning rate decay rate (UNDER TEST)
tacotron_initial_learning_rate = 1e-3, #starting learning rate
tacotron_final_learning_rate = 1e-5, #minimal learning rate

I used the THCHS-30 dataset, about 34 hours long. We can see that the loss value is still very high. In my experience, a loss below 0.1 is an acceptable result, as Ito's Tacotron achieved.

butterl commented 6 years ago

I use @begeekmyfriend's patch for Chinese. My machine is slow and has only reached 20K steps so far (running since yesterday); the loss stays at 0.6x, and the same thing happens with @Rayhane-mamah's original repo (0.5x even at 60K+ steps).

Rayhane-mamah commented 6 years ago

@begeekmyfriend and @butterl, about the loss values: it would be preferable not to compare my work against keithito's. Let me explain why such a comparison isn't useful:

The Tacotron feature prediction network outputs mel spectrograms, which we can consider as matrices of shape [mel_frames, 80]. Take the example of [100, 80]: the model has to predict 8000 values, one [1, 80] vector at a time (or one [5, 80] matrix at a time if the reduction factor r=5).
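
(Spelling out the arithmetic:)

num_mels, mel_frames = 80, 100
total_values = mel_frames * num_mels   # 8000 mel values to predict
r = 5                                  # reduction factor
decoder_steps = mel_frames // r        # 20 decoder steps, each emitting a [5, 80] chunk
# with r = 1 the decoder runs all 100 steps, one [1, 80] frame at a time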

Here, I took the liberty of plotting two random distributions of 8000 points, with different means and standard deviations.

index

What I want to bring to your attention here is that when preprocessing the data, I deliberately create bigger distances between data points (in our case, the distance between brown and red). You might wonder why I do that; here's the reason.

If you want to scale your data to [0, 1] as in keithito's work, please make these changes in hparams.py:

symmetric_mels = False, max_abs_value = 1.,

Loss will drop to 3e-3.. The thing is, the loss value is absolute and is not normalized by the data norm or anything, so comparing losses from two differently distributed outputs is a bit of a stretch..

I also want to point out that by widening the output range to [-4, 4], we were able to reduce blurriness in the outputs, probably due to the added penalization in the L2 loss.
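
(A minimal sketch of the two scalings being compared; the function name and the min_level_db constant are assumptions for illustration, not necessarily this repo's exact code.)

import numpy as np

def normalize_mel(S_db, min_level_db=-100.0, max_abs_value=4.0, symmetric=True):
    # Map a dB-scale mel spectrogram to [0, 1] first
    x = np.clip((S_db - min_level_db) / -min_level_db, 0.0, 1.0)
    if symmetric:
        return x * 2.0 * max_abs_value - max_abs_value  # [-4, 4]: wider spread, larger absolute L2 loss
    return x * max_abs_value  # [0, 1] when max_abs_value = 1. (keithito-style)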

With this said, I also experienced better output quality with outputs_per_step=1 (i.e. no reduction factor: predict outputs one frame at a time). That's to be expected, since it's easier for the model to predict only the next mel frame rather than 5 frames at once.

step-46000-pred-mel-spectrogram

Attention is also working:

step-46000-align

I did notice, however, that the model is moving forward too fast, leading to fast reading when synthesizing without teacher forcing. I'm currently trying to determine the cause and make it a user choice (despite the reading speed, everything is audible and less robotic).

Please note that using r=1 causes a training and synthesis slowdown of about x2.5~x3, and batch size will probably need to be dropped to 32.

I wonder if dropping the learning rate this fast is beneficial.. I'll run some experiments on that and tell you how it goes.

Note: mel spectrograms will always be blurry when trained with an L2 loss, but that isn't a problem for the WaveNet. Paper reference: "However, when trained on ground truth features and made to synthesize from predicted features, the result is worse than the opposite. This is due to the tendency of the predicted spectrograms to be oversmoothed and less detailed than the ground truth – a consequence of the squared error loss optimized by the feature prediction network. When trained on ground truth spectrograms, the network does not learn to generate high quality speech waveforms from oversmoothed features."

So if you happen to have a WaveNet model pretrained on ground-truth labels, it will most likely not give human-like speech quality. Retrain the vocoder instead on the GTA-synthesized outputs of the frame prediction network.

begeekmyfriend commented 6 years ago

I see. Thank you for your explanation. I am afraid I'll have to look closely into it to make sure I understand your meaning :-)

Rayhane-mamah commented 6 years ago

I know I'm not the best at giving lectures or explaining stuff to people..

If there is anything I need to develop or explain further, please let me know.


butterl commented 6 years ago

Thanks @Rayhane-mamah. I tried both the THCHS-30 and LJSpeech datasets; the loss stays at 0.5x even past 60K steps, and in begeekmyfriend's plot it's 4.x even at 10K+, while in your 46000-step plot the loss is 0.23x. Is there anything we're missing to reach that? Set outputs_per_step=1?

Rayhane-mamah commented 6 years ago

In that last plot I provided, I'm not using a reduction factor, i.e. outputs_per_step=1 is set.

Training will be 2~3 times slower, however, and you will probably need to drop the batch size to 32. (Everything is in hparams.py.)


dsmiller commented 6 years ago

There's a bug related to this: if you set --hparams "outputs_per_step=1" on the command line (but don't change hparams.py), there will be a mismatched-shape error later on. I fixed it by changing tacotron.py from

stop_projection = StopProjection(is_training, scope='stop_token_projection')

to

stop_projection = StopProjection(is_training, hp.outputs_per_step, scope='stop_token_projection')

because the hparams passed on the command line are not seen in the StopProjection constructor. This doesn't fix the underlying problem, though; I've had other cases where I needed to change hparams.py because the command-line hparams had no effect.

Rayhane-mamah commented 6 years ago

@dsmiller, thank you for reporting that. I will look into it and fix it in the next commit.

In the meantime, I updated the repo to use the exact params I'm actually training with. (0b26fa19ceaf9465e8fa62982730a2b42829a8dd)

step-90500-pred-mel-spectrogram

And still decreasing: (Let's hope we don't overfit :) )

screenshot from 2018-04-20 22-31-08

I will upload the english pretrained model as soon as it converges.

begeekmyfriend commented 6 years ago

In particular, as for Chinese, there won't be overfitting in my opinion. Although the set of Chinese characters is enormous, the combinations of Chinese Pinyin are finite. Moreover, unlike English, the pronunciations of Chinese Pinyin vowels are nearly unique. That means we can build a deterministic mapping between Chinese pronunciation and Latin characters; once the machine remembers that mapping, it will almost never predict the pronunciation wrong. I would like to pay tribute here to Youguang Zhou, the father of Chinese Pinyin; his approach has brought huge advantages to Chinese Mandarin TTS ;-P

begeekmyfriend commented 6 years ago

eval-112000.zip Terminal_train_log.zip Here are the latest evaluation results at 112K steps and the training log. I have to say the results are amazing!

begeekmyfriend commented 6 years ago

@Rayhane-mamah I think the hyperparameter predict_linear is important and necessary for good audio synthesis quality. Below is the relevant passage from the Tacotron paper:

Figures 4(a) and 4(b) demonstrate the benefit of using the post-processing net. We trained a model without the post-processing net while keeping all the other components untouched (except that the decoder RNN predicts linear-scale spectrogram). With more contextual information, the prediction from the post-processing net contains better resolved harmonics (e.g. higher harmonics between bins 100 and 400) and high frequency formant structure, which reduces synthesis artifacts.

Rayhane-mamah commented 6 years ago

@begeekmyfriend, I thought so too, but only if you're not willing to use WaveNet as a vocoder.

In fact, WaveNet can take care of the small noise in the predicted mels. If, however, you want to invert the mel outputs directly, then using the post-processing network to predict linear spectrograms is the way to go.

Also, if you do use the post-processing net, please make sure to invert the linear spectrograms when evaluating; audio quality will be much better than with mel inversion.

The only downside is the big slowdown that comes with the post-processing net.. For the moment I just use the same architecture as keithito, but I saw some alternatives in other Google works, and I might try them out in the future.

But yeah, to get clean quality with just Tacotron and Griffin-Lim, you should set predict_linear to True and invert the wav from the linear spectrograms.
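
(A minimal sketch of that setup; only the predict_linear flag itself is confirmed by this thread, the comments are mine:)

# hparams.py
predict_linear = True,  # enable the post-processing net that predicts linear spectrograms

# At eval time, run Griffin-Lim on the predicted *linear* spectrogram,
# not on the mel, for noticeably cleaner audio.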

begeekmyfriend commented 6 years ago

An exception is thrown when I try to synthesize with linear prediction:

Traceback (most recent call last):
  File "synthesize.py", line 33, in <module>
    main()
  File "synthesize.py", line 27, in main
    tacotron_synthesize(args)
  File "/home/leoma/Tacotron-2/tacotron/synthesize.py", line 76, in tacotron_synthesize
    run_eval(args, checkpoint_path, output_dir)
  File "/home/leoma/Tacotron-2/tacotron/synthesize.py", line 14, in run_eval
    synth.load(checkpoint_path)
  File "/home/leoma/Tacotron-2/tacotron/synthesizer.py", line 24, in load
    self.model.initialize(inputs, input_lengths)
  File "/home/leoma/Tacotron-2/tacotron/models/tacotron.py", line 40, in initialize
    raise ValueError('Model is set to use post processing to predict linear spectrograms in training but no linear targets given!')
ValueError: Model is set to use post processing to predict linear spectrograms in training but no linear targets given!

dsmiller commented 6 years ago

Any results on the problem of the model moving too fast when r == 1? I see the same results (better prosody and cleaner speech, but too fast) and can run some experiments if you have ideas.

Rayhane-mamah commented 6 years ago

I am not sure, but it seems further training reduced that problem. The attention seems to want to move forward fast.

You can give me your opinion on the pretrained model I provided in the latest issue (I will add the link to it later).

All in all, it seems okay.. I tried, however, removing the cumulative attention weights and feeding only the previous weights to the attention computation. This only slowed down attention learning (alignments take too long to learn and don't reach good quality). It also does not reduce the speech speed; it only creates failure cases where the model gets stuck on some characters.
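
(In code terms, the variant I tried corresponds roughly to the following state update inside the location-sensitive attention; a sketch with assumed names, not the repo's exact code:)

# Per decoder step, the attention state carries the alignment history:
if cumulate_weights:
    next_state = previous_alignments + alignments  # running sum: location features
                                                   # see everything attended so far
else:
    next_state = alignments                        # previous step only: alignments
                                                   # learn slower, model can get stuck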

If you suspect something else to be the reason please let me know, I'll check it out.


candlewill commented 6 years ago

Thanks @Rayhane-mamah, nice work. Based on your code, I was also able to get some good Chinese synthesized samples. I can share some of them here: https://goo.gl/DiVNNz

I also tried predicting linear spectrograms. The new samples are here: https://goo.gl/8XhcsF I think they're better than the mel ones.

Rayhane-mamah commented 6 years ago

@begeekmyfriend sorry for this super late answer!

You probably already removed the condition by now :) but in case you didn't, here's a quick fix (it will be added to the repo in the next commit). In tacotron.py:

if not gta and self._hparams.predict_linear==True and linear_targets is None and mel_targets is not None:
    raise ValueError('Model is set to use post processing to predict linear spectrograms in training but no linear targets given!')

@candlewill Thank you so much for sharing your results. I don't know Chinese, so I'm going to trust you when you say the samples are good :)

It would be great if you could also accompany these samples with their input sequences. Thanks again for sharing!

begeekmyfriend commented 6 years ago

@candlewill In your samples, the voice of each word seems too short compared with mine...

linear-eval-189000.zip Terminal_train_log_linear.zip

@Rayhane-mamah I have synthesized the evaluation with linear spectrograms. The results seem no better than those from mel spectrograms. I am looking forward to your WaveNet implementation :-) By the way, if we want to predict linear spectrograms, we need to add some code below this line as follows:

if hparams.predict_linear:
    self.linear_outputs = self.model.linear_outputs

candlewill commented 6 years ago

Yes, I think so. @begeekmyfriend Do you have any idea why this phenomenon occurs?

begeekmyfriend commented 6 years ago

I think it is the quality of the dataset, since the code we share is the same. My private 12-hour recording is from a professional male news anchor whose voice is quite clear and charming. Despite this, the evaluation seems to read a bit faster than normal. In my opinion there is still something to adjust in the Tacotron model.

begeekmyfriend commented 6 years ago

@Rayhane-mamah I think that in hparams.py, num_freq should be set to 1025 for the shape of linear_outputs and linear_targets, and fft_size should be set to 2048 for the STFT in the audio utilities, following Keith Ito's implementation.

Rayhane-mamah commented 6 years ago

@begeekmyfriend, sorry for the late answer!

If you want the model to be faithful to keithito's work, you should make the changes you listed. I picked those hparams, however, to have the same audio preprocessing as WaveNet..

I also don't like how the mels look when fft_size=2048.. Also, it would be better to use frame_shift_ms and set hop_size to None if you make these changes. :)
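
(Spelling out the relationships in play; the 12.5 ms frame shift is keithito's value and is an assumption here:)

fft_size = 2048
num_freq = fft_size // 2 + 1                          # 1025 linear-spectrogram bins
sample_rate = 22050
frame_shift_ms = 12.5
hop_size = int(frame_shift_ms / 1000 * sample_rate)   # 275 samples when hop_size is None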

JK1532 commented 6 years ago

I trained the model with sample_rate=48000 on a Chinese Mandarin dataset and got good samples, but the speed of speech is a bit fast. Have you run into this problem? @begeekmyfriend @candlewill

butterl commented 6 years ago

@JK1532 Could you share some audio samples so we can hear how "the speed of speech is a bit fast"?

sayyoume commented 6 years ago

@begeekmyfriend Hi, I'd like to ask how the pinyin annotation is done. What is the format? Is there an example?

begeekmyfriend commented 6 years ago

@sayyoume The format is the pinyin annotation used in THCHS-30; you can use python-pinyin.
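
(A minimal sketch with the pypinyin package; Style.TONE3 gives the numbered-tone format used in THCHS-30, and the sample sentence is the one quoted later in this thread.)

from pypinyin import lazy_pinyin, Style

text = '据北京青年报报道'
# Style.TONE3 appends the tone digit to each syllable:
# ju4 bei3 jing1 qing1 nian2 bao4 bao4 dao4
print(' '.join(lazy_pinyin(text, style=Style.TONE3)))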

begeekmyfriend commented 6 years ago

@candlewill I saw there was no blur in the high frequencies of your samples. Would you please tell me how long your dataset was and how long each training sentence was?

candlewill commented 6 years ago

@begeekmyfriend My corpus has about 10,000 sentences, almost 12 hours.

begeekmyfriend commented 6 years ago

@candlewill How many steps did it take to train such results? It is strange that there is blur in the high frequencies of my spectrograms. The one below is your sample. By the way, is the Chinese Mandarin transcript formatted as follows?

据北京青年报报道,春运抢票高峰频现一票难求,“曲线”回家受追捧 ju4 bei3 jing1 qing1 nian2 bao4 bao4 dao4 , chun1 yun4 qiang3 piao4 gao1 feng1 pin2 xian4 yi1 piao4 nan2 qiu2 , qu1 xian4 hui2 jia1 shou4 zhui1 peng3

image image

begeekmyfriend commented 6 years ago

@candlewill Would you mind sharing your hyperparameters, such as outputs_per_step, tacotron_initial_learning_rate, tacotron_start_decay, tacotron_decay_steps, and tacotron_decay_rate? Thank you! Below are mine:

outputs_per_step = 2,
...
tacotron_start_decay = 50000, #Step at which learning decay starts
tacotron_decay_steps = 40000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.2, #learning rate decay rate (UNDER TEST)
tacotron_initial_learning_rate = 1e-3, #starting learning rate

image

candlewill commented 6 years ago

@begeekmyfriend Here is my hyper parameters

outputs_per_step = 1,
...
tacotron_start_decay = 50000,
tacotron_decay_steps = 50000,
tacotron_decay_rate = 0.4,
tacotron_initial_learning_rate = 1e-3,

begeekmyfriend commented 6 years ago

@candlewill What was the sample rate of your dataset? Have you tried 16KHz?

DaisyHH commented 6 years ago

Hi @begeekmyfriend, I ran an experiment with the THCHS-30 corpus. Why are my results after 200,000 steps still far worse than your 30,000-step results? It can read the text, but the word segmentation is not quite right, it isn't fluent, and the voice isn't consistent from run to run: this time a male voice, next time maybe a female voice, and not even the same female voice. Is it because of the THCHS-30 corpus?

begeekmyfriend commented 6 years ago

@DaisyHH Yes. This project doesn't support multi-speaker, and 16KHz audio doesn't work either; you need 22050Hz (at least for this project).

begeekmyfriend commented 6 years ago

@v-yunbin If you have Audition, you can compare against @candlewill's samples. In the images below, the high-frequency part of your sample's spectrogram is blurry; this problem has been troubling me as well. image image