Rayhane-mamah / Tacotron-2

Google's Tacotron-2 TensorFlow implementation
MIT License
2.27k stars · 905 forks

Tacotron-2 plus WORLD vocoder #304

Closed begeekmyfriend closed 5 years ago

begeekmyfriend commented 5 years ago

Hey, I am glad to inform you that I have succeeded in merging the Tacotron model with the WORLD vocoder and have generated some evaluation results, attached below. The results do not sound bad, but they are still not perfect. However, this shows another way to train different feature parameters with Tacotron. The WORLD vocoder is an open source project, so everyone is free to use it. Moreover, the quality of resynthesis results from that vocoder is better than from Griffin-Lim, since the three features (lf0[1], mgc[60] and ap[5]) contain not only magnitude spectrograms but also phase information. Furthermore, the depth of the features is low enough that we do not need the postnet in the Tacotron model. Training time can be reduced to 0.7 seconds per step, and inference is quick enough even when it runs only on CPU. So it is really worth trying.
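For readers who want to reproduce the feature extraction, here is a minimal sketch using the pyworld and pysptk bindings; the mgc order, the alpha value and the band count of the coded aperiodicity are illustrative assumptions, not necessarily the exact settings of the linked branch:

```python
import numpy as np
import pyworld as pw
import pysptk

def extract_world_features(wav, fs, mgc_order=59, alpha=0.58):
    """Split a float64 waveform into the three WORLD feature streams."""
    f0, t = pw.dio(wav, fs)               # coarse F0 trajectory
    f0 = pw.stonemask(wav, f0, t, fs)     # F0 refinement
    sp = pw.cheaptrick(wav, f0, t, fs)    # smoothed spectral envelope
    ap = pw.d4c(wav, f0, t, fs)           # aperiodicity

    # lf0: log-F0, 1-dim (unvoiced frames need special handling in practice)
    lf0 = np.log(np.maximum(f0, 1e-10))[:, None]
    # mgc: mel-generalized cepstrum, order 59 -> 60 coefficients per frame
    mgc = pysptk.sp2mc(sp, order=mgc_order, alpha=alpha)
    # bap: coded band aperiodicity; the band count depends on the sample rate
    bap = pw.code_aperiodicity(ap, fs)
    return lf0, mgc, bap
```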

I would like to share my experimental source code with you. Note that it currently supports only Mandarin Chinese; you may modify it for other languages:

- tacotron-world-vocoder branch
- Python-Wrapper-for-World-Vocoder
- pysptk
- merlin-world-vocoder branch

By the way, for pysptk and the Python wrapper project you need to run python setup.py install and then copy the .so file manually into the system path.

Besides, I would also like to provide two Python scripts for a WORLD vocoder resynthesis test: world_vocoder_resynth_scripts.zip
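In the same hedged spirit, a resynthesis pass like the one those scripts perform might look roughly like this; fft_size and alpha must match whatever was used at extraction time:

```python
import numpy as np
import pyworld as pw
import pysptk

def world_resynth(lf0, mgc, bap, fs, fft_size=1024, alpha=0.58):
    """Decode lf0 / mgc / bap back to a waveform with WORLD."""
    f0 = np.exp(lf0[:, 0])                                # undo the log
    sp = pysptk.mc2sp(mgc, alpha=alpha, fftlen=fft_size)  # mgc -> spectral envelope
    ap = pw.decode_aperiodicity(np.ascontiguousarray(bap, dtype=np.float64),
                                fs, fft_size)
    return pw.synthesize(f0, sp, ap, fs)
```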

@Rayhane-mamah Let us rock with it! And @r9y9, thanks for your pysptk project. world_vocoder_demo.zip [image]

begeekmyfriend commented 5 years ago

It gets better with more training now. world_vocoder_demo.zip [image]

m-toman commented 5 years ago

This is great, also wanted to tackle this some time ago but was busy with other projects.

So you use mgc2sp (and the inverse) from SPTK, as in the Merlin project, and not the codec WORLD provides? (https://github.com/mmorise/World/blob/master/examples/codec_test/readandsynthesis.cpp)

I've tried the WORLD codec with Merlin and found that the MGC parameterization performed better (also, REAPER got rid of most of the V/UV errors), but I never dug deeply into the reason for it.

begeekmyfriend commented 5 years ago

I am using an early version of the WORLD vocoder source from Merlin, not the latest version in mmorise's repo, which seems to have difficulty passing the resynthesis test scripts provided by Merlin. I do not have deep insight into it yet. But I have forked my own modified early-version WORLD vocoder source in my repo, and it works for me.

m-toman commented 5 years ago

Thanks. Perhaps it's worth integrating the modifications into this repository... do you know where the critical differences between the keithito repo and this one are?

begeekmyfriend commented 5 years ago

Well, my fork of Ito's repo is just an experimental project for my own tests, and it is easy to modify thanks to its smaller codebase. I have ported some T2 code (e.g. location-sensitive attention, stop tokens, dropout, etc.) to my T1 fork to see what would happen. Generally speaking, there are few differing modules between these two repos. You might regard my T1 fork as a simplified version of this T2 project.

begeekmyfriend commented 5 years ago

Here is the implementation on my T2 fork branch, only for Mandarin Chinese: https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder

shartoo commented 5 years ago

Nice job, thanks for sharing!

QueenKeys commented 5 years ago

Here is the implementation on my T2 fork branch, only for Mandarin Chinese: https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder

Hello, may I ask: the last dimension of the bap feature extracted by pyworld is 1025, so do I need to change the bap parameter num_bap = 5 in hparams to num_bap = 1025?

begeekmyfriend commented 5 years ago

@QueenKeys I am using the WORLD vocoder from Merlin, not the latest version on the repo. So please run the following commands and you will get the right vocoder library.

git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
git submodule update --init
QueenKeys commented 5 years ago

@QueenKeys I am using the WORLD vocoder from Merlin, not the latest version on the repo. So please run the following commands and you will get the right vocoder library.

git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
git submodule update --init

When I type 'git submodule update --init', is this normal?

Submodule 'lib/World' (https://github.com/mmorise/World) registered for path 'lib/World'
Cloning into '/home/queen/document/Python-Wrapper-for-World-Vocoder/lib/World'...
Submodule path 'lib/World': checked out 'd7c03432d572c5a162edba9c611b3c8e367069a9'

begeekmyfriend commented 5 years ago

@QueenKeys You might use the world_vocoder_resynth_scripts.zip provided in the first comment to verify whether it has been installed successfully.

QueenKeys commented 5 years ago

@QueenKeys I am using the WORLD vocoder from Merlin, not the latest version on the repo. So please run the following commands and you will get the right vocoder library.

git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
git submodule update --init

I have completed the installation according to your instructions, but the following error still occurs when running train.py:

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/queen/下载/Tacotron-2-mandarin-world-vocoder/tacotron/feeder.py", line 173, in _enqueue_next_test_group
    self._session.run(self._eval_enqueue_op, feed_dict=feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (20, 1236, 513) for Tensor 'datafeeder/bap_targets:0', which has shape '(?, ?, 5)'

begeekmyfriend commented 5 years ago

@QueenKeys Did you check out the right branch, mandarin-world-vocoder, for this test?

QueenKeys commented 5 years ago

@QueenKeys Did you check out the right branch, mandarin-world-vocoder, for this test?

Yes. I have tested the dimensions of the three parameters lf0, mgc, and bap in Python-Wrapper-for-World-Vocoder, which are (, 1), (, 60), and (, 513). When I change num_bap = 5 to num_bap = 513 in the hparams.py file, train.py runs normally, but the parameters set in world-v2 in Merlin should be (, 1), (, 60), (, 5).

begeekmyfriend commented 5 years ago

I do not use world-v2.

QueenKeys commented 5 years ago

I do not use world-v2.

Hello, you may have misunderstood me. I didn't say that you are using world-v2, but you set num_bap to 5 in hparams.py, so I guessed that you might have set bap as in world-v2; otherwise I am not really sure why you set num_bap to 5.
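If it helps clarify the dimension question above: raw d4c aperiodicity has fft_size/2 + 1 bins per frame (hence 513 or 1025), while WORLD's coded band aperiodicity compresses that to a handful of bands whose count depends on the sample rate. A quick check, assuming the pyworld API:

```python
import pyworld as pw

for fs in (16000, 22050, 48000):
    # Number of coded bap bands WORLD uses at this sample rate;
    # pw.code_aperiodicity(ap, fs) produces exactly this many columns.
    print(fs, pw.get_num_aperiodicities(fs))
```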

OswaldoBornemann commented 5 years ago

@begeekmyfriend It seems that in your https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder, hparams.py does not have the CBHG parameters. Did I do something wrong?

begeekmyfriend commented 5 years ago

Hi everyone, I have upgraded the WORLD vocoder to the latest version, where we can use harvest instead of dio for F0 pitch extraction. The link address is still in the first comment. Any suggestion is welcome!
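For anyone swapping estimators, a small sketch of the difference with the pyworld bindings (the random waveform is just a stand-in for a real utterance):

```python
import numpy as np
import pyworld as pw

fs = 16000
wav = np.random.randn(fs).astype(np.float64)  # stand-in for a real utterance

# dio: fast, but tends to make more voiced/unvoiced errors
f0_dio, t = pw.dio(wav, fs)
f0_dio = pw.stonemask(wav, f0_dio, t, fs)  # refinement step dio usually needs

# harvest: slower, but usually yields a cleaner F0 contour
f0_harvest, t = pw.harvest(wav, fs)
```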

Edresson commented 5 years ago

Hi, I've been training the model for Portuguese. I have the same problem reported above with the LJ Speech dataset: during eval (during training, using teacher forcing) I get good results, but during synthesis the results are bad. My dataset has 10 hours of audio and I trained for approximately 272k steps. Is it necessary to train more, or is there a problem with the model?

Were the results reported here obtained during eval (during training, using teacher forcing)?

begeekmyfriend commented 5 years ago

@Edresson You need to check the alignment, like this: https://github.com/mozilla/TTS/issues/9#issuecomment-473743232

Edresson commented 5 years ago

@begeekmyfriend I upgraded the repository as described, but the network does not converge; I believe it may be overfitting. I tried Tacotron with Griffin-Lim and also did not get good results. Tacotron does not seem to converge on my own dataset. With my own dataset I managed to get good results with DCTTS, but with Tacotron the results are very bad. Do you have any suggestions?

begeekmyfriend commented 5 years ago

The Griffin-Lim branch is only for Mandarin Chinese. Did you change the dictionary for your own language? As for WORLD features, in my tests some datasets learn alignment quickly but others fail. I am still working on it.

Edresson commented 5 years ago

@begeekmyfriend Yes, I changed it, but I only trained the model for a few steps. I believe that with Griffin-Lim, Tacotron on my dataset converges only after many steps; using DCTTS I needed 2000k steps to get good results. For Tacotron-World the loss varies a lot during training, and it does not learn the alignment. I also tried using DCTTS with the WORLD vocoder, and I have the same problem: the model does not converge. If you get new results please let me know; I'm working on it too and will report any progress here.

begeekmyfriend commented 5 years ago

If it fails under the Griffin-Lim branch, it may well be that your dataset is not good enough for TTS.

Edresson commented 5 years ago

@begeekmyfriend I agree with you; however, DCTTS gets good results, and since Tacotron is more powerful I believe it needs more data. I will train a model with Griffin-Lim to check whether that is the problem.

begeekmyfriend commented 5 years ago

Latest commit: https://github.com/begeekmyfriend/Tacotron-2/commit/e40a7b73ac31d299d731439fbabe8921b231a739 Any feedback is welcome!

[image: step-13000-align]

superhg2012 commented 5 years ago

[image: step-20000-eval-align]

Hi @begeekmyfriend, I am running Tacotron2 + pyworld on the Biaobei (10000) TTS corpus. Why is my alignment result above not continuous?

begeekmyfriend commented 5 years ago

I forgot to tell you that for different datasets you should adjust hp.max_frame_num and hp.max_text_length, which guided attention uses to set the alignment slope, for better convergence.
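For readers unfamiliar with the term: guided attention (from the DCTTS paper) penalizes attention mass far from the diagonal, and the diagonal's slope is fixed by the text-length/frame-count ratio. A minimal numpy sketch, with N and T standing in for hp.max_text_length and hp.max_frame_num and g an assumed sharpness constant:

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """W[n, t] grows as the attention point (n/N, t/T) strays off the diagonal."""
    n = np.arange(N)[:, None] / N   # normalized text position
    t = np.arange(T)[None, :] / T   # normalized frame position
    return 1.0 - np.exp(-((t - n) ** 2) / (2.0 * g ** 2))

# Added to training as mean(W * alignment), so off-diagonal attention is punished.
W = guided_attention_weights(N=300, T=900)
```

The implied N:T slope should roughly match the real character-to-frame ratio of the corpus, which is why corpora with shorter clips need smaller values.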

superhg2012 commented 5 years ago

@begeekmyfriend I increased max_frame_num to 900 and kept max_text_length = 300. Should I increase max_text_length? What is the relation between these two params?

begeekmyfriend commented 5 years ago

@superhg2012 No, the Biaobei dataset contains shorter clips and texts. You need to reduce these lengths to reach the best N:T ratio.

superhg2012 commented 5 years ago

@begeekmyfriend I get it! thanks!!

superhg2012 commented 5 years ago

Hi @begeekmyfriend, during my T2 training, the eval stop_token loss is increasing while the train stop_token loss is decreasing. I found that my training corpus contains no punctuation, while some of the eval sentences in hparams contain punctuation. Is this the root cause?

Edresson commented 5 years ago

Hi @begeekmyfriend, the alignment looks good; however, when I run synthesize.py, I get audio with only noise and no speech. Did you get good results using synthesize.py? See the alignment images below.

During training: [image: train 57k alignment]
During eval (in training): [image: eval 57k alignment]

begeekmyfriend commented 5 years ago

Because the stop token loss has not dropped to zero. My tests are still in progress as well.

superhg2012 commented 5 years ago

Because the stop token loss has not dropped to zero. My tests are still in progress as well.

If I switch F0 estimation from dio to harvest, will that improve synthesis?

begeekmyfriend commented 5 years ago

You may test it with the resynthesis script: world_resynth.zip

begeekmyfriend commented 5 years ago

I found that MSE fits the WORLD features better than MAE does, because the value scales of lf0, mgc and bap differ, and the attention can be kept through the whole training. MAE works well for mel spectrograms because they contain only one kind of feature. See https://github.com/begeekmyfriend/Tacotron-2/commit/5863d5513ed34f94711a310d57722d0b1f990264
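A sketch of what that loss choice might look like in TF1-style code; the stream names and the unweighted sum are assumptions, not the exact code of the linked commit:

```python
import tensorflow as tf

def world_feature_loss(lf0_true, lf0_pred, mgc_true, mgc_pred, bap_true, bap_pred):
    """Per-stream MSE; squared error copes with the differing value
    scales of lf0 / mgc / bap better than one MAE over the concatenation."""
    return (tf.losses.mean_squared_error(lf0_true, lf0_pred)
            + tf.losses.mean_squared_error(mgc_true, mgc_pred)
            + tf.losses.mean_squared_error(bap_true, bap_pred))
```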

superhg2012 commented 5 years ago

@begeekmyfriend Can you share a better synthesized audio sample?

begeekmyfriend commented 5 years ago

The demos have been posted in the earlier comments that include spectrogram images. Maybe I need to reduce the frame period to obtain better quality. However, I was told that the fidelity of the samples is not as good as that from Griffin-Lim, even though the quality is indeed better.
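On the frame period point, a hedged pyworld example; analysis and synthesis must agree on the hop, and WORLD's default is 5 ms (smaller values raise temporal resolution at the cost of compute):

```python
import numpy as np
import pyworld as pw

fs = 16000
wav = np.random.randn(fs).astype(np.float64)  # stand-in for a real utterance

frame_period = 5.0  # milliseconds; try smaller values for finer resolution
f0, t = pw.harvest(wav, fs, frame_period=frame_period)
sp = pw.cheaptrick(wav, f0, t, fs)
ap = pw.d4c(wav, f0, t, fs)
out = pw.synthesize(f0, sp, ap, fs, frame_period)
```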

begeekmyfriend commented 5 years ago

By the way, if you want to hear complete synthesized samples, please wait until the stop token loss has dropped to zero.

begeekmyfriend commented 5 years ago

As for alignment, remember to adapt your max_text_length and max_frame_num to the best N:T ratio, which depends on your dataset.

begeekmyfriend commented 5 years ago

[image]

mrgloom commented 5 years ago

@begeekmyfriend Is a pretrained model compatible with this repo (https://github.com/begeekmyfriend/tacotron/tree/mandarin-world-vocoder) available for testing?

begeekmyfriend commented 5 years ago

Incompatible. In fact, maintaining both of those Tacotron projects would exhaust me, so I am focused on my T2 fork currently.

superhg2012 commented 5 years ago

@begeekmyfriend Your learning curve is better than mine, great!!

begeekmyfriend commented 5 years ago

Here is a Biaobei Mandarin demo from T2 + WORLD. Predicting the F0 feature values is tough for this model. xmly_biaobei_world.zip

sujeendran commented 5 years ago

Well, my fork of Ito's repo is just an experimental project for my own tests, and it is easy to modify thanks to its smaller codebase. I have ported some T2 code (e.g. location-sensitive attention, stop tokens, dropout, etc.) to my T1 fork to see what would happen. Generally speaking, there are few differing modules between these two repos. You might regard my T1 fork as a simplified version of this T2 project.

@begeekmyfriend Can you please tell me what modifications/steps are required in the current version of your T1 repo to make it run with the LJ Speech dataset? I am sorry I am asking this now, but most of the steps are scattered across the previous comments, and I thought it would be helpful for others to have them in one place too. Also, does this T1 version run with your updated WORLD vocoder repo?

begeekmyfriend commented 5 years ago

The T1 modification is just a trivial version. I am focused on T2 currently.

sujeendran commented 5 years ago

@begeekmyfriend Thanks for the response. In that case, do you mind giving a concise list of the steps required for the LJ Speech dataset?

begeekmyfriend commented 5 years ago

https://github.com/begeekmyfriend/Tacotron-2/tree/griffin-lim