Closed begeekmyfriend closed 5 years ago
It gets better with more training now. world_vocoder_demo.zip
This is great, also wanted to tackle this some time ago but was busy with other projects.
So you use mgc2sp and vice versa from SPTK as in the Merlin project and not the codec WORLD provides? (https://github.com/mmorise/World/blob/master/examples/codec_test/readandsynthesis.cpp)
I've tried the WORLD codec with Merlin and I found that the MGC parameterization performed better (also REAPER got rid of most of the V/UV errors) but I never dug deep into the reason for it.
I am using an early version of the WORLD vocoder source from Merlin, not the latest version in mmorise's repo, which seems to have difficulty passing the resynth test scripts provided by Merlin. I do not have deep insight into the reason yet. But I have forked my own modified early-version WORLD vocoder source into my repo, and it works for me.
Thanks. Perhaps it is worth integrating the modifications into this repository... do you know where the critical differences between the keithito repo and this one are?
Well, my fork of Ito's repo is just an experimental project for my own tests, and it is easy to modify thanks to its smaller codebase. I have ported some T2 code (e.g. location-sensitive attention, stop tokens, dropout, etc.) to my T1 fork to see what would happen. Generally speaking, there are few modules that differ between the two repos. You might regard my T1 fork as a simplified version of this T2 project.
Here is the implementation on my T2 fork branch, currently only for Chinese Mandarin: https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder
Nice job, thanks for sharing!
Hello, may I ask: the last dimension of the bap feature extracted by pyworld is 1025, so do I need to change the bap parameter num_bap = 5 in hparams to num_bap = 1025?
@QueenKeys I am using the WORLD vocoder from Merlin, not the latest version in the repo. Please run the following commands to get the right vocoder library (the submodule update must be run inside the cloned directory):
git clone https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder.git
cd Python-Wrapper-for-World-Vocoder
git submodule update --init
When I type 'git submodule update --init', is this output normal?
Submodule 'lib/World' (https://github.com/mmorise/World) registered for path 'lib/World'
Cloning into '/home/queen/document/Python-Wrapper-for-World-Vocoder/lib/World'...
Submodule path 'lib/World': checked out 'd7c03432d572c5a162edba9c611b3c8e367069a9'
@QueenKeys You might use world_vocoder_resynth_scripts.zip provided in the first post to verify whether it has been installed successfully.
I have completed the installation according to your instructions. The following error still occurs when running train.py:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/queen/下载/Tacotron-2-mandarin-world-vocoder/tacotron/feeder.py", line 173, in _enqueue_next_test_group
    self._session.run(self._eval_enqueue_op, feed_dict=feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (20, 1236, 513) for Tensor 'datafeeder/bap_targets:0', which has shape '(?, ?, 5)'
@QueenKeys Did you check out the right branch, mandarin-world-vocoder, for this test?
@QueenKeys Did you check out the right branch mandarin-world-vocoder for this test?

Yes, I have tested the dimensions of the three parameters lf0, mgc, and bap in Python-Wrapper-for-World-Vocoder, which are (, 1), (, 60), (, 513). When I change num_bap = 5 in hparams.py to num_bap = 513, train.py runs normally, but the parameters set for world-v2 in Merlin should be (, 1), (, 60), (, 5).
I do not use world-v2.
Hello, you may have misunderstood me. I didn't say that you are using world-v2, but you set num_bap to 5 in hparams.py, so I guess it is possible that you set bap as in world-v2; otherwise I am not really sure why you set num_bap to 5?
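For context on where num_bap = 5 can come from: WORLD does not have to feed the raw (fft_size/2 + 1)-dimensional aperiodicity (e.g. 513 or 1025 bins) to the model; it can code it down to a few band aperiodicities. Below is a sketch of the band count, mirroring (to my understanding) WORLD's GetNumberOfAperiodicities(); the constant names are mine.

```python
def num_bap_bands(fs):
    """Band-aperiodicity count used when coding WORLD's aperiodicity.

    One coded band per 3 kHz, capped at 15 kHz (so at most 5 bands),
    following WORLD's GetNumberOfAperiodicities().
    """
    UPPER_LIMIT_HZ = 15000.0   # highest frequency WORLD codes bands for
    BAND_WIDTH_HZ = 3000.0     # one coded band per 3 kHz
    return int(min(UPPER_LIMIT_HZ, fs / 2.0 - BAND_WIDTH_HZ) / BAND_WIDTH_HZ)

print(num_bap_bands(48000))  # -> 5, i.e. num_bap = 5
print(num_bap_bands(16000))  # -> 1
```

So num_bap = 5 corresponds to coded band aperiodicity at 48 kHz audio, while a 513- or 1025-dimensional bap suggests the raw aperiodicity was stored uncoded.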
@begeekmyfriend It seems that in your https://github.com/begeekmyfriend/Tacotron-2/tree/mandarin-world-vocoder, the hparams.py does not have the cbhg parameters. Did I do something wrong?
Hi everyone, I have upgraded the WORLD vocoder to the latest version, where we can use harvest instead of dio for F0 pitch extraction. The link is still in the first post. Any suggestion is welcome!
Hi, I've been training the model for Portuguese. I have the same problem reported above with the LJ Speech dataset: during eval (during training, using the Tacotron teacher forcing) I get good results, but during synthesis the results are bad. My dataset has 10 hours of audio and I trained for approximately 272k steps. Is it necessary to train more, or is there a problem with the model?
Were the results reported here obtained during eval (during training, using the Tacotron teacher forcing)?
@Edresson You need to check the alignment like this: https://github.com/mozilla/TTS/issues/9#issuecomment-473743232
@begeekmyfriend I upgraded the repository as described, but the network does not converge; I believe it may be overfitting. I tried Tacotron with Griffin-Lim and also did not get good results. Tacotron does not seem to converge on my own dataset. With my own dataset I managed to get good results with DCTTS, but with Tacotron the results are very bad. Do you have any suggestions?
The Griffin-Lim branch is only for Chinese Mandarin. Did you change the dictionary to your own?
As for WORLD features in my tests: for some datasets it can learn alignment quickly, but for others it fails. I am still working on it.
@begeekmyfriend Yes, I changed it, but I only trained the model for a few steps. I believe that with Griffin-Lim, Tacotron would converge on my dataset after many steps; using DCTTS I needed 2000k steps to get good results. For Tacotron + WORLD the loss varies a lot during training and the model does not learn the alignment. I also tried using DCTTS with the WORLD vocoder and I have the same problem: the model does not converge. If you get new results please let me know; I'm working on it too, and I will report any progress here.
If it fails under the griffin-lim branch, it may well be that your dataset is not good enough for TTS.
@begeekmyfriend I agree with you; however, DCTTS gets good results, and as Tacotron is more powerful I believe it needs more data. I will train a model with Griffin-Lim to check whether that is the problem.
Latest commit https://github.com/begeekmyfriend/Tacotron-2/commit/e40a7b73ac31d299d731439fbabe8921b231a739 Any feedback is welcome!
Hi, @begeekmyfriend, I am running Tacotron2 + pyworld on the Biaobei (10000-sentence) TTS corpus. Why is my alignment result above not continuous?
I forgot to tell you that for different datasets you should adjust hp.max_frame_num and hp.max_text_length, which guided attention uses as the alignment slope, for better convergence.
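For readers unfamiliar with the guided attention mentioned here: it penalizes attention mass far from a diagonal whose slope is set by the text length N and frame count T, which is why their ratio matters. A minimal sketch of the penalty matrix, assuming the formulation from the DCTTS paper (the width g = 0.2 is my illustrative choice, not this repo's value):

```python
import numpy as np

def guided_attention_mask(N, T, g=0.2):
    # W[n, t] grows as the normalized point (n/N, t/T) moves off the
    # diagonal, pushing attention toward a roughly linear
    # text-to-frame alignment.
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

W = guided_attention_mask(50, 200)
print(W.shape)       # (50, 200)
print(W[0, 0])       # ~0: on-diagonal positions are barely penalized
print(W[0, -1])      # ~1: far off-diagonal positions are penalized hard
```

During training the attention matrix is multiplied elementwise by W and the mean added to the loss, so if N and T in hparams are far from your dataset's real text/frame lengths, the assumed slope is wrong and convergence suffers.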
@begeekmyfriend I increased max_frame_num to 900 and kept max_text_length = 300. Should I increase max_text_length? What is the relation between these two params?
@superhg2012 No, the Biaobei dataset contains shorter clips and texts. You need to reduce these lengths to get the best N:T ratio for your data.
@begeekmyfriend I get it! thanks!!
Hi, @begeekmyfriend. During my T2 training, the eval stop_token loss is increasing while the train stop_token loss is decreasing. I found that my training corpus contains no punctuation, while some of the eval sentences in hparams contain punctuation. Is this the root cause?
Hi @begeekmyfriend, the alignment looks good; however, when I run synthesize.py I get audio with only noise and no speech. Did you get good results using synthesize.py? See below the images of the alignments.
During training:
During eval (in training):
Because the stop token loss did not reduce to zero. My test is still undergoing as well.
Does switching F0 estimation from dio to harvest improve synthesis?
You may test it with the resynth script. world_resynth.zip
I found that MSE fits WORLD features better than MAE does, because the value scales of lf0, mgc and bap differ. And the attention can be kept throughout the whole training. MAE works well for mel spectrograms because they contain only one kind of feature. See https://github.com/begeekmyfriend/Tacotron-2/commit/5863d5513ed34f94711a310d57722d0b1f990264
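The point about heterogeneous streams can be made concrete as a sum of per-stream losses. This is an illustrative sketch, not the repo's actual loss code; the (1, 60, 5) layout is the lf0/mgc/bap split discussed in this thread.

```python
import numpy as np

def world_feature_loss(pred, target, dims=(1, 60, 5)):
    """Sum of per-stream MSEs over (lf0, mgc, bap) slices of each frame.

    Squaring the error weights streams with larger value scales (mgc)
    more heavily than MAE would, which is the behavior discussed above.
    """
    losses = []
    start = 0
    for d in dims:
        p, t = pred[:, start:start + d], target[:, start:start + d]
        losses.append(np.mean((p - t) ** 2))
        start += d
    return sum(losses)

pred = np.zeros((10, 66))
target = np.ones((10, 66))
print(world_feature_loss(pred, target))  # 3.0: each stream contributes MSE 1.0
```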
@begeekmyfriend can you share a better synthesized sample audio?
The demo has been posted earlier in this thread, where there is a spectrogram graph. Maybe I need to reduce the frame period to obtain better quality. However, I was told that the fidelity of the samples is not as good as those from G&L, even though the quality is better indeed.
By the way, when you want to hear complete synthesized samples, please wait until the stop token loss has reduced to zero.
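The reason this matters at inference time: decoding stops when the predicted stop probability crosses a threshold, so an untrained stop token never fires and synthesis runs to the frame cap with a garbage tail. A hypothetical decoding-loop sketch (names, threshold, and the toy step function are mine, not the repo's code):

```python
def decode(step_fn, max_frames=1000, stop_threshold=0.5):
    """Run a decoder step function until it signals stop.

    step_fn(i) stands in for one decoder step and returns
    (frame, stop_prob).
    """
    frames = []
    for i in range(max_frames):
        frame, stop_prob = step_fn(i)
        frames.append(frame)
        # With an untrained stop token this branch is never taken,
        # so the loop always runs to max_frames.
        if stop_prob > stop_threshold:
            break
    return frames

# Toy step function: emits a confident stop on the 10th frame.
out = decode(lambda i: (i, 1.0 if i >= 9 else 0.0))
print(len(out))  # 10
```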
As for alignment, remember to adapt your max_text_length and max_frame_num to the best N:T ratio, which depends on your dataset.
@begeekmyfriend Is a pretrained model compatible with this repo https://github.com/begeekmyfriend/tacotron/tree/mandarin-world-vocoder available for testing?
Incompatible. In fact, maintaining both of those Tacotron projects would exhaust me, so I am focused on my T2 fork currently.
@begeekmyfriend your learning curve is better than mine, great!!
Here is a Biaobei Mandarin demo from T2 + WORLD. The F0 feature value prediction is tough for this model. xmly_biaobei_world.zip
@begeekmyfriend Can you please tell me what modifications/steps are required in the current version of your T1 repo to make it run with the LJ Speech dataset? I am sorry I am asking this now, but most of the steps are mixed up in the previous comments, and I thought it would be helpful for others to have them in one place too. Also, does this T1 version run with your updated WORLD vocoder repo?
The T1 modification is just a trivial version. I am focused on T2 currently.
@begeekmyfriend Thanks for the response. In that case, do you mind giving a concise list of the steps required for the LJ Speech dataset?
Hey, I am glad to inform you that I have succeeded in merging the Tacotron model with the WORLD vocoder and generated some evaluation results as follows. The results sound not bad, but still not perfect. However, it shows another way to train different feature parameters with Tacotron. The WORLD vocoder is an open source project, so everyone can use it. Moreover, the quality of resynthesis results from that vocoder is better than that from Griffin-Lim, since the three features (lf0[1], mgc[60] and ap[5]) contain not only magnitude spectrograms but also phase information. Furthermore, the depth of the features is low enough that we do not need a postnet for the Tacotron model. Training time can be reduced to 0.7 seconds per step. Inference can also be quick enough even when it only runs on the CPU. So it is really worth trying.
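To make the feature layout above concrete: lf0[1] + mgc[60] + ap[5] gives only 66 values per frame, which is the shallow target that makes a postnet unnecessary. A sketch of assembling per-frame targets from the three streams (the stream contents here are dummy zeros; only the shapes follow the thread):

```python
import numpy as np

T = 120                      # number of analysis frames
lf0 = np.zeros((T, 1))       # (interpolated) log-F0, one value per frame
mgc = np.zeros((T, 60))      # mel-generalized cepstrum, order 59
bap = np.zeros((T, 5))       # coded band aperiodicities

# 66-dim target per frame, concatenated along the feature axis.
targets = np.concatenate([lf0, mgc, bap], axis=1)
print(targets.shape)  # (120, 66)
```

At synthesis time the model's 66-dim output is split back into the three streams and handed to the WORLD synthesizer.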
I would like to share my experimental source code with you as follows. Note that it is currently only for Chinese Mandarin; you may modify it for other languages:
tacotron-world-vocoder branch
Python-Wrapper-for-World-Vocoder
pysptk
merlin-world-vocoder branch
By the way, for pysptk and the Python wrapper project you need to run
python setup.py install
and then copy the .so file manually into the system path. Besides, I would also like to provide two Python scripts for the WORLD vocoder resynth test: world_vocoder_resynth_scripts.zip
@Rayhane-mamah Let us rock with it! And @r9y9, thanks for your pysptk project. world_vocoder_demo.zip