Open ErfolgreichCharismatisch opened 6 years ago
check #183
I have trained Tacotron for 77,000 steps and WaveNet for 165,000 steps. wavenet-audio-speech-mel-00001.zip There is a bit of noise at the end; I'm trying to fix it. I'm working on the Korean language.
wavenet-audio-mel-7.zip Here is another example, which is in Korean.
Griffin-Lim result, in Mandarin Chinese. wav-100000-linear-16000.zip
I'm working on the Korean language.
Because it doesn't exist on your operating system or you don't like the voice?
All samples are quite impressive.
> Because it doesn't exist on your operating system or you don't like the voice?

Not sure about your question.

> Not sure about your question.
Do you use Tacotron because Korean doesn't exist on your operating system like Microsoft Mark, Hazel, David, or Zira, or because you just don't like the provided voices and want your own custom voice to speak Korean?
So you are asking why I use Tacotron to synthesize Korean speech? For the same reason people use Tacotron to synthesize English speech. Sometimes the provided voices are all that is left. Imagine someone has passed away and all you have are their recorded voices; now you want to hear their voice again.
I am working on the Korean language because I'm working at a Korean company.
@begeekmyfriend Is this generated from the WaveNet model, or from the Tacotron-1 model only? Would you give more information about it, please?
@lkfo415579 I used the Griffin-Lim synthesizer with Tacotron-2, including this PR: https://github.com/Rayhane-mamah/Tacotron-2/pull/170.
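For readers unfamiliar with the Griffin-Lim step mentioned above, here is a minimal, self-contained sketch of the algorithm: iteratively estimating a phase for a magnitude spectrogram so it can be inverted to audio. The parameters (`n_fft=512`, `hop=128`, 16 kHz) and the zero-phase initialization are illustrative assumptions, not the repository's actual settings.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, hop=128, fs=16000, n_iter=30):
    """Iteratively estimate a phase for a magnitude spectrogram."""
    # Start from an all-zero phase; each iteration resynthesizes audio,
    # re-analyzes it, and keeps only the new phase estimate.
    phase = np.ones_like(mag, dtype=complex)
    for _ in range(n_iter):
        _, x = istft(mag * phase, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, spec = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        phase = np.exp(1j * np.angle(spec))
    _, x = istft(mag * phase, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return x

# Demo: take the magnitude of a sine wave's STFT and reconstruct audio.
fs = 16000
t = np.arange(4096) / fs
sig = np.sin(2 * np.pi * 440 * t)
_, _, S = stft(sig, fs=fs, nperseg=512, noverlap=384)
rec = griffin_lim(np.abs(S))

# Spectral error between the target magnitude and the reconstruction.
_, _, S2 = stft(rec, fs=fs, nperseg=512, noverlap=384)
err = np.linalg.norm(np.abs(S2) - np.abs(S)) / np.linalg.norm(np.abs(S))
```

Tacotron-2 predicts only magnitudes, so some phase-recovery step like this (or the lws variant discussed later in the thread) is always needed before writing a wav.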
@Yeongtae How long does your model take to synthesize a batch of sentences? My model takes about 30-35 minutes to synthesize a batch of about 8 text sentences. Is that too slow?
Can some of you answer my question at https://github.com/Rayhane-mamah/Tacotron-2/issues/193 ?
> I have trained Tacotron for 77,000 steps and WaveNet for 165,000 steps. wavenet-audio-speech-mel-00001.zip There is a bit of noise at the end; I'm trying to fix it. I'm working on the Korean language.
@tdplaza
I trained T2 on a Korean DB, but the WaveNet training failed. Could you share your hparams settings for WaveNet? Or did you change anything in the latest code for WaveNet?
@Yeongtae
Hey, I'm training T2 with a Korean DB. If you can share your settings, I'd like to know about your Korean embedding for the text encoder and your hparams settings.
This is my hyperparams setting: hparams.zip @osungv I have made a few changes, not too many. You should post your problems so others can help.
@tdplaza
Thank you for uploading the files.
After I redo my experiment with your hparams, I'll post the remaining or resolved issues here.
I run my experiment with a 4-hour dataset. How large is yours? Could the small size of my dataset be the reason for my problem?
I used the KSS dataset: https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset. It is about 12 hours of data. Now I'm training WaveNet using only 1 hour of data, and this is the evaluated sound at step 55,000: step-55000-pred.zip
Not sure whether it will get better; just waiting.
@tdplaza
You mean that you used 12 hours of data for the spectrogram prediction network and 1 hour for WaveNet, right?
And I also have a question: did you train both models separately?
No, I mean I trained both Tacotron and WaveNet with the 12 hours of data; the result is quite good.
Now I'm trying both of them with less training data.
And yes, I trained them separately, because I just need them to run fine before constructing the end-to-end system.
@tdplaza
How many steps did you train the spectrogram prediction network and WaveNet for? I noticed that you used far fewer training steps than the default hparams.
@tdplaza Thank you for sharing your Korean model parameters. I understand that you are using 8-bit waveform output. When I tried this setting, the latest committed code produced a gradient explosion (the loss goes from 2.7 to 0.0000000...) and the program stops before the first checkpoint is saved. What version of the code are you using?
@tdplaza
I have a question about 'use_lws'. Is that an important parameter for training WaveNet?
@osungv https://ieeexplore.ieee.org/document/7572016/ The weighted sum is about signal decomposition, used in the STFT (I guess). I don't have expertise in the audio domain, but in my experience, using local weighted sums produces better-sounding results. The number of training steps is what I posted above.
@lkfo415579 Yes, I synthesize with WaveNet in batches of 8 inputs. I use the old Tacotron-2 version and the newest WaveNet.
@tdplaza
Thank you for your kindness.
I understand that 'lws' improves sound quality when generating a waveform from a linear or mel spectrogram. My point is whether 'lws' affects WaveNet's training and the model's output quality.
What do you think about that?
@tdplaza What is your average loss? Why does my WaveNet model explode every time I train it? Maybe my TensorFlow version is too low (1.6.0)? What version is yours?
> @Yeongtae How long does your model take to synthesize a batch of sentences? My model takes about 30-35 minutes to synthesize a batch of about 8 text sentences. Is that too slow?
WaveNet synthesis is very slow because WaveNet must predict 22,100 samples to generate 1 second of audio, and a V100 generates only about 100-300 samples per second.
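A back-of-the-envelope calculation with the figures quoted above makes the slowness concrete:

```python
# Figures from the comment above: ~22,100 samples per second of audio, and a
# V100 generating roughly 100-300 samples per second autoregressively.
samples_per_audio_second = 22100
gen_rate_worst, gen_rate_best = 100, 300

# Wall-clock seconds of compute needed per second of synthesized audio.
wall_time_worst = samples_per_audio_second / gen_rate_worst  # 221.0 s
wall_time_best = samples_per_audio_second / gen_rate_best    # ~73.7 s
```

So each second of audio costs roughly 74-221 seconds of compute, which is consistent with the 30-35 minutes reported earlier for a batch of about 8 sentences.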
@lkfo415579 My TensorFlow version is 1.9.0, but I don't think that is your problem. My Tacotron loss is ~0.20 and my WaveNet loss is ~1.5.
@osungv I have no idea; I think we need someone with knowledge of the sound and signal processing domain.
> Griffin-Lim result, in Mandarin Chinese. wav-100000-linear-16000.zip

May I ask what dataset you used for training?
@dream-will 1W (10,000) clips of Mandarin Chinese recordings. Only for trivial testing, not for commercial use.
```shell
wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar
```
> @dream-will 1W (10,000) clips of Mandarin Chinese recordings. Only for trivial testing, not for commercial use.
> wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar
@begeekmyfriend I'm a novice, thank you for your help. May I ask which text should be used: PhoneLabeling, ProsodyLabeling, or just the Chinese Pinyin?
I just use Pinyin. But note that the pinyin has no punctuation, while the Mandarin Chinese text does.
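One simple way to deal with that mismatch is to normalize the full-width punctuation in the Mandarin text to the ASCII forms a pinyin-based front end expects. This helper and its mapping table are assumptions for illustration, not part of the repository:

```python
# Hypothetical normalization: map common full-width Chinese punctuation to
# ASCII so text and pinyin transcripts share one punctuation inventory.
PUNCT_MAP = {
    "，": ",",  # full-width comma
    "。": ".",  # ideographic full stop
    "？": "?",
    "！": "!",
    "、": ",",  # enumeration comma
    "：": ":",
    "；": ";",
}

def normalize_punct(text: str) -> str:
    # Characters not in the table (including hanzi) pass through unchanged.
    return "".join(PUNCT_MAP.get(ch, ch) for ch in text)

normalized = normalize_punct("你好，世界。")
```

The normalized punctuation can then be appended to the pinyin targets so the model learns pause and intonation cues from it.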
```shell
wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar
```
Hi @begeekmyfriend, how did you get a male voice using biaobei? May I ask which branch of your repository can produce a voice like wav-100000-linear-16000.zip, mandarin-griifin or mandarin-new? Thanks!
That is my private corpus with copyright.
There are some samples in this video of my real-time voice cloning project: https://www.youtube.com/watch?v=-O_hYhToKoA
Can someone share a voice sample they created with this repository, based on a given and/or a custom set of voice files?