Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Share a voice sample you created with this repository #191

Open ErfolgreichCharismatisch opened 6 years ago

ErfolgreichCharismatisch commented 6 years ago

Can someone share a voice sample they created with this repository, based on a given and/or a custom set of voice files?

Yeongtae commented 6 years ago

check #183

Thien223 commented 6 years ago

I have trained Tacotron for 77,000 steps and WaveNet for 165,000 steps: wavenet-audio-speech-mel-00001.zip There is a bit of noise at the end; trying to fix this. I'm working on the Korean language.

Yeongtae commented 6 years ago

wavenet-audio-mel-7.zip Here is another example, also in Korean.

begeekmyfriend commented 6 years ago

G&L result, in Chinese mandarin. wav-100000-linear-16000.zip

ErfolgreichCharismatisch commented 6 years ago
  1. Are the sentences in the training sets or are those new sentences?
  2. Do you have english or german samples?
  3. How many hours of input audio do you use?

ErfolgreichCharismatisch commented 6 years ago

> I'm working on Korean Language.

Because it doesn't exist on your operating system, or because you don't like the voice?

All samples are quite impressive.

Thien223 commented 6 years ago

> Because it doesn't exist on your operating system or you don't like the voice?

Not sure about your question.

ErfolgreichCharismatisch commented 6 years ago

> Not sure about your question

Do you use Tacotron because Korean doesn't exist on your operating system (like Microsoft Mark, Hazel, David, or Zira), or do you just not like the provided voices and want your own custom voice to speak Korean?

Thien223 commented 6 years ago

So you are asking why I use Tacotron to synthesize Korean speech? For the same reason people use Tacotron to synthesize English speech. Sometimes the provided voices are all that is left: think about someone who has passed away, and all you have is their recorded voice. Now you want to hear that voice again.

I am working on the Korean language because I'm working at a Korean company.

lkfo415579 commented 6 years ago

@begeekmyfriend is this generated from wavenet model? or Tacotron-1 model only? would you give more information about it plz?

begeekmyfriend commented 6 years ago

@lkfo415579 I used Griffin Lim synthesizer with Tacotron-2 and this PR included https://github.com/Rayhane-mamah/Tacotron-2/pull/170.
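For readers unfamiliar with the G&L results discussed above: Griffin-Lim iteratively estimates a phase that is consistent with a given magnitude spectrogram, then inverts the STFT. Below is a textbook NumPy sketch of the algorithm, not the repo's implementation; the `n_fft`/`hop` values are arbitrary placeholders.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT via a sliding Hann window."""
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(x[i:i + n_fft] * win)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames).T            # shape: (n_fft//2 + 1, n_frames)

def istft(spec, n_fft=512, hop=128):
    """Weighted overlap-add inverse STFT."""
    win = np.hanning(n_fft)
    n_frames = spec.shape[1]
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        frame = np.fft.irfft(spec[:, i], n_fft) * win
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)  # normalize overlapping windows

def griffin_lim(mag, n_iter=30, n_fft=512, hop=128):
    """Iteratively refine a random phase to match the magnitude."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles, n_fft, hop)
        angles = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * angles, n_fft, hop)
```

In practice the repo applies this to the predicted linear spectrogram (after undoing any normalization), and more iterations generally trade speed for quality.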

Thien223 commented 6 years ago

@Yeongtae How long does your model take to synthesize a batch of sentences? My model takes about 30-35 minutes to synthesize a batch of text sentences (about 8 sentences). Is it too slow?

ErfolgreichCharismatisch commented 6 years ago

Can some of you answer my question at https://github.com/Rayhane-mamah/Tacotron-2/issues/193 ?

osungv commented 6 years ago

> I have trained Tacotron with 77,000 steps, wavenet with 165,000 steps wavenet-audio-speech-mel-00001.zip There is a bit noise at the end. Trying to fix this. I'm working on Korean Language.

@tdplaza

I worked T2 based on Korean DB but I failed the WaveNet training. Could you let me know about your hparams settings for wavenet? or Is there anything that you change in the latest codes for waveNet?

osungv commented 6 years ago

@Yeongtae

Hey, I'm working T2 with Korean DB. If you can share your settings, I want to know about your korean embedding for text encoder and hparams settings.

Thien223 commented 6 years ago

This is my hyperparameter setting: hparams.zip @osungv I have made a few changes, not too many. You should post your problems so that others can help.

osungv commented 6 years ago

@tdplaza

Thank you for your uploading files.

After I run a modified experiment with your hparams, I'll post the remaining or resolved issues here.

I run my experiments with a 4-hour dataset. How large is yours? Could the small size of my dataset be the reason for my problems?

Thien223 commented 6 years ago

I used the KSS dataset: https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset. It is about 12 hours of data. Now I'm training WaveNet using only 1 hour of data, and this is the evaluated sound at time step 55,000: step-55000-pred.zip

Not sure whether it will get better; just waiting.
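Since dataset size keeps coming up in this thread (4 hours vs. 12 hours), total duration is easy to measure directly from the wav files. A stdlib-only sketch; the directory path and function name are hypothetical, not part of the repo:

```python
import pathlib
import wave

def total_hours(wav_dir):
    """Sum the duration of every .wav file under wav_dir, in hours."""
    total_seconds = 0.0
    for path in pathlib.Path(wav_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as w:
            # frames / sample_rate = duration of this clip in seconds
            total_seconds += w.getnframes() / w.getframerate()
    return total_seconds / 3600.0
```

Running this over a corpus before preprocessing makes "12 hours" a measured number rather than a guess.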

osungv commented 6 years ago

@tdplaza

It means that you used 12 hours of data for the spectrogram prediction network and 1 hour for WaveNet, right?

And I also have a question: did you train both models separately?

Thien223 commented 6 years ago

No, I mean I have trained both Tacotron and WaveNet with 12 hours of data; the result is quite good.

Now I'm trying both of them with less training data.

And yes, I trained them separately, because I just need them to run fine before constructing the end-to-end system.

osungv commented 6 years ago

@tdplaza

How many steps did you train the spectrogram prediction network and WaveNet for? I notice that you used far fewer training steps than the default hparams.

lkfo415579 commented 6 years ago

@tdplaza Thank you for sharing your Korean model parameters. I understand that you are using 8-bit waveform output; when I tried this setting, the latest committed code produced a gradient explosion (loss goes from 2.7 -> 0.0000000...), and the program stops before the first checkpoint is saved. What version of the code are you using?
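For context on "8-bit waveform output": in the original WaveNet formulation this usually means mu-law companding the waveform into 256 discrete classes before the softmax. A NumPy sketch of the transform, independent of this repo's code (the helper names are my own):

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Map float audio in [-1, 1] to integer classes in [0, mu]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(c, mu=255):
    """Inverse transform: classes in [0, mu] back to floats in [-1, 1]."""
    y = 2 * (c.astype(np.float64) / mu) - 1
    return np.sign(y) * (np.power(1 + mu, np.abs(y)) - 1) / mu
```

The companding concentrates quantization levels near zero, where speech has most of its energy, which is why 8 bits suffice for intelligible output.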

osungv commented 6 years ago

@tdplaza

I have a question about 'use_lws'. Is that an important parameter for training WaveNet?

Thien223 commented 6 years ago

@osungv https://ieeexplore.ieee.org/document/7572016/ Local weighted sums concern signal decomposition, used in the STFT (I guess). I do not have much knowledge of the audio domain, but in my experience, using local weighted sums gives better-sounding results. The number of training steps I have posted above.

@lkfo415579 Yes, I run WaveNet synthesis in batches of 8 inputs. I use an old version of Tacotron-2 and the newest WaveNet.

osungv commented 6 years ago

@tdplaza

Thank you for your kindness.

I understand that 'lws' improves the sound quality when generating a waveform from a linear or mel-spectrogram. My point was whether 'lws' affects WaveNet's training and the model's output quality.

What do you think about that?

lkfo415579 commented 6 years ago

@tdplaza What is your average loss? Why does my WaveNet model explode every time I train it?... Maybe my TensorFlow version (1.6.0) is too low? What version are you using?

Yeongtae commented 6 years ago

> @Yeongtae How long does your model take to synthesize a batch of sentences? My model takes about 30-35 minutes to synthesize a batch of text sentences (about 8 sentences). Is it too slow?

WaveNet synthesis is very slow because WaveNet must predict 22,100 samples to generate 1 second of wav audio, while a V100 generates only 100~300 samples per second.
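Those figures imply a large real-time factor for autoregressive sampling. A quick back-of-the-envelope check, using only the numbers quoted in the comment above (not measured values):

```python
# Figures quoted above: ~22,100 samples per second of audio,
# and a V100 generating roughly 100-300 samples per wall-clock second.
SAMPLES_PER_AUDIO_SECOND = 22_100
GEN_LOW, GEN_HIGH = 100, 300

# Wall-clock seconds needed per second of synthesized audio.
slowdown_worst = SAMPLES_PER_AUDIO_SECOND / GEN_LOW   # 221x real time
slowdown_best = SAMPLES_PER_AUDIO_SECOND / GEN_HIGH   # ~74x real time
```

At roughly 74-221x real time, a batch of 8 sentences of a few seconds each plausibly lands in the tens of minutes reported earlier in the thread.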

Thien223 commented 6 years ago

@lkfo415579 My TensorFlow version is 1.9.0, but I don't think that is your problem. My Tacotron loss is ~0.20 and WaveNet loss is ~1.5.

@osungv I have no idea; I think we need someone with knowledge of the sound and signal domain.

dream-will commented 5 years ago

> G&L result, in Chinese mandarin. wav-100000-linear-16000.zip

May I ask what dataset you used for training?

begeekmyfriend commented 5 years ago

@dream-will 10,000 ("1W") clips of Mandarin Chinese recordings. Only for trivial tests, not for commercial use.

wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

dream-will commented 5 years ago

> @dream-will 1W clip Chinese mandarin records. Only for trivial test not for commercial use.
>
> wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

@begeekmyfriend I'm a novice; thank you for your help. May I ask what text should be used? PhoneLabeling, ProsodyLabeling, or just Chinese Pinyin?

begeekmyfriend commented 5 years ago

I just use Pinyin. But note that the Pinyin has no punctuation, while the Mandarin Chinese text does.
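A minimal sketch of the kind of transcript normalization this implies, assuming numbered-tone Pinyin lines; the helper name is hypothetical and this is not the repo's actual preprocessing:

```python
import re

def strip_punctuation(pinyin_line):
    """Keep only letters, tone digits, and spaces, then
    collapse any repeated whitespace left behind."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", pinyin_line)
    return re.sub(r"\s+", " ", cleaned).strip()
```

For example, `strip_punctuation("ni3 hao3, shi4 jie4!")` yields `"ni3 hao3 shi4 jie4"`, matching the punctuation-free Pinyin convention described above.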

terryyizhong commented 5 years ago
> wget https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

Hi @begeekmyfriend, how did you get a male voice using biaobei? May I ask which branch of your repository produces the voice in wav-100000-linear-16000.zip, mandarin-griifin or mandarin-new? Thanks!

begeekmyfriend commented 5 years ago

That is my private corpus with copyright.

CorentinJ commented 5 years ago

There are some samples in this video of my real-time voice cloning project: https://www.youtube.com/watch?v=-O_hYhToKoA