Kyubyong / tacotron

A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
Apache License 2.0

good results #30

Open ggsonic opened 7 years ago

ggsonic commented 7 years ago

https://github.com/ggsonic/tacotron/blob/master/10.mp3 Based on your code, I can get clear voices, like the one above. The text is: "The seven angels who had the seven trumpets prepared themselves to sound." You can hear some of the words clearly. The main change is about 'batch_norm': I use instance normalization instead. I think there are problems in the batch norms, and there may also be something wrong with the hp.r-related data flow, but I don't have time to figure it out for now. Later this week I will commit my code. Thanks for your great work!
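
For reference, instance normalization over a 3-D (batch, time, channels) tensor could look roughly like this. This is an illustrative sketch of the general idea, not the exact committed code; the function name and epsilon are illustrative choices.

```python
import tensorflow as tf

def instance_norm(inputs, epsilon=1e-8, scope="instance_norm"):
    """Normalize each example over its time axis, per channel.

    inputs: a 3-D tensor of shape (batch, time, channels).
    Unlike batch norm, statistics are computed per example, so there is no
    dependence on the rest of the batch and no moving averages to maintain.
    """
    with tf.variable_scope(scope):
        mean, variance = tf.nn.moments(inputs, axes=[1], keep_dims=True)
        channels = inputs.get_shape().as_list()[-1]
        beta = tf.get_variable("beta", [channels], initializer=tf.zeros_initializer())
        gamma = tf.get_variable("gamma", [channels], initializer=tf.ones_initializer())
        normalized = (inputs - mean) / tf.sqrt(variance + epsilon)
        return gamma * normalized + beta
```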

Kyubyong commented 7 years ago

@ggsonic Nice work! If you could share your training time or loss curve as well as your modified code, it would be appreciated. Plus, instance normalization instead of batch normalization... interesting. Is anyone willing to review my normalization code in modules.py? If you look at my batch normalization code in modules.py, I basically use tf.contrib.layers.batch_norm. Many people complain that the performance of the batch normalization code in TF is poor, so they officially recommend using the fused version for that reason. But fused batch normalization doesn't work on 3-D tensors, so I reshape the 3-D input tensor to 4-D before applying fused batch normalization and then restore its shape.
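
For reviewers, the reshape-then-fused-batch-norm idea looks roughly like the sketch below. The exact arguments in modules.py may differ; treat this as an illustration, with the function name bn_3d chosen for this example.

```python
import tensorflow as tf

def bn_3d(inputs, is_training, scope="bn"):
    """Apply fused batch norm to a (batch, time, channels) tensor.

    tf.contrib.layers.batch_norm(fused=True) expects a 4-D input, so the
    3-D tensor is expanded to (batch, time, 1, channels) and squeezed back.
    """
    with tf.variable_scope(scope):
        x = tf.expand_dims(inputs, axis=2)          # -> (batch, time, 1, channels)
        x = tf.contrib.layers.batch_norm(x,
                                         is_training=is_training,
                                         fused=True,
                                         updates_collections=None)
        return tf.squeeze(x, [2])                   # -> (batch, time, channels)
```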

ggsonic commented 7 years ago

This is my TensorBoard graph. I used your provided dataset and your default hyperparameters: batch size 32, learning rate 0.001, 200 epochs, trained for 8 hours on a single GPU card to get the above result. After 70 epochs you can hear some words, but it seems the learning rate should be lowered after about 110 epochs; I will test that this weekend. [graph]

candlewill commented 7 years ago

@ggsonic Nice work! Looking forward to your code modification.

ggsonic commented 7 years ago

Committed! Actually it is a simple idea and a simple change.

basuam commented 7 years ago

@ggsonic Can you share 10.mp3 via another service (e.g. Dropbox)? I wasn't able to listen to your file; it never downloaded.

Btw, your loss is still high; a loss around ~0.001 is more likely to yield good results.

Spotlight0xff commented 7 years ago

@basuam Just click Download and then "save as" (right click, or Ctrl-S). Also, the loss doesn't have to be low to yield good results; it is not necessarily directly related to perceptual quality. See e.g. this paper.

basuam commented 7 years ago

@Spotlight0xff I right-clicked and saved it, but the default player (Windows Media Player) and even VLC cannot play it. The problem with right-click (in my case) is that it does not save the file itself; it saves the metadata related to the file, which is why I cannot play it. I'm using MATLAB to read the file and it kind of worked; I just need the sample frequency to hear clearly whatever you have listened to. Without it, the audio plays either too slowly or too fast and sounds like either a demon or a chipmunk talking.

Btw, the small-loss comment comes from experience working with AUDIO. I glanced at the paper you attached, but it applies to images and not to sequential data. We cannot extrapolate that information unless someone has tested it for generative models on sequential data.

dbiglari commented 7 years ago

That is terrific output. Your output sounds similar to the audio samples at https://google.github.io/tacotron/ from the vanilla seq2seq without the post-processing network. The buzz is present, but a voice is clearly audible. Great job! I look forward to replicating your output.

greigs commented 7 years ago

@ggsonic I've trained your latest commit to 200 epochs (seems to be the default?). Here is the trained model: https://ufile.io/yuu7e and here are the samples generated by default when running eval.py on the latest epoch: https://ufile.io/41djx

chief7 commented 7 years ago

Guys, I've run some training for around 32k global steps using @ggsonic's latest commit. I used a German dataset (Pavoque, in case someone is interested) and I got some really cool results:

https://soundcloud.com/user-604181615/tacotron-german-pavoque-voice (The words are: "Fühlst du dich etwas unsicher" — "Are you feeling a bit unsure?").

I did the training on my GTX 1070 for around 12 hours.

I adjusted the maximum sentence length (in characters) as well as the maximum wave length according to my dataset. I used zero masking for the loss calculation (a rough sketch follows).
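
For anyone curious, zero masking for the loss could look roughly like this; it simply ignores zero-padded target frames when averaging the L1 loss. This is a sketch of the idea, not necessarily the exact change I made.

```python
import tensorflow as tf

def masked_l1_loss(targets, predictions):
    """L1 loss that ignores zero-padded target frames.

    targets, predictions: (batch, time, feature_dim).
    Padded time steps in `targets` are assumed to be all zeros.
    """
    # 1.0 where a target frame has any non-zero value, 0.0 on padding.
    mask = tf.to_float(tf.not_equal(tf.reduce_sum(tf.abs(targets), axis=-1), 0.0))
    mask = tf.expand_dims(mask, -1)                       # (batch, time, 1)
    l1 = tf.abs(targets - predictions) * mask
    denom = tf.reduce_sum(mask) * tf.to_float(tf.shape(targets)[-1])
    return tf.reduce_sum(l1) / tf.maximum(denom, 1.0)
```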

I also observed that @ggsonic's instance normalization without the latest merged PRs gives better results.

DarkDefender commented 7 years ago

@chief7 Did you do any dynamic step size adjustments? While I don't speak German, I think it sounds really good. Perhaps it could be even better if we adjusted the step size (as in the paper)?

chief7 commented 7 years ago

No, not yet. I definitely plan to do things like that, but my first goal is (or was) to prove that it's worth spending all that time :D I'll keep you posted on my progress!

basuam commented 7 years ago

@chief7 Can you try with the dataset mentioned in this GitHub repo? I'm really impressed that "Fühlst du dich etwas unsicher" is so clear; a little robotic, but still so clear. I'm pretty sure it can be improved, and it looks like you have found a way.

I would like to know whether the dataset we are using is big enough in comparison to the one you used. Hopefully you can try with the Bible dataset. Thank you for your time.

ghost commented 7 years ago

@chief7 could you please tell us the size of your corpus?

chief7 commented 7 years ago

Sorry guys, I totally forgot to answer ... weekend.

I use a really small corpus. Around 5.3 hours of utterances. It seems to be enough to generate random new sentences.

So I guess the bible corpus isn't too small. I'll try to check as soon as possible but my computing resources are limited.

ggsonic commented 7 years ago

According to @barronalex's code, the frames should be reshaped so that multiple non-overlapping frames are output at each time step; then, in eval.py, these frames are reshaped back to the normal overlapping representation. An example data flow is shown below. This seems to be the correct hp.r-related data flow as described in the paper. I look forward to getting better results after doing so.

[[ 1 1 1 1 1 1 1]
 [ 2 2 2 2 2 2 2]
 [ 3 3 3 3 3 3 3]
 [ 4 4 4 4 4 4 4]
 [ 5 5 5 5 5 5 5]
 [ 6 6 6 6 6 6 6]
 [ 7 7 7 7 7 7 7]
 [ 8 8 8 8 8 8 8]
 [ 9 9 9 9 9 9 9]
 [10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15]
 [16 16 16 16 16 16 16]
 [17 17 17 17 17 17 17]
 [18 18 18 18 18 18 18]
 [19 19 19 19 19 19 19]
 [20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21]
 [22 22 22 22 22 22 22]
 [23 23 23 23 23 23 23]
 [24 24 24 24 24 24 24]
 [25 25 25 25 25 25 25]
 [26 26 26 26 26 26 26]
 [27 27 27 27 27 27 27]
 [28 28 28 28 28 28 28]
 [29 29 29 29 29 29 29]
 [30 30 30 30 30 30 30]
 [31 31 31 31 31 31 31]
 [32 32 32 32 32 32 32]
 [33 33 33 33 33 33 33]
 [34 34 34 34 34 34 34]
 [35 35 35 35 35 35 35]
 [36 36 36 36 36 36 36]]

is reshaped to

[[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17]
 [ 2 2 2 2 2 2 2 6 6 6 6 6 6 6 10 10 10 10 10 10 10 14 14 14 14 14 14 14 18 18 18 18 18 18 18]
 [ 3 3 3 3 3 3 3 7 7 7 7 7 7 7 11 11 11 11 11 11 11 15 15 15 15 15 15 15 19 19 19 19 19 19 19]
 [ 4 4 4 4 4 4 4 8 8 8 8 8 8 8 12 12 12 12 12 12 12 16 16 16 16 16 16 16 20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21 25 25 25 25 25 25 25 29 29 29 29 29 29 29 33 33 33 33 33 33 33 0 0 0 0 0 0 0]
 [22 22 22 22 22 22 22 26 26 26 26 26 26 26 30 30 30 30 30 30 30 34 34 34 34 34 34 34 0 0 0 0 0 0 0]
 [23 23 23 23 23 23 23 27 27 27 27 27 27 27 31 31 31 31 31 31 31 35 35 35 35 35 35 35 0 0 0 0 0 0 0]
 [24 24 24 24 24 24 24 28 28 28 28 28 28 28 32 32 32 32 32 32 32 36 36 36 36 36 36 36 0 0 0 0 0 0 0]]
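
Here is a small numpy sketch that reproduces the toy matrices above, together with the inverse reshape that eval.py would need. The toy sizes (7-wide frames, groups of 5, stride of 4) are just read off the example and are not necessarily the repo's real hyperparameters.

```python
import numpy as np

T, n_feat, r, stride = 36, 7, 5, 4      # toy values read off the matrices above
frames = np.tile(np.arange(1, T + 1)[:, None], (1, n_feat))   # frame i is filled with i

# Pad the time axis up to a multiple of r * stride.
T_pad = -(-T // (r * stride)) * (r * stride)
padded = np.pad(frames, ((0, T_pad - T), (0, 0)), mode="constant")

# Group every `stride`-th frame so each output step carries r non-overlapping frames.
grouped = (padded.reshape(-1, r, stride, n_feat)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, r * n_feat))
print(grouped)                          # matches the 8 x 35 matrix above

# Inverse (what eval.py would need): back to one frame per row.
restored = (grouped.reshape(-1, stride, r, n_feat)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, n_feat))[:T]
assert (restored == frames).all()
```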

candlewill commented 7 years ago

@ggsonic In the latest pull request, I have added this feature. Could you please help check whether it is right?

ggsonic commented 7 years ago

@candlewill Not exactly. A simple tf.reshape applied to the example data above will give

[[ 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5]
 [ 6 6 6 6 6 6 6 ......]

but we need the non-overlapping frames

[[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17]
 [ 2 2 2 2 2 2 2 ......]

This will do the trick, as the paper says:

This is likely because neighboring speech frames are correlated and each character usually corresponds to multiple frames. Emitting one frame at a time forces the model to attend to the same input token for multiple timesteps; emitting multiple frames allows the attention to move forward early in training.

I think the get_spectrograms function in utils.py should do this trick.

Kyubyong commented 7 years ago

@ggsonic Thanks! I guess you're right. I've changed reduce_frames and adjusted the other relevant parts.

reiinakano commented 7 years ago

@ggsonic Have you tested the new commits regarding reduce_frames and seen if it works better?

Could you explain more why it's

[[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17]
 [ 2 2 2 2 2 2 2 ......

and not

[[ 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5]
 [ 6 6 6 6 6 6 6 ......]?

In my mind, the latter seems more correct. The paper said "neighboring speech frames are correlated" and therefore, these neighboring frames are the ones that must be grouped and predicted together so the attention can move forward faster. Unless [1 1 1 1 1 1 1] and [5 5 5 5 5 5 5] are considered "neighboring frames" while [1 1 1 1 1 1 1] and [2 2 2 2 2 2 2] are not, the first reshaping (the one currently committed) does not make much sense to me.

ggsonic commented 7 years ago

@reiinakano In my experiments, the new reduce_frames method makes the training process stable, while the former method always had some sort of "mode collapse". But the new method might need more global steps to get better results, and I am still waiting.

reiinakano commented 7 years ago

@ggsonic Okay, I am also running it right now with the default hyperparameters. Do you use batch normalization, or should I just stick with instance normalization? Currently at epoch 94 with batch norm and no voice... :(

Edit: IIRC, mode collapse is a GAN term. What do you mean by mode collapse in this context?

Edit 2: My loss curves so far, at epoch 94. Any comments?

screenshot from 2017-06-19 11 51 03

chief7 commented 7 years ago

I didn't have much luck with the new features introduced in the last commits. I get the best results with 7ed2f209233c307b968c7080bc36fda3a70f6707 by @ggsonic, and the loss curves are similar to the ones posted by @reiinakano, especially when it comes to the numbers. The sample I uploaded last week was generated from the model while the loss was around 1.2, just in case someone's interested.

reiinakano commented 7 years ago

@chief7 Thanks for sharing. Perhaps we should rethink the reduce_frames method? What is your opinion on the "neighboring speech frames" discussion?

Update: I am at epoch 115 and still no voice can be heard.

chief7 commented 7 years ago

As far as I understand the paper, they're predicting frames 1..r at once. If that's correct, then the current state of the reduce_frames method is not correct. Though I'm still digging into this...

ghost commented 7 years ago

@reiinakano It is possible that after a certain number of epochs your learning rate is too high for the model to converge, hence the spiking up, going down, and repeating.

Here are some of the best results I've obtained after training it for about 2 days (before the reduce_frames commit, with instance normalization). It seems that reduce_frames actually makes convergence take longer. Here is the script:

2: Talib Kweli confirmed to All Hip Hop that he will be releasing an album in the next year.
8: The quick brown fox jumps over the lazy dog
11: Basilar membrane and otolaryngology are not auto correlations.
12: would you like to know more about something.
17: Generative adversarial network or variational auto encoder.
19: New Zealand is an island nation in the south western pacific ocean.
31: New Zealand's capital city is wellington while its most populous city is auckland.

https://www.dropbox.com/s/o1yhsaew8h2lsix/examples.zip?dl=0

reiinakano commented 7 years ago

@minsangkim142 What is your learning rate? I have reverted to the 7ed2f20 commit by @ggsonic and am at epoch 157. I am using the default learning rate of 0.001. So far, it's been better than with reduce_frames (I can hear a very robotic voice), but I'm not really hearing actual words yet. The loss curve also looks much steadier now, gradually going down.

ghost commented 7 years ago

I started with 0.001, then moved it down to 0.0005, 0.0003, and 0.0001 as suggested by the original paper, except that I changed the learning rate every time the model spiked (suggesting that it may have jumped out of the local minimum), in which case I reverted the model to a checkpoint from before the spike and then lowered the rate. I started hearing some clear voices after 30k global timesteps with a dataset of 4 hours of utterances, which is about 190 epochs.
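
For anyone who wants to automate something similar, a piecewise-constant schedule in TF could look roughly like this. The step boundaries below are made-up placeholders; the process described above was actually done by hand after restoring checkpoints.

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name="global_step")

# Hypothetical boundaries for illustration only.
boundaries = [20000, 40000, 60000]
values = [0.001, 0.0005, 0.0003, 0.0001]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)

optimizer = tf.train.AdamOptimizer(learning_rate)
```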

Also, I used .npy objects and np.memmap instead of librosa.load, which roughly doubled the global steps per second.
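
A sketch of the idea (assumed file layout, not the actual change): precompute the spectrograms once, save them as .npy, then memory-map them in the input pipeline instead of decoding audio with librosa.load on every step. np.load with mmap_mode is used here as the .npy-friendly equivalent of np.memmap.

```python
import numpy as np
import librosa

# One-off preprocessing: decode each wav once and cache the spectrogram.
def cache_spectrogram(wav_path, npy_path, sr=22050):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr).T.astype(np.float32)
    np.save(npy_path, mel)
    return mel.shape

# At training time: memory-map the cached array instead of re-decoding audio.
def load_cached(npy_path):
    return np.load(npy_path, mmap_mode="r")   # lazily paged from disk
```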

basuam commented 7 years ago

@minsangkim142 Can you share those clear voices with us? Thank you for your help and knowledge.

chief7 commented 7 years ago

@minsangkim142 Did you use more sophisticated save/restore logic, or did you go with the one checked in here? Or did you turn the whole learning process into a supervised one and adjust everything manually?

ghost commented 7 years ago

@chief7 I wrote a shell script to save the training folder every once in a while, manually chose the right checkpoint, changed the learning rate, and trained again when I had time.

chief7 commented 7 years ago

@minsangkim142 interesting approach... And it fits with my observation that one should be able to sample reasonable voice outputs from the model at around 30k global steps.

I'm thinking about modifying the training process to monitor the loss and keep checkpoints conditionally.
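
One possible shape for that (a sketch, not existing code in this repo): only keep a checkpoint when a smoothed loss improves. The smoothing factor and file name are arbitrary choices; `sess` and the per-step loss value are assumed to come from the training loop.

```python
import tensorflow as tf

saver = tf.train.Saver(max_to_keep=5)
best_loss, smoothed = float("inf"), None

def maybe_checkpoint(sess, step, loss_value, logdir, alpha=0.98):
    """Save a checkpoint only when the exponentially smoothed loss improves."""
    global best_loss, smoothed
    smoothed = loss_value if smoothed is None else alpha * smoothed + (1 - alpha) * loss_value
    if smoothed < best_loss:
        best_loss = smoothed
        saver.save(sess, logdir + "/model_best.ckpt", global_step=step)
```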

lifeiteng commented 7 years ago

@minsangkim142 Which commit are you basing this on? I'm not clear what "before the reduce_frames commit, with instance normalization" refers to.

aelbialy-tbox commented 7 years ago

Any thoughts on whether you can get better results with one person's voice versus multiple different people?

greigs commented 7 years ago

@ahmed-tbox The paper says they used a single speaker; I'd imagine that gives much cleaner results than using multiple voices.

chief7 commented 7 years ago

The network learns features like emphasis from the given samples, so it's probably not as good as using a single-speaker dataset. Deep Voice 2, though, seems to address this issue.

ghost commented 7 years ago

@lifeiteng Sorry, the commit I used is https://github.com/Kyubyong/tacotron/commit/7ed2f209233c307b968c7080bc36fda3a70f6707 .

@ahmed-tbox Have a look at WaveNet (https://deepmind.com/blog/wavenet-generative-model-raw-audio/): they use global conditioning to specify which speaker is speaking at each time step, and share the network parameters to output different voices at different inferences. This could possibly be done in Tacotron by adding a speaker ID to the input tokens or something similar, in which case we would have access to more data because we would no longer be restricted to a single-speaker dataset. Although I agree with @chief7 that the quality won't be as good as a single-speaker model (in my opinion), as it will take more time for the network to converge, let alone whether it ever converges at all.
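
A rough sketch of that speaker-ID idea (hypothetical, not part of this repo): embed the speaker ID and add it to the encoder inputs so the same network can produce different voices. The function name, table size, and initializer are illustrative.

```python
import tensorflow as tf

def add_speaker_embedding(encoder_inputs, speaker_ids, num_speakers, embed_size):
    """encoder_inputs: (batch, time, embed_size) character embeddings.
    speaker_ids: (batch,) integer ids. Returns inputs conditioned on the speaker."""
    speaker_table = tf.get_variable("speaker_embedding",
                                    [num_speakers, embed_size],
                                    initializer=tf.truncated_normal_initializer(stddev=0.5))
    speaker_vec = tf.nn.embedding_lookup(speaker_table, speaker_ids)   # (batch, embed_size)
    speaker_vec = tf.expand_dims(speaker_vec, 1)                       # (batch, 1, embed_size)
    # Broadcast the speaker vector over time and add it to every input step.
    return encoder_inputs + speaker_vec
```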

ggsonic commented 7 years ago

Sorry for the late response. Finally I could get clear voices using the latest reduce_frame method (with some code changes); I think this method is closer to the original paper's method. You can hear the example: https://github.com/ggsonic/tacotron/blob/master/10_reduce.mp3 (Meanwhile I updated the former 10.mp3 file, which was tuned by choosing a proper learning rate, using the former instance norm code.) You can compare the two samples' quality. (The reduce_frame one may not be as good, but it is proof that we are going in the right direction; more tuning is needed.) For now some code changes are needed: there is one more step when doing the reduce_frame trick. You need to feed only every hp.r-th frame into the decoder inputs (just as the paper says). That is to say, you have target frames like this:

[[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17]
 [ 2 2 2 2 2 2 2 6 6 6 6 6 6 6 10 10 10 10 10 10 10 14 14 14 14 14 14 14 18 18 18 18 18 18 18]
 [ 3 3 3 3 3 3 3 7 7 7 7 7 7 7 11 11 11 11 11 11 11 15 15 15 15 15 15 15 19 19 19 19 19 19 19]
 [ 4 4 4 4 4 4 4 8 8 8 8 8 8 8 12 12 12 12 12 12 12 16 16 16 16 16 16 16 20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21 25 25 25 25 25 25 25 29 29 29 29 29 29 29 33 33 33 33 33 33 33 0 0 0 0 0 0 0]
 [22 22 22 22 22 22 22 26 26 26 26 26 26 26 30 30 30 30 30 30 30 34 34 34 34 34 34 34 0 0 0 0 0 0 0]
 [23 23 23 23 23 23 23 27 27 27 27 27 27 27 31 31 31 31 31 31 31 35 35 35 35 35 35 35 0 0 0 0 0 0 0]
 [24 24 24 24 24 24 24 28 28 28 28 28 28 28 32 32 32 32 32 32 32 36 36 36 36 36 36 36 0 0 0 0 0 0 0]]

but you only feed in one frame per step, like [17 17 17 17 17 17 17], [18 18 18 18 18 18 18], ..., and force the neural network to predict all of the multiple (hp.r) frames. This is more stable but needs more global steps. I will do more experiments to try to get better results. (A rough sketch of this decoder-input selection is below.)
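
This is roughly the idea in numpy terms (a sketch, not the exact repo code): the targets keep all hp.r frames per step, but the decoder is fed only the last frame of each group, preceded by an all-zero <GO> frame. The toy sizes match the matrices above.

```python
import numpy as np

r, n_feat = 5, 7                                  # toy sizes matching the matrices above
grouped_targets = np.zeros((8, r * n_feat))       # stand-in for the 8 x 35 target matrix

# With the toy data these rows would be [17 17 ...], [18 18 ...], ...
last_frames = grouped_targets[:, -n_feat:]
go_frame = np.zeros((1, n_feat))

# Shift by one step: the input at step t is the last frame predicted at step t-1.
decoder_inputs = np.concatenate([go_frame, last_frames[:-1]], axis=0)
print(decoder_inputs.shape)                       # (8, 7): one frame fed per decoder step
```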

lifeiteng commented 7 years ago

@minsangkim142 In May, I implemented Deep Voice 2's multi-speaker Tacotron (based on seq2seq) and learned awesome attention/alignments on the VCTK corpus. [00077_spk0015_p241_154_attention] [00077_spk0015_p241_154_linear_spectrums] Reduction factor = 4 (older style), but the audio quality is not so good (I'm busy with other things nowadays).

lifeiteng commented 7 years ago

@ggsonic In my opinion, 10_reduce.mp3 isn't as good as 10.mp3.

jacksmithe commented 7 years ago

@minsangkim142 - On which dataset have you trained?

TherapyBox commented 7 years ago

Hi, does anyone know if there is a dependency on file size when training? For example, multiple files of around 1 hour of recordings each vs. smaller files of around 5 minutes each, etc.

reiinakano commented 7 years ago

@TherapyBox Those are way too long. Tacotron trains on utterances, so only 10-30 seconds at most (I'm not sure about the exact limit for this implementation).

jaron commented 7 years ago

@minsangkim142 I see the same spiking you describe, this time on a 594-utterance training set (Arctic SLT, set A). I'm getting a muffled voice, but the fine detail isn't there, so I'm going to try reducing the learning rate too.

screen shot 2017-06-22 at 10 27 09

I was also interested to hear what you said about converting the sounds to .npy objects and loading with np.memmap instead of librosa.load to speed things up. Have you considered submitting that as a pull request?

TherapyBox commented 7 years ago

In terms of transcriptions, the format in the Bible transcripts is:

Genesis/Genesis_1-1,In the beginning God created the heavens and the earth.,4.9

So: location / text / length of the audio file. Is the length of the audio file necessary? I am trying to implement Tacotron on my own dataset and I am trying to understand the best format for the transcriptions.

GunpowderGuy commented 7 years ago

So, everyone agrees that a multi-speaker Tacotron will probably have worse quality (but be more easily trainable, which could offset that), even with the modifications detailed in Deep Voice 2?

chief7 commented 7 years ago

I've had time over the last few days to confirm the following: instance normalization and zero masking are the way to go. I've trained on the Pavoque set again and I get clear voices after ~30k global steps.

DarkDefender commented 7 years ago

@chief7 are the results better than the previous example you posted?

jarheadfa commented 7 years ago

@chief7 Inspired by the samples you shared (good job!), I'm also trying to work with the Pavoque dataset. Since I'm getting samples that are not as good as the ones you shared, I wanted to ask some questions:

  1. I'm using the https://github.com/marytts/pavoque-data/releases dataset. I noticed that it has a few modes (happy, angry, etc...), in total 10 hours, and you mentioned you are using 5.3 hours. Which modes have you chosen?
  2. The current code version removes all non-English characters. Did you do any text normalization? Can you share it?
  3. What are the values of the max wav length and max text length you set?
  4. Did you do any preprocessing to the wav files?
  5. You said you sample when the loss is about 1.2 - are you referring to self.mean_loss = self.mean_loss1 + self.mean_loss2?
  6. Can you share your code modification?

Thanks!

GunpowderGuy commented 7 years ago

But what about going multi-speaker now, to be able to exploit far more data (and to not be left behind; Tacotron will probably end up based on the Deep Voice 2 version)?