Kyubyong / tacotron

A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
Apache License 2.0

good results #30

Open ggsonic opened 7 years ago

ggsonic commented 7 years ago

https://github.com/ggsonic/tacotron/blob/master/10.mp3 Based on your code, I can get clear voices, like the one above. The text is: "The seven angels who had the seven trumpets prepared themselves to sound." You can hear some of the words clearly. The main change concerns 'batch_norm': I use instance normalization instead. I think there are problems in the batch norms, and there may also be something wrong with the hp.r-related data flows, but I don't have time to figure that out for now. Later this week I will commit my code. Thanks for your great work!
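For anyone curious what that swap looks like, below is a minimal TF 1.x-style sketch of instance normalization over a (batch, time, channels) tensor. This is not ggsonic's actual commit; the function and scope names are placeholders.

```
import tensorflow as tf

def instance_norm(inputs, epsilon=1e-8, scope="instance_norm"):
    """Normalize a (batch, time, channels) tensor per sample: mean and
    variance are computed over the time axis of each example, so
    statistics never mix across the batch (unlike batch norm)."""
    with tf.variable_scope(scope):
        mean, variance = tf.nn.moments(inputs, axes=[1], keep_dims=True)
        normalized = (inputs - mean) / tf.sqrt(variance + epsilon)
        # learnable per-channel scale and shift
        channels = inputs.get_shape().as_list()[-1]
        gamma = tf.get_variable("gamma", [channels], initializer=tf.ones_initializer())
        beta = tf.get_variable("beta", [channels], initializer=tf.zeros_initializer())
        return gamma * normalized + beta
```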

GuangChen2016 commented 7 years ago

@lifeiteng Hello, very nice job. You mentioned that you implemented (based on seq2seq) Deep Voice 2's multi-speaker Tacotron and learned awesome attention/alignments on the VCTK corpus. I am quite interested in this, and I also want to know:

  1. Did you implement multi-speaker Tacotron based on this repo, and just replace the seq2seq module? Did you implement the single-speaker Tacotron as well?
  2. Could you provide some synthesized voices (multi-speaker and single speaker)?
  3. Any other suggestions or tips to improve the results? Thank you very much.
chief7 commented 7 years ago

@jarheadfa Nice to see that you're trying on the pavoque set!

Concerning your questions:

  1. I used only the neutral voices as every other style has different emphasis and stuff like that.
  2. Yes, I did some normalization. I replaced all the German umlauts, like ö to oe, etc. I have a very crappy script that does this (a rough sketch of that kind of replacement follows this list); I'll upload it as soon as possible.
  3. In the current revision, max_wave_length is gone. Previously, I had it set to 16.8. max_len is set to 247.
  4. I did no preprocessing.
  5. I'm talking about the combination of the two. Though since the last commits, the loss is typically around 0.02 to 0.04.
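A throwaway sketch of the kind of umlaut replacement mentioned in item 2 (not chief7's actual script; the mapping and names are illustrative):

```
# Map German umlauts and eszett to ASCII digraphs before training.
UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}

def normalize_umlauts(text):
    """Replace German umlauts and eszett with their ASCII digraphs."""
    for src, dst in UMLAUT_MAP.items():
        text = text.replace(src, dst)
    return text

print(normalize_umlauts("Schöne Grüße"))  # -> Schoene Gruesse
```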

I'll share everything I have as soon as there's something worth sharing. Right now, there's a lot of work you have to do manually.

Here's what I did:

I noticed that you may have to adjust hyperparams after some time. Here's my very brittle process for "how to train the pavoque set". Please be aware: this is by no means complete or even guaranteed; it's just my experience. Start training with no zero masking and lr = 0.001 until you are able to hear words and random noise in the samples. Then adjust to lr = 0.0005 until the words become clearer. After some more time, I enable zero masking, as it seems to avoid the random noise in the samples.
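For concreteness, one hypothetical way to follow a schedule like that without editing hyperparams.py between restarts is to feed the learning rate through a placeholder. The toy loss and the step threshold below are placeholders, not values chief7 reported.

```
import tensorflow as tf

# Stand-ins for the real graph so the snippet runs on its own.
w = tf.Variable(1.0)                       # placeholder for model parameters
loss = tf.square(w)                        # placeholder for the Tacotron loss

lr = tf.placeholder(tf.float32, shape=[], name="lr")
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

def learning_rate_for(step):
    # Illustrative threshold only; in practice you would switch
    # when words become audible, as described above.
    return 0.001 if step < 50000 else 0.0005

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        sess.run(train_op, feed_dict={lr: learning_rate_for(step)})
```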

jarheadfa commented 7 years ago

@chief7 Thank you for your response! A couple of things:

  1. On which commit are your recommendations based?
  2. Other than replacing ö with oe, did you also replace äöüß, or did you remove them?
  3. You mention random shuffling - did it help?
  4. What was the loss at the time you had the best samples?
  5. At which global step (give or take) did you start hearing words and adjust the learning rate? Did you make any other learning rate modifications apart from lowering it to 0.0005?
  6. You said you started using zero masking. At which global step was that?

Thanks again :)

chief7 commented 7 years ago

@jarheadfa I'm on vacation right now - I'll upload it as soon as possible ;) aloha!

TherapyBox commented 7 years ago

Hi, has anyone tried training the model on two or more voices rather than a single voice? That would open up much bigger data sets, but I have doubts about the resulting voice quality, so if anyone can share their experience, that would be interesting.

TherapyBox commented 7 years ago

Hi, does anyone know if I can keep adding data while the model is training? I have 5h of audio ready, but will have 10h by the end of the next day, so I am thinking of feeding in 5h and then adding the other 5h while training is in progress. Has anyone tried this approach and knows whether it works fine?

TherapyBox commented 7 years ago

Hi, for some reason this bit of code is not running on our dataset:

    for step in tqdm(range(g.num_batch), total=g.num_batch, ncols=70, leave=False, unit='b'):
        sess.run(g.train_op)

We have about 170 audio files of about 30 seconds each (just a sample set to test whether training works; the full data set is much bigger). The script finishes the main program and reaches the print("Done"), but I don't think any training is happening, as I get the error. Any help?
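Not a diagnosis of this specific error, but one common reason a loop like that "finishes" without training is that g.num_batch comes out as 0, since it is typically an integer division of the number of usable examples by the batch size. A tiny illustration with made-up numbers (the batch size here is an assumption, not this repo's default):

```
# num_batch is typically len(examples) // batch_size; if it comes out
# as 0 (e.g. batch_size larger than the number of usable files, or the
# loader found no files at all), range(num_batch) is empty and the
# script falls straight through to print("Done").
num_examples = 170          # the sample set above
batch_size = 32             # hp.batch_size in hyperparams.py (assumed value)
num_batch = num_examples // batch_size
print(num_batch)            # 5 here; 0 would mean the loop never runs
```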

GunpowderGuy commented 7 years ago

Another reason to go multi-speaker now: https://twitter.com/throwawayndelet/status/887418621877772291

TherapyBox commented 7 years ago

@chief7 it would be interesting to discuss your findings and potential use of them. Please drop me a line at kkolbus@therapy-box.co.uk if you are interested

chief7 commented 7 years ago

Will do so ... but not before the weekend.

jpdz commented 7 years ago

@ggsonic Hi, have you tried feeding every hp.r-th frame into decode_inputs? Thank you for sharing some details.
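For reference, the Tacotron-paper version of that idea looks roughly like the NumPy sketch below: only the last frame of each group of r frames is used as the next decoder input, shifted right with an all-zero GO frame. Names and shapes are illustrative, not this repo's exact code.

```
import numpy as np

r = 5                                        # hp.r, the reduction factor
n_mels = 80
mel = np.random.rand(1, 100, n_mels)         # (batch, time, n_mels), time divisible by r

# keep only the last frame of each group of r frames ...
every_rth = mel[:, r - 1::r, :]              # -> (1, 20, 80)
# ... and shift right by one step, prepending an all-zero <GO> frame
go_frame = np.zeros_like(every_rth[:, :1, :])
decoder_inputs = np.concatenate([go_frame, every_rth[:, :-1, :]], axis=1)
print(decoder_inputs.shape)                  # (1, 20, 80)
```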

jackchinor commented 7 years ago

@ggsonic I ran your multi-GPU version of the code and trained a model for 200 epochs, but when I run eval.py to test the result, there is a NotFoundError that I can't figure out. Can you tell me how to fix it? I really want to see my training results. Thanks so much.

matanox commented 7 years ago

@ggsonic I listened to your sample output file from the top of this thread/issue earlier this year (10.mp3), and it sounds nothing like the sleek audio outputs showcased alongside the original Tacotron article (the voice is metallic, and there is a long stretch of noise appended to the intelligible part of the speech).

At this point I'm no longer sure how representative those original samples from Google were, versus the possibility that they were cherry-picked from a much larger pool of mediocre outputs before being showcased. Or perhaps the internal dataset they used had some auspicious properties that are absent from the public datasets available for training.

I am just wondering whether you have ultimately been able to achieve results more like those of the original authors in later training. That aside, the Deep Voice 2 article from Baidu claims a significant improvement from changing the architecture a little; I wonder whether that change is part of this repo.

I hope to create a high-quality, perfectly aligned (non-English) dataset for training with this architecture, but I seem to be failing to find evidence that it will really work to the standard of the Google showcase samples. I wonder if anyone here has much insight into the reproducibility of the Google results yet...

rafaelvalle commented 6 years ago

@jaron and @minsangkim142 : did you find out how to fix the spikes in loss or why they appear? I wonder if it is related to gradients suddenly "exploding" whenever a silence-like spectrogram or outlier sequence, e.g. short sentences or repeated words, appears.

jaron commented 6 years ago

@rafaelvalle I noticed the same loss spikes, but I'm working on another project at the moment so haven't had time to investigate.

But I think your intuition is correct. The Tacotron paper mentions a 50 ms frame length and a 12.5 ms frame shift to look 4 intervals ahead. But phonemes can be very short. I wonder if, the more examples we encounter, the greater the risk that we begin to "hear" things that aren't actually part of the set of phonemes in common speech. Which might be why people in the woods sometimes hear "words" when the wind is blowing through the branches...
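For concreteness, the 50 ms / 12.5 ms figures translate into STFT parameters like this, at an assumed 22050 Hz sample rate (this repo's sample rate may differ):

```
sr = 22050                                              # assumed sample rate
frame_length_ms, frame_shift_ms = 50, 12.5
win_length = int(sr * frame_length_ms / 1000)           # 1102 samples per frame
hop_length = int(sr * frame_shift_ms / 1000)            # 275 samples per shift
frames_per_window = frame_length_ms / frame_shift_ms    # 4 shifts fit in one frame
print(win_length, hop_length, frames_per_window)
```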

rafaelvalle commented 6 years ago

I'm using the architecture for a different problem and will report here if I find out what is correlated with the spikes!

rafaelvalle commented 6 years ago

This is a run with gradients clipped to a norm of 1.0 and LR 0.001. My gradients are vanishing, not exploding. I'll look into the data to see what's producing it.

ITER: LOSS SUM(GRADNORM)
1885: 1.586263299 4.938561218
1886: 1.683523536 5.001653109
1887: 1.629835010 2.163018592
1887: 1.629835010 2.163018592
1888: 1.616167665 4.112304520
1889: 2.122112274 0.000000000
1890: 2.226523161 3.525334871
1891: 2.122971773 4.609711523
1892: 2.150014400 0.000000000
1893: 2.201522827 0.000000000
1894: 2.185153008 4.181719988
1895: 2.158463240 0.000000000
1896: 2.221507549 0.000000000
1897: 2.052042246 0.000000000
1898: 2.191329002 0.000000000
1899: 2.194499731 nan
1900: nan nan
1901: nan nan
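For reference, a minimal self-contained sketch of this kind of setup: clip the global gradient norm at 1.0 and log the pre-clipping norm, so vanishing (0.0) or NaN gradients like those in the table above are easy to spot. The toy parameters and loss are placeholders, not this repo's code.

```
import tensorflow as tf

# Toy parameters/loss so the snippet runs on its own; in the real
# graph these come from the model.
w = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(w))

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(list(zip(clipped, variables)))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(5):
        # global_norm is the norm *before* clipping, which is what you
        # want to watch for vanishing or exploding gradients.
        _, l, gn = sess.run([train_op, loss, global_norm])
        print("{}: {:.9f} {:.9f}".format(step, l, gn))
```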

rafaelvalle commented 6 years ago

Continuing the analysis: the standard deviation of the activations of the pre layer and the decoder layer of the decoder RNN increases, and keeps increasing, considerably whenever there's a spike in the loss, possibly due to a "bad" input going into the pre layer of the decoder!
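A hypothetical way to watch for this behavior: attach a scalar summary for the standard deviation of a layer's activations (e.g. the decoder prenet output) and compare it against the loss curve in TensorBoard. The tensor names here are placeholders, not this repo's variable names.

```
import tensorflow as tf

def track_activation_std(tensor, name):
    """Add a scalar summary with the standard deviation of `tensor`'s
    activations so it can be compared against the loss curve."""
    mean, variance = tf.nn.moments(tensor, axes=list(range(tensor.get_shape().ndims)))
    std = tf.sqrt(variance)
    tf.summary.scalar("{}_activation_std".format(name), std)
    return std

# e.g., inside the decoder graph (tensor name is hypothetical):
# track_activation_std(prenet_out, "decoder_prenet")
```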