Rayhane-mamah commented 6 years ago

this umbrella issue tracks my current progress and discuss priority of planned TODOs. It has been closed since all objectives are hit.

Goal

[x] achieve a high quality human-like text to speech synthesizer based on DeepMind's paper
[x] provide a pre-trained Tacotron-2 model (Training.. checking this still)

Model

Feature Prediction Model (Done)

[x] Convolutional-RNN encoder block
[x] Autoregressive decoder
[x] Location Sensitive Attention (+ smoothing option)
[x] Dynamic stop token prediction
[x] LSTM + Zoneout
[x] reduction factor (not used in the T2 paper)

Wavenet vocoder conditioned on Mel-Spectrogram (Done)

[x] 1D dilated convolution
[x] Local conditioning
[x] Global conditioning
[x] Upsampling network (by transposed convolutions)
[x] Mixture of logistic distributions
[x] Gaussian distribution for waveforms modeling
[x] Exponential Moving Average (train + synthesis)

Scripts

[x] Feature prediction model: training
[x] Feature prediction model: natural synthesis
[x] Feature prediction model: ground-truth aligned synthesis
[x] Wavenet vocoder model: training (ground truth Mel-Spectrograms)
[x] Wavenet vocoder model: training (ground truth aligned Mel-Spectrograms)
[x] Wavenet vocoder model: waveforms synthesis
[x] Global model: synthesis (from text to waveforms)

Extra (optional):

[x] Griffin-Lim (as an alternative vocoder)
[x] Reduction factor (speed up training, reduce model complexity + better alignment)
[x] Curriculum-Learning for RNN Natural synthesis. paper
[x] Post processing network for Linear Spectrogram mapping
[x] Wavenet with Gaussian distribution (reference)

Notes:

All models in this repository will be implemented in Tensorflow on a first stage, so in case you want to use a Wavenet vocoder implemented in Pytorch you can refer to this repository that shows very promising results.

Rayhane-mamah commented 6 years ago

Just putting some notes about the last commit (7e67d8b43cf9c81c0abd1926c2288c9d68ab2d4e) to explain the motivation behind such major changes and to verify with the rest of us that I didn't make any silly mistakes ( as usual.. )

This commit mainly had 3 goals: (other changes are minor)

Clean the code: added some comments and changed the code architecture to use tensorflow's attention wrapper in the objective of reducing the number of files. Even though I tried getting rid of the "custom_decoder" and the custom "dynamic_decode" I'm currently using, after diving deep into tensorflow's implementation, I found that it was impossible to adapt my dynamic prediction to use tensorflow's ready to use "BasicDecoder" and "dynamic_decode" with my custom helpers.
Correct the Attention: Even though they called it "Location sensitive attention" in the paper, they didn't mean the "location based attention" we know but instead, they were mentioning the "hybrid" attention. For this hypothesis i'm relying on this part of the paper "We use the location sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature." which supposes they took the original bahdanau attention and added location features to it."
Added "map" (log) file at synthesis: Mainly, this will map each input sequence, to the corresponding real Mel-Spectrograms and generated ones.

I also want to bring attention to these few points (in case someone want to argue them):

I impute finished sequences at decoding time to ensure model doesn't have to learn to predict paddings (which will probably result in extra noise in generated waveforms later)
If I'm not mistaking, the paper writers used the projection to a scalar+sigmoid to explicitly predict a "" probability since our feature prediction model isn't performing some classification task where he can choose to output a real . I like to think of it as creating a small binary classifier that chooses when to stop decoding since vanilla decoder can't output a frame with full round zeros.
I am only using 512 LSTM units for decoder in each layer as i supposed "The pre-net output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units." means that the 1024 are distributed across the 2 layers.

imdatceleste commented 6 years ago

Hi @Rayhane-mamah, using 7e67d8b I got an error that (in the end). You change he call -parameter name from previous_alignments to state in attention.py:108

Was that on purpose? AttentionWrapper from TF requires the parameter to be named previous_alignments. (Using TF 1.4)

Changing that back to previous_alignments results in other errors:

ValueError: Shapes must be equal rank, but are 2 and 1 for 'model/inference/decoder/while/BasicDecoderStep/decoder/output_projection_wrapper/output_projection_wrapper/concat_lstm_output_and_attention_wrapper/concat_lstm_output_and_attention_wrapper/multi_rnn_cell/cell_0/cell_0/concat_prenet_and_attention_wrapper/concat_prenet_and_attention_wrapper/attention_cell/MatMul' (op: 'BatchMatMul') with input shapes: [2,1,?,?], [?,?,512].

Any ideas?

Rayhane-mamah commented 6 years ago

Hi @imdatsolak, thanks for reaching out.

I encountered this problem on one of my machines, updating tensorflow to latest version solved the problem. (I changed the parameter to state according to latest tensorflow attention wrapper source code, I also want to point out that I am using TF 1.5 and confirm that attention wrapper works with "state" for this version and later).

Try updating tensorflow and keep me notified, I'll look into it if the problem persists.

imdatceleste commented 6 years ago

@Rayhane-mamah, I tried with TF 1.5, which didn't work. Looking into TF 1.5, the parameter was still called previous_alignments. The parameter's name changed in TF 1.6 to state, so installed TF 1.6 and it works now. Thanks!

danshirron commented 6 years ago

Upgrading to TF1.6 (was 1.5) solved issue (TypeError: call() got an unexpected keyword argument 'previous_alignments') for me.

Rayhane-mamah commented 6 years ago

@imdatsolak, yes my bad. @danshirron is perfectly right. I checked that my version is 1.6 too (i don't remember updating it Oo)

Rayhane-mamah commented 6 years ago

Quick notes about the latest commit (7393fd505e07fe774e4aedb6a20f275e4a0619df):

Corrected parameters initialization which was causing gradients explosion in some cases (using a xavier initializer)
Added gradients norm visualization
Replaced the learning rate decay to start from step 0 (instead of 50000) and added a visualization of the learning rate
Corrected typos in "hparams.py"
Changed alignment plot directories and added real + predicted Mel-Spectrogram plots (each 100 training step)
Added a small jupyter notebook where you can use griffin-lim to reconstruct phase and listen to the audio reconstructed from generated Mel-spectrograms (just to control the model learning state without paying much attention to audio quality as we will use wavenet as a vocoder)
Started using a reduction factor (despite not being used in tacotron-2) as it speeds training process (faster computation) and allows for faster alignment learning. (current: r=5, feel free to change it).
Corrected typos in preprocessing (Make sure to restart the preprocessing before training your next model)

Side notes:

Alignment should appear at step 15k and audio becomes quite audible at 4~5k steps (using 32 batch size) but fully understandable around 8~10k steps.
Mel-spectrograms seem very blurry at the beginning, and despite the loss not decreasing much (you may even feel it's constant after 1k steps) the model will still learn to improve speech quality so be patient.

If there are any problems, please feel free to report them, I'll get to it as fast as possible

Rayhane-mamah commented 6 years ago

Quick review of the latest changes (919c96a88be97714726e5752123ef3fcb555bf9b):

Global code reorganization (for easier modifications and it's just cleaner now)
Network Architecture review: Since there are some unclear points in the paper, I am doing my best to collect enough information from all related works, and trying to put them all together to get reasonable results. The current architecture is the closest I got to the described T2 (I think.. ^^')
Pulled the prediction out of the decoder and got rid of the custom "dynamic_decode".
Reduced the model size and added new targets (stop token targets are now prepared in the feeder)
Adapted prediction to work properly with the reduction factor. (multiple predictions at each decoding step)
Doubled the number of LSTM units in the decoder and number of neurones in the prenet. On the other hand, I removed the separate attention LSTM and started using the first decoder LSTM hidden state as a query for the attention.

Side Notes:

Despite slightly reducing the memory usage of the model, impact of training speed are still not clear enough. Forward propagation got slightly faster and back propagation slightly slower. But the overall speed seems the same.

If anyone tries to train the model, please think about providing us with some feedback. (especially if the model needs improvement)

ohleo commented 6 years ago

Hi @Rayhane-mamah, thanks for sharing your work.

I cannot get a proper mel-spectrogram prediction and audible wave by Evaluation or Natural synthesis(No GTA) at step 50k. All hparams are same with your code(with LJSpeech DB) and wave are generated by mel-prediction, mel_to_linear, Griffin-Lim reconstruction. GTA synthesis generates audible results.

Is it works in your experiments?

I attached some Mel-spectrogram plot samples with following sentences.

1 : “In Dallas, one of the nine agents was assigned to assist in security measures at Love Field, and four had protective assignments at the Trade Mart."

Ground Truth

GTA

Natural(Eval)

2 : ”The remaining four had key responsibilities as members of the complement of the follow-up car in the motorcade."

Ground Truth

GTA

Natural(Eval)

3 : “Three of these agents occupied positions on the running boards of the car, and the fourth was seated in the car."

Ground Truth

GTA

Natural(Eval)

Rayhane-mamah commented 6 years ago

Hello @ohleo, thank you for trying our work and especially for sharing your results with us.

The problem you're reporting seems the same as the one @imdatsolak mentionned here.

There are two possible reasons I can think of right now:

Your model after 50k steps still has an ugly allignment (hopefully this commit takes care of that). That's the most probable reason i think.
I am unknowingly and indefinitely passing the first frame to the decoder in my code. I will triple check this today ( in case TacotestHelper is the cause).
It can't be possibly doing a massive overfit on the first generated frame can it ? Oo the output looks the same for the three sentences!

The fact that GTA is working fine highly supposes the problem is in the helper.. I will report back to you later tonight. If your setup is powerful enough, you could try to retrain the model using the latest commit or wait for me to test it myself a bit later this week.

In all cases, thanks a lot for your contribution, and hopefully we get around this issue soon.

unwritten commented 6 years ago

Hello, @Rayhane-mamah ,

do you get any further information running using the latest code?

Rayhane-mamah commented 6 years ago

Hello @unwritten, thanks for reaching out. I believe you asked about GTA as well? I'm just gonna answer it anyway in case anyone gets the same question.

GTA stands for Ground Truth Aligned. Synthesizing audio using GTA is basically using teacher forcing to help the model predict Mel-spectrograms. If you aim to use the generated spectrograms to train some vocoder like Wavenet or else, then this is probably how you want to generate your spectrograms for now. It is important to note however that in a fully end-to-end test case, you won't be given the ground truth, so you will have to use the "natural" synthesis where the model will simply look at its last predicted frame to output the next one. (i.e: with no teacher forcing)

Until my last commit, the model wasn't able to use natural synthesis properly and I was mainly suspecting the attention mechanism because, well, how is the model supposed to generate correct frames if it doesn't attend to the input sequence correctly.. Which brings us to your question.

So after a long week end of debugging, it turned out that the attention mechanism is just fine, and that the problem might have been with some Tensorflow scopes or w/e.. (I'm not really quite sure what was the problem). Anyway, after going back through the entire architecture, trying some different preprocessing steps and replacing zoneout LSTMs with vanilla LSTMs, the problem seems to be solved (Now I am not entirely 100% sure as I have not yet trained the model too far, but things seem as they should be in early stages of training).

I will update the repository in a bit (right after doing some cleaning), and there will be several references to papers used that the implementation was based on. These papers will be in pdf format in the "papers" folder, like that it's easier to find if you want to have an in depth look of the model.

I will post some results (plots and griffin lim reconstructed audio) as soon as possible. Until then, if there is anything else I can assist you with, please let me know.

Notes:

It is possible for now to use the griffin lim algorithm (using the provided notebook) to do a basic inversion of the mel spectrogram to waveform. The quality won't be as good as Wavenet's but it's mainly for test and debug purposes for now.
Generated spectrograms using the "synthesize.py" will be stored under "output" folder. Depending on the synthesis mode you used, there will many possible sub-folders.
I have not yet added the Wavenet vocoder to this repository as there are more important things at the moment like ensuring a good spectrograms generation. There are good Wavenet implementations out there that are conditioned of Mel-spectrograms like r9y9/wavenet.

Rayhane-mamah commented 6 years ago

Hello again @unwritten.

As promised I pushed the commit that contains the rectifications (c5e48a036a48ce23075ddb31d0340e81f01f7418).

Results, samples and pretrained model will be coming shortly.

PetrochukM commented 6 years ago

@Rayhane-mamah

Results, samples and pretrained model will be coming shortly.

Trying to understand "shortly", do you think they'll be out today, next week or next month?

Rayhane-mamah commented 6 years ago

@PetrochukM, I was thinking more like next year.. that still counts as "shortly" I guess..

Enough messing around, let's say it will take a couple of days.. or a couple of weeks :p But what's important, it will be here eventually.

imdatceleste commented 6 years ago

Hi everybody, here is a new dataset that you can use to train Speech Recognition and Speech Synthesis: M-AILABS Speech Dataset. Have fun...

unwritten commented 6 years ago

@Rayhane-mamah thanks for the work; I have tried to train the latest commit maybe before 81b657d, I pulled the code about 2 days ago, currently it run to about 4k, the align doesn't look like to be there, I will try the newest code though: step-45000-pred-mel-spectrogram step-45000-real-mel-spectrogram

step-45000-align

Rayhane-mamah commented 6 years ago

Hi @imdatsolak, thank you very much for the notification. I will make sure to try it out as soon as possible.

@unwritten, I experienced the same issue with the commit you're reporting.

If your really don't want to waste your time and computation power for failed tests, you could wait a couple of days (at best) or a couple of weeks (at worst) until I post a 100% sure to work model, semi-pretrained which you can train further for better quality (I don't have the luxury to train for many steps at the moment unfortunately).

Thank you very much for your contribution. If there is anything I can help you with or if you notice any problems, feel free to report back.

maozhiqiang commented 6 years ago

@Rayhane-mamah thanks for the work; why Loss descends very quickly than tacotron1？

Rayhane-mamah commented 6 years ago

Hello @maozhiqiang, thank you for reaching out.

In comparison to Tacotron-1 which uses simple summed L1 loss function (or MAE), we use (in Tacotron-2) a summed L2 loss function (or MSE). (The sum is in both cases is of predictions before and after the postnet). I won't pay much attention to the average along batch here for simplicity.

Let's take a look at both losses: (h(xi) stands for the model estimation)

L1 = ∑i |yi−h(xi)| L2 = ∑i (yi−h(xi))²

The L1 loss is typically computing the residual loss between your model's predictions and the ground truth and returning the absolute value as is. The L2 loss however squares this error for each sample instead of simply returning the difference loss. Now consider that your model starts for an initial state t0 where weights and biases are initialized randomly. Naturally the first model output will be totally random which results in a high loss for L1 which is even more amplified by the square operation in L2 (I'm supposing the initial loss is greater than 1). After a few steps of training, the model should start emitting outputs that are in range of the correct predictions (especially if your data is [0, 1] normalized like in our case, the model doesn't take long to start throwing outputs in that range). This can be detected by the blurry, yet seem close to real, spectrograms the model is providing each 100 steps. At this stage, the L1 and L2 loss functions start showing very different values. Try taking a difference (yi - h(xi)) smaller than 1 and compute its square, naturally you will get an even smaller value. So, once the model starts giving outputs in the correct range, L2 loss is already very low in comparison to L1 loss which does not compute the square.

Note: Next the model will only have to improve the vocal patterns, which consist of small adjustments, which explains why the loss then starts decreasing very slowly.

So mainly, what I'm trying to point out here is that we are not using the same loss function as in Tacotron-1 which I believe is the main reason for such difference. However, there are some other factors like the difference in the model architecture, or even the difference in target itself. (In Tacotron-1, we predict both Mel-spectrogram and Linear spectrograms using the post-processing net).

I believe this answers your question? Thanks again for reaching out, if there is anything else I can assist you with, please let me know.

maozhiqiang commented 6 years ago

hello @Rayhane-mamah thanks for your detailed reply, I started training with your code these days， Here's my training figure step-27000-align step-27000-pred-mel-spectrogram step-27000-real-mel-spectrogram When I run more than one hundred thousand times, the difference between pred-mel and real-mel is still great,but loss More than 0.03 or smaller， Is there any problem in this? Look forward to your reply ，thank you

a3626a commented 6 years ago

Here is empirical evidence for @Rayhane-mamah 's reasoning.

default

Yellow line uses loss function of tacotron1, brown line uses loss function of tacotron2. Loss of brown is about square of loss of yellow. (and they intersect at 1.0!)

a3626a commented 6 years ago

Hello. I'm working on Tacotron2, and worked based on Keithito's implementation. Recently, I am trying to move to your implementation for some reasons.

There is one fundamental difference between @Rayhane-mamah 's TacotronDecoderCell and tensorflow.contrib.seq2seq.AttentionWrapper which Keithito used. AttentionWrapper uses previous output (mel spectrogram) AND previous attention(= context vector), but yours only use previous outputs.

With my modified version of Keithito's impl can make proper alignment, but yours cannot (Or just your impl requires more steps to make good alignment). I suspect the above mentioned difference for this result.

(One strange behavior of your implementation is that the quality of synthesized samples on test set is quite good, though their alignments are poor. With Keithito's implementation, without proper alignment, test loss is really huge.)

Do you have any idea about this? (Which one is right, concatenating previous attention or not?)

Rayhane-mamah commented 6 years ago

hello @maozhiqiang and @a3626a , Thank you for your contribution.

@maozhiqiang, The loss you're reporting is perfectly normal, actually the smaller the loss the better, which explains why the further your train your model the better the predicted Mel-spectrograms become.

The only apparent problem which is also reported by @a3626a, is that the actual state of the repository (the actual model) isn't able to capture a good alignment.

@maozhiqiang, alignments are supposed to look something like this: step-25000-align

Now, @a3626a, about that repository comparison, I made these few charts to make sure we're on the same page, and to make it easier to explain (I'm bad with words T_T).

Please note that for simplicity purposes, the encoder outputs call, the prediction part and the recurrent call of previous alignments were not represented. If your notice any mistakes, please feel free to correct me:

Here's my understanding on how keithito's Decoder works: tacotron-1-decoder

The way I see it, he is using an extra stateful RNN cell to generate the query vector at each decoding step (I'm assuming this is based on T1 where 256-GRU is used for this purpose). He's using a 128-LSTM for this RNN.

As you stated, the last decoder step outputs are indeed concatenated with the previous context vector before feeding them to the prenet (this is automatically done inside Tensorflow's attention_wrapper). Please also note that in the "hybrid" implementation keithito is using, he is not concatenating the current context vector with the decoder RNN output before doing the linear projection. (just pointing out another difference in the architecture).

Now, here's what my decoder looks like: tacotron-2-decoder

In this chart, the blue and red arrows (and terms in equations) represent two different implementations I tried separately for the context vector computation. Functions with the same name in both graphs represent the same layers (look at the end of the comment for a brief explanation about each symbol).

The actual state of the repository is the one represented in blue. i.e: I use the last decoder RNN output as query vector for the context vector computation. I also concatenate the decoder RNN output and the computed context vector to form the projection layer input.

Now, after reading your comments (and thank you for your loss plot by the way), Two possible versions came to mind when thinking of your modified version to keithito's tacotron:

First and most likely one: In case you used Tensorflow's attention_wrapper to wrap the entire decoder cell, then this chart should probably explain how your decoder is working: tacotron-hypothesis-1-decoder

here I am supposing that you are using the previous context vector in the concatenation operations. (c_{i-1}) and then update your context vector at the end of the decoding step. This is what naturally happens if you wrap the entire TacotronDecoderCell (without the alignments and attention part) with Tensorflow's attention_wrapper.

Second but less likely one: If however you did not make use of the attention_wrapper, and do the context vector computation right after the prenet, this is probably what your decoder is doing: tacotron-hypothesis-2-decoder

This actually seems weird to me because we're using the prenet output as a query vector.. Let's say i'm used to provide RNN outputs as query vector for attention computation.

Is any of these assumptions right? Or are you doing something I didn't think of? Please feel free to share your approach with us! (words should do, no need for charts x) )

So, to wrap things up (so many wrapping..), I am aware that generating the query vector using an additional LSTM gives a proper alignment, I am however trying to figure out a way that doesn't necessarily use an "Extra" recurrent layer since it wasn't explicitly mentioned in T2 paper. (and let's be honest, I don't want my hardware to come back haunting me when it gets tired of all this computation).

Sorry for the long comment, below are the symbols explained:

p() is a multi-layered non-linear function (prenet)
e_rec() stands for Extra Recurrency (attention LSTM)
Attend() is typically the attention network (refer to (content+location) attention paper for developed formulas)
rec() is the decoder Recurrency (decoder LSTM)
f() is a linear transformation
py{i}, s{i}, es{i}, y{i}, a{i} and c_{i} are the prenet output, decoder RNN hidden state, attention RNN hidden state, decoder output, alignments and context vector respectively (all at the i-th step).
h is the encoder hidden states (encoder outputs)

Note: About the quality of synthesized samples on test set, I am guessing you're referring to the GTA synthesis? It's a little bit predictable since GTA is basically a 100% teacher forcing synthesis (we provide the true frame instead of the last predicted frame at each decoding step). Otherwise, (for natural synthesis), the quality is very poor without alignment.

a3626a commented 6 years ago

Most of all, thank you for your reply with nice diagrams.

About quality of samples on test set. Though I have not tested, you are probably right. Teacher forcing was enabled in my system.
About my implementation My implementation's structure is almost identical to Keithito's. I mean 'modified' for adding more regularization methods, speaker embedding, different language with different dataset.
My future approach I will follow your direction, getting rid of extra recurrent layer for attention mechanism. In my opinion, 2-layer decoder LSTMs can do the job of extra recurrent layer. I think what to feed into _compute_attention is the key, which is not clear in the paper. (Like you did, as red arrow and blue arrow) For the start, I will feed 'previous cell state of first decoder LSTM cell'. There are 2 reason for this choice. First, I am expecting the first LSTM cell to work as an attention RNN. Second, it seems like better to feed cell state, not hidden state(output). Because, it does not require unnecessary transformations of information. In other words, hidden state(output) of LSTM cell would be more like spectrogram, not phonemes, so this must be converted back into phoneme-like data to calculate energy(or score). In contrast cell state can have phoneme-like data which can be easily compared to encoder outputs (phonemes)

Rayhane-mamah commented 6 years ago

Hello again and thank you for your answers.

"Speaker embedding" sounds exciting. I'm looking forward to hearing some samples once you're done making it!

About the attention, this is actually a nice interpretation! I can't test it out right now but I will definitely do! If you do try it out please feel free to share your results with us.

Thanks again for your contributions!

a3626a commented 6 years ago

I'm testing feeding 'previous cell state of first decoder LSTM cell', I will share the result after 1-2 days.

Thank you.

r9y9 commented 6 years ago

Wow, nice thread;) I will follow the discussion here and would like to look into your code. Thank you for sharing your work!

a3626a commented 6 years ago

First, I attached results below. In conclusion, "feeding first LSTM's cell state" does not work. According to 'Attention-Based Models for Speech Recognition', one RNN can produce output and also context vector.(or glimpse) Therefore, I think it is possible to get rid of extra RNN in Keithito's implementation(or Tacotron1).

For the next trials, 1) I will feed last LSTM's cell state, 2) I will set the initial states of decoder LSTMs as trainable parameters, not zeroes. This is mentioned in 'Attention-Based Models for Speech Recognition'.

Below results are came from my modified versions of Keithito's Tacotron2 and Rayhane-mamah's Tacotron2. No guarantee for repeatable results with the original implementations.
all results are produced by models that use reduction_factor=5. I have not succeeded to train nice alignment with reduction_factor=1, with any implementation. Unfortunately, reduction_factor seems like very important to integrate to WaveNet vocoder. It is because, I succeeded to train WaveNet vocoder with mel spectrograms from target wavs, but failed with mel spectrograms generated from ground-truth alignment from Tacotron1 with reduction_factor=5. I concluded that generated mel spectrogram could not be aligned well because of high reduction_factor.
Dataset is Korean. Encoder/Decoder axises are switched each other.

Keithito's (+more regularization methods listed in the Tacotron2 paper)
Rayhane's
Rayhane's with 'feeding first cell state'

unwritten commented 6 years ago

@a3626a , will you share your modified taco2 repo? agree that training r=1 is hard on Keithito's repo, I assume you are training soft attention, do you ever try hard attention?

a3626a commented 6 years ago

I won't share my code, but structures, hyper-parameters, and generated samples can be shared.
I'm focusing reproducing Tacotron2, so I am training soft attention like Tacotron2. I have not tried hard attention on TTS.

Rayhane-mamah commented 6 years ago

Hello everyone, good news, it's working and it's faithful to the paper! (d3dce0e05be6a86225389a38aab03710c47b7368)

step-3400-align

First of all, sorry for taking too long, I have been dealing with a leak in my watercooling so I wasn't really able to do much work in the past few days.. (it's 31 March, 11:50 PM where I live so technically it's still not the end of the month yet.. So, I'm right on time :p ) Hello @r9y9 , thanks for joining us, I wonder where I got the idea of this open discussion from. :p

Anyways, as you can see in previous plot, the attention, or the entire model actually, is working correctly now (hopefully). For the Ljspeech dataset, alignments start appearing at 2k steps, and are practically learned at 3k steps. They are still however a little noisy for long sentences but I expect them to become better with further training. Actually you can even notice that, at early stages, the model is paying more attention to areas around the second diagonal than to the rest of the matrix. (refer to the drive down below) As for the speech quality, we start understanding some words even before 1k steps but the overall audio isn't quite audible. At 3k4 steps, audio is pretty understandable, but outputs are still noisy, and more training is needed to get noise free outputs. (samples down below)

In the next few hours, I will release a document containing an in depth explanation of most of what is implemented in this repository. It's mainly for anyone who wants to have an in depth understanding of the network, or wants to know what's going on exactly in order to adapt this implementation to similar work or maybe build on it! (@a3626a it will cover the explanation about my current attention modifications based not only on Luong and Bahdanau papers, but also on Tensorflow attention tutorial.)

I do however want to draw your attention to some key points you need to be aware of before training your next model:

Mel data distribution: First, if you preprocess the ljspeech dataset using the last commit, you will notice that data distribution has changed. In fact I modified the Normalization function in the preprocessing to rescale the targets differently. In the hparams.py, you will notice symmetric_mels and max_abs_value which will let you choose how to rescale your targets. If symmetric is set to True, typically your targets will be distributed across [-max_abs_value, max_abs_value], else it will take values in [0, max_abs_value]. Defaults to [-4, 4]. I know that such choice may seem arbitrary, in depth explanation will be provided in the document, but this maneuver proved to speed up training at early stages by putting more penalization on the model mels outputs, mainly due to the nature of the loss function. (consult mel spectrograms provided down below).
Wavenet integration: Due to the preceding point, taking the T2 tacotron directly to some pretrained Wavenet model like r9y9/wavenet is impossible. There are possible solutions for that actually. One can simply rescale the mel outputs to adapt to the pretrained Wavenet. In the default case, shifting the data to [0, 8] and rescaling it to [0, 1] should do the trick, and I don't think it will affect the quality of the output in any way. If you do not have a pretrained Wavenet, I would suggest to train the model directly on the T2 output, @r9y9 can confirm if it's possible.
Training effectiveness: I also want to point out that the choice of the mel scaling is highly dependent on the data and especially on the language at hand. One can even choose to not normalize the data at all which I personally don't recommend unless you also change the regularization weights to avoid bias-ing (is that even a word..) your model. I also added a wav generation at each checkpoint step (stored under logs-Tacotron/wavs), I recommend listening to those and keep an eye on the alignments to know if your model is going the right way.
Batch size: We also experienced a case where the model was not able to capture attention when trained with a small batch size (8). This is most likely due to the noisy gradients that come with such small batch size, so we recommend that you train your models with a batch size of 32 or optimally 64.

Sorry for the long comment as always, I will simply finish by giving a heads up on what's coming up next:

Immediate:

Full integration and support for the M-AILABS dataset proposed by @imdatsolak. I have to admit that this is an amazing work you've done sir, and the female english data sounds awesome! Thank you very much for this large speech corpus!
Reorder the repository to prepare for Wavenet integration (two models link code, train and test pipelines, etc.)

Right After:

First attempt at integrating Wavenet as a vocoder for a human like TTS quality. First versions will mainly be an adaptation of r9y9/wavenet work to Tensorflow. This decision is due to the very promising results this repository has achieved so far.
Provide a pretrained Feature prediction Model.

Optional:

Add an optional post processing network that maps predicted Mel Spectrograms to Linear Spectrograms. The only purpose of adding those few layers would be to use the linear spectrograms as inputs to the griffin-lim inversion algorithm since it yields better quality than inverting mels due to their lossy nature. This will mainly be based on keithito's work.

Finally, here is the semi-trained T2 model (if we can call it that) used to generate the previous picture with all equivalent plots and wavs generated during training. I was not able to train the model much longer since my machine is actually disabled.. The model was trained on CPU so don't pay much attention to training time, it should go way faster with a GPU! Logs folder provided should also give you the ability to consult some Tensorboard stats.

If you do train a model using my work, please feel free to share your plots, observations, samples and even trained models with any language you like!

In case you encounter any problems, please notify me, I will get right to it!

unwritten commented 6 years ago

@Rayhane-mamah, do you try outputs_per_step = 1? I can get alignment when outputs_per_step = 5, but outputs_per_step = 1 can't, does this mean long sequence is too hard to train?

Rayhane-mamah commented 6 years ago

@unwritten, No I have not tried that out yet but I am most interested in knowing the answer.

If with the last commit, alignments are not learned with a reduction factor of 1, the first thing to suspect is indeed the sequence length. In that case, one could try to mask the input paddings and impute output paddings (the relevant hparams are provided).

As for the impact of the reduction factor on the wavenet quality, I am not sure if it is related or not. Working on the issue is a top priority so I will keep you informed. @r9y9, are you aware of anything like this?

In any way thank you guys for reporting this, hopefully we find a way around.

@a3626a, out of curiosity, for how long did you train the tacotron-1 model which outputs were used on the wavenet? And the failure test case you're refering to, is it related to the wavenet not being able to reproduce a high quality audio or is it related to the model loosing all language/vocal information and emitting random high quality voices?

a3626a commented 6 years ago

Steps of Tacotron-1 Between 100,000~200,000 with batch size 32, 100 speakers, 1h for each speaker. Audio quality after Griffin-Lim was not bad, perfectly audible, but little noisy (like samples from paper).
WaveNet loses all vocal information. However, feeding mel spectrogram generated from target waveform works well. (Everything except input of WaveNet was same.) I am sure that ground truth alignment (or teacher-forcing) was enabled during training.

Rayhane-mamah commented 6 years ago

@a3626a, I see, I will try to reproduce the issue as soon as possible and tell you how it goes.

EDIT: @a3626a, you said teacher-forcing was enabled during training. what about synthesis time? did you visualize the predicted mels? did you generate the outputs by feeding previously generated frames back to the decoder? If that's the case, is it possible to try synthesizing new mels with teacher forcing (Like the GTA option in my repo).

Rayhane-mamah commented 6 years ago

Quick observations sharing:

All results down below are generated using a T2 model from this repository trained on Ljspeech dataset for 6k4 steps and are prone to become much better with further training! All results down below are generated in a Natural mode ("eval" mode) with no teacher forcing, on test sentences absent from the training data! (check hparams.py) At this stage in training (which is considered still early), the model still has some pronunciation issues (e.g: the cases of "I" and "Y") (check temporary audio samples). All equivalent sentences are written inside the plots.

Let's start with something simple:

Sentence 1:

Sentence 2: ljspeech-alignment-00003 ljspeech-mel-00003

Despite being dependent on previous cumulative alignments, the model managed to make a good alignment even with no ground truth feeding, even without looking at the Mels or listening to the wav, one could deduce that the model is probably emitting a nice output.

Spectrogram plots generated during this evaluation are very similar to training spectrograms at 6k4 steps (which is used for the evaluation).

In the next examples, I want to bring your attention to the "extra" silence the model is emitting for no visible reason in the input sequence. This is probably due to the reading style in the dataset recordings:

Sentence 1:
Sentence 2:

Next, we evaluate the model on punctuation sensitivity:

Sentence 1:

-Sentence 2: ljspeech-alignment-00013 ljspeech-mel-00013

It's pretty visible that the model is simply adding some silence and attributing attention to the same token "," for multiple decoding steps when present.

Next, I wanted to check the scalability of the model on very long sequences (which explains why extended max_iters of the decoder to 1000 (just for safety in case of an infinite loop)): ljspeech-alignment-00030 ljspeech-mel-00030

The overall output is acceptable, you can however notice that at some point the model looses the attention and skips some fragments in "add this last". Training might solve the issue of attention for long sequences, but I am also thinking about implementing the attention windowing discussed here which not only reduces computation (which accelerates the model), but also limits the number of input token the model attends to at each decoder step, making the task of attention a little easier. In other words, this will give a very rough estimate of the desired alignments, thus bringing the model in the correct range faster. This has been discussed in-depth in this speech recognition paper.

Finally, knowing that all presented results are raw outputs of the model and not clipping was made, we can notice that the model is doing very well at predicting when to stop generation. A small detail is the last predicted frame in the mel spectrograms, you can see that it looks a lot like the padding frames in the training mel spectrograms. The trick was to not impute finished sentences, allowing the model to learn when to output "padding-like" frames and thus predicting the correctly. Imputing the decoder finished sequences might make the prediction a little more challenging.

Unfortunately, the model has not learned the difference between nouns and verbs or past and present (yet?). There's no much to see in the alignments or mel plots actually, but you can notice the failure case when listening to the wavs. Whether the model will eventually learn it or not highly depends on the dataset. The same applies for Capital vs small letters.

So, as a conclusion, I just wanted to point out that the repetitive frame output case (reported by @imdatsolak and @ohleo) is solved once the model knows where to "attend" when generating, and the actual results are very promising and a fully trained model should do well.

a3626a commented 6 years ago

EDIT: @a3626a, you said teacher-forcing was enabled during training. what about synthesis time? did you visualize the predicted mels? did you generate the outputs by feeding previously generated frames back to the decoder? If that's the case, is it possible to try synthesizing new mels with teacher forcing (Like the GTA option in my repo).

what about synthesis time?

GTA is not enabled, generated frames are fed back.

did you visualize the predicted mels?

No I haven't but I heard those samples generated. 1) WaveNet which is fed spectrogram from Tacotron1 ignores all local conditions, so it sounds like vanilla WaveNet which doesn't use local conditions. It babbles. 2) WaveNet which is fed spectrogram from target audio sounds okay.

If that's the case, is it possible to try synthesizing new mels with teacher forcing

I think it is worth to try.

ferintphilipose commented 6 years ago

Hi @Rayhane-mamah , Could you guide me to run the training script in GPU mode ? Currently It is using just the CPU and not utilizing the GPUs .

Rayhane-mamah commented 6 years ago

@imdatsolak, After adding the [M-AILABS speech corpus]() support, I noticed some missing wavs despite the presence of their titles inside the csv metadata (en_US version). I thought that it might interest you to know that. Running the preprocessing script of (240ccf85fcbbacbb4d1c70acfef185f23f01183c) will give you all the missing files names (as log on the terminal). Also, if you find the time, I would appreciate it a lot if you could verify that the language codes I am supporting conform to their equivalent folders names in your corpus. Thank you very much in advance.

@a3626a, Personally, I would suspect the Tacotron-1 to be failing at synthesis time.. Visualizing the predicted spectrograms is a great way of debugging this (I added this here). Other useful things we could try are GTA synthesis and try some toy griffin lim inversion. If inverted spectrograms are audible but become babbling when used on Wavenet, then there is 100% some compatibility issue somewhere, otherwise, the Spectrogram Prediction Model itself is having problems. Please keep me informed of any tests you make in case I can be on any assistance!

@ferintphilipose Hello and thank you for reaching out! By default, my implementation works on GPU if Tensorflow-gpu is installed correctly and drivers are in the correct version and working. Does your Tensorflow use GPU for other projects and not for this one?

If not, There is the installation tutorial for all supported OS, and you can follow up with this quick tutorial I made with pictures inside a Jupyter Notebook.

If you are 100% sure all your installations and CUDA drivers are on point, if you have installed your Tensorflow-gpu inside a virtual environment, please make sure to activate it when you want to run projects on GPU.

Other than that, I am not really sure what can be the problem, as the project works perfectly on my GPU.. If you however find any more information, feel free to share in case I can be of any assistance.

imdatceleste commented 6 years ago

@Rayhane-mamah, I will check for the audio files as you mentioned. There may be, indeed, missing ones but we will also do additional QA on that.

Regarding language-codes: we always use lang_Country, e.g. ru_RU, uk_UK.

In the language codes list you are using, you would probably need to change these: es-ES => es_ES ru-RU => ru_RU uk-UA => uk_UK and so on.

imdatceleste commented 6 years ago

@Rayhane-mamah, just pulled the latest commit. When I try to preprocess, I get the following error: Traceback (most recent call last): File "/usr/lib/python3.5/concurrent/futures/process.py", line 175, in _process_worker r = call_item.fn(*call_item.args, **call_item.kwargs) File "/home/iso/Development/Tacotron-2/datasets/preprocessor.py", line 111, in _process_utterance assert time_steps >= T * audio.get_hop_size() AssertionError I know it has to do with the length and it seems some of the audio-files I'm using are short. But shouldn't the padding solve that automatically?

Thanks

Rayhane-mamah commented 6 years ago

@imdatsolak, Oh god I don't even know what I was thinking while typing that part.. Sorry for the typo, I was using the mel dimension (80) instead of the mel frames (length). there you go, should be good now (54593a02b73eb36ea0275184bef567e56d0a1b27).

Also thanks for the language codes rectification, if there are further mistakes, I'll correct them when upcoming languages are released. Do you have a date for the French release?

imdatceleste commented 6 years ago

@Rayhane-mamah, thanks for the bug-fix!

French: We are currently working on the French dataset. Probably within the next 3-4 weeks as our French-speaking resources are quite limited :-). But the data is ready, the text needs to be QA'd and then final QA done. Then it should be online... I'll let you know immediately. BTW: Is there a big difference between Tunisian and Saudi in Arabic? Excuse my ignorance, but I'm not so well versed myself in Arabic/dialects.

Rayhane-mamah commented 6 years ago

@imdatsolak, Great I'm really excited for the French dataset!

As for the Arabic, just like USA English and UK English have different accents and sometimes different words, Saudi Arabia's Arabic and Tunisian Arabic also differ (local language speaking). However, there is a "formal" Arabic that is common between all Arab countries and that we all understand. Since you don't have much experience with this language, I'm just gonna say that the Arabic version implemented in Apps that have TTS (like "Siri") is indeed the "formal" Arabic.

So making a single "formal" Arabic version should make much less work, and more people can help doing it. To conclude, I'm 99% sure that the data you're trying to align and clean is in the formal manner, because Arabic is usually written in this common way and read likewise.

ferintphilipose commented 6 years ago

@Rayhane-mamah Hi , Thanks a lot for clarifying my query with regard to the GPU issue. I checked it out and found that my CUDNN path was not specified to the activation path of the virtual environment I was using. Now I am able to use the GPU for running the training script. Thanks once again. :)

twidddj commented 6 years ago

@Rayhane-mamah Hi, Thanks a lot for sharing your works. Your works would be very helpful to integrate tacotron and wavenet. I want to share our some works for vocoder(here). I'll do my best in my side and share anything one if i got reasonable result.

Rayhane-mamah commented 6 years ago

Hello @twidddj and welcome!

I will make sure to look at your work. Hopefully we can help each other achieve something nice.

imdatceleste commented 6 years ago

@Rayhane-mamah, French is going into QA tomorrow and will be available at latest next week. We have reduced it to 150hrs for now (v0.9). The problem is that the remaining 50+ hours is "Marcel Proust" and "Voltaire". Our QA-People "refuse" to work on Marcel Proust for now :D ... and Voltaire is more work than we anticipated. In any case, over the next few weeks, we will add 1.0;

V0.9 also will be without normalization/transliteration (original text only). But we are working on the transliterated version as well. I thought it might be helpful to have the "raw text" for now for experimenting purposes (and by the time, I can convince our QA-people re Marcel Proust-text, we can add more :DD)

Rayhane-mamah commented 6 years ago

150hrs is awesome for a start! One can start poking around and testing few things.

Hopefully the crew will continue with the remaining 50hrs :)

Awesome work @imdatsolak, really loving this corpus! By the way, en_UK dataset is just perfect, well done!

Rayhane-mamah / Tacotron-2

Implementation Status and planned TODOs #4

Goal

Model

Feature Prediction Model (Done)

Wavenet vocoder conditioned on Mel-Spectrogram (Done)

Scripts

Extra (optional):

Notes:

Immediate:

Right After:

Optional: